6.1 Basic operations on tibbles

6.1.1 Check Unique Values

n_distinct(x) This is a faster and more concise equivalent of length(unique(x)), # of unique values in x.

distinct(df, ..., .keep_all=FALSE) select distinct/unique rows, remove duplicate rows.

df: table
...: variables to use when determining uniqueness. If omitted, will use all variables.

If there are multiple rows for a given combination of inputs, only the first row will be preserved.

If two variables are provided, then unique combinations of the two are used as key and the first row of each key is preserved.
.keep_all: If TRUE, keep all variables in df

# unique locations
data %>% 
    distinct(lat, lon)

dplyr::setdiff(x, y) element that is in x but not in y.

x and y are supposed to have the same structure, i.e., same columns if for data frames.

6.1.2 Column Names

rename() replaces an old name with a new one.

world %>% 
  rename(name = name_long)
# renames the lengthy name_long column to simply name

magrittr package provides a series of aliases which can be more pleasant to use when composing chains using the %>%operator.

Alias	Cmd
`set_colnames`	colnames<-
`set_rownames`	rownames<-
`set_names`	names<-

set_names() changes all column names at once, and requires a character vector with a name matching each column.

6.1.3 Column Operations

Create New Columns

add_column(df, ..., .after=NULL, .before=NULL) add new column at specified position.

Note that add_column() does NOT support data-masking, so you cannot refer to other columns when creating new columns.

→ Use mutate() instead.

df: Data frame to append to;
...: Name-value pairs to insert;
.before, .after: One-based column index or column name where to add the new columns, default: after last column

Q: What is data-masking?

A: Data-masking is a technique that allows you to refer to variables in a data frame directly by their names without using the $ operator or quoting the variable names. See examples below.

# Unmasked programming
mean(mtcars$cyl + mtcars$am)
#> [1] 6.59375

# Referring to columns is an error - Where is the data?
mean(cyl + am)
#> Error:
#> ! object 'cyl' not found

# Data-masking
with(mtcars, mean(cyl + am))
#> [1] 6.59375

Q: Which functions support data-masking?

A: dplyr verbs, such as filter(), mutate(), summarise(), arrange(), and select() support data-masking.

library(tidyverse)
DT <- tibble(a = rnorm(1000), b = rnorm(1000))

DT %>% mutate(id = row_number())
DT %>% add_column(id = row_number())
Error in `n()`:
! Must only be used inside data-masking verbs like `mutate()`, `filter()`, and `group_by()`.
Run `rlang::last_trace()` to see where the error occurred.

DT %>% mutate(diff = b - a)
DT %>% add_column(diff = b - a)
Error: object 'b' not found

data %>% mutate(column = .[[2]] - .[[1]]) subset by column positions.

Here the dot notation in .[[2]] refers to data, the variable you pipe into mutate. Dot is extra useful here because it allows you to use data multiple times. We use it twice in this example.
Alternatively, data %>% mutate(column = unlist(pick(2) - pick(1))) unlist here transform the list generated from pick(2)-pick(1) to a vector.

Order by Columns

arrange() reorders data frame based on specified columns.

# sort mtcars by disp in ascending order
arrange(mtcars, disp)
# sort mtcars by disp in descending order
arrange(mtcars, desc(disp))

arrange by column position

# sort by first column in the data
data %>% arrange(.[1])

6.1.4 `mutate`

mutate() adds new columns at the penultimate (second last) position in the sf object (the last one is reserved for the geometry):

world %>% 
  mutate(pop_dens = pop / area_km2)

To calculate row means of multiple columns,

# common mistake: use mean() directly
> X_demo %>% 
    mutate(ptf_mean = mean(Asset_A, Asset_B))

Error in `mutate()`:
ℹ In argument: `ptf_mean = mean(Asset_A, Asset_B)`.
Caused by error in `mean.default()`:
! 'trim' must be numeric of length one
Run `rlang::last_trace()` to see where the error occurred.

Note that the error happens because mean(Asset_A, Asset_B) is interpreted as mean(c(Asset_A), trim = Asset_B), which is not what we want.

Fixes:

Easiest way: Use rowMeans(X_demo[c("Asset_A", "Asset_B")]) to calculate row means of selected columns.

Returns a vector.
If want to use mutate; this creates a new column; looks better as ptf and assets returns are side by side;
```
X_demo %>% 
  mutate(ptf_mean = rowMeans(select(., Asset_A, Asset_B)))
```

mutate_at(.tbl, .vars, .funs) applies a function to given columns:

.tbl A tbl object;
.vars A list of columns generated by vars(), a character vector of column names, a numeric vector of column positions, or NULL.
.fun A function fun, a quosure style lambda ~ fun(.) or a list of either form.

mtcars <- mtcars %>%
  mutate_at(c("hp_wt", "mpg_wt"), log) 

# note that `across` use together with `mutate`
mtcars <- mtcars %>%
  mutate(across(c("hp_wt", "mpg_wt"), log)) 

# `mutate_at` is more concise, don't need to put everything in `across(.cols, .fnc)`
# factor as numeric, except for `ISO_C3`
agg_dummy %>% 
    mutate_at(vars(-ISO_C3), ~as.numeric(levels(.)[.]))

across(.cols, .fnc, .names=NULL) apply the same transformation to multiple columns, allowing you to use select() semantics inside in “data-masking” functions like summarise() and mutate().

.cols Columns to transform. You can NOT select grouping columns because they are already automatically handled by the verb.

Can specify start and end columns using :.
.fnc Functions to apply to each of the selected columns. Possible values are:
- A function, e.g. mean.
- A purrr-style lambda, e.g. ~ mean(.x, na.rm = TRUE)
```
# divide each col by 100, except for the Date column.
FF_factor %>% mutate_at(vars(-Date), ~./100)
```
- A named list of functions or lambdas, e.g. ⁠list(mean = mean, n_miss = ~ sum(is.na(.x))⁠. Each function is applied to each column, and the output is named by combining the function name and the column name using the glue specification in .names.

mutate() and mutate_at() outputs a new data frame, it does not alter the given data frame. We need to reassign the data frame to be the output of the pipe.

mutate_all(.tbl, .funs, ...) is equivalent to apply(x, 2, FUN). But apply() works poor when column types are not unanimous. E.g., when a tibble has character and numeric columns, apply() tends to coerce numeric to character, and won’t return the result as you expext.

6.1.5 Concatenate rows into a `tibble`

add_row(.data, ..., .before = NULL, .after = NULL) add one or more rows of data to an existing data frame,

convenient in the way that you can just specify each column with their values.
especially convenient when you just want to add one row.
... <dynamic-dots> Name-value pairs, passed on to tibble().
- Values can be defined only for columns that already exist in .data and;
- unset columns will get an NA value.
.before, .after specify the position where you want to add the new row/rows.

# add_row ---------------------------------
df <- tibble(x = 1:3, y = 3:1)
df %>% add_row(x = 4, y = 0)

bind_rows(..., .id = NULL) This is an efficient implementation of the common pattern of do.call(rbind, dfs). Match by column names.

The output of bind_rows() will contain a column if that column appears in any of the inputs. rbind will throw errors if columns do not match.

... Data frames to combine.
- Each argument can either be a data frame, a list that could be a data frame, or a list of data frames.
- When row-binding, columns are matched by name, and any missing columns will be filled with NA. So no column is dropped, which is safe.
- When column-binding (bind_cols(df1, df2, ...)), rows are matched by position, so all data frames must have the same number of rows. To match by value, not position, see mutate-joins.
  
  Need to be careful when you use bind_cols, make sure rows are in the same order in the tables you want to join.
  - bind_cols is equivalent to cbind: match by position. Recommend to use left_join, which is safer.
  - bind_rows is safer than rbind: bind_rows find matched col names
.id Data frame identifier.
- When .id is supplied, a new column of identifiers is created to link each row to its original data frame. The labels are taken from the named arguments to bind_rows(). When a list of data frames is supplied, the labels are taken from the names of the list. If no names are found a numeric sequence is used instead.

res = NULL
for (i in tibbleList)
   res = bind_rows(res, i)

# or, equivalently,
bind_rows(tibbleList) # combine all tibbles in the list

6.1.6 Data Subsetting

pull(df, var) pull out a single variable and return a vector. Similar to $, but works well with %>%.

var: A variable specified as:

a literal variable name
a positive integer, giving the position counting from the left
a negative integer, giving the position counting from the right.

filter(.data, ..., .preserve=FALSE) chooses rows/cases where conditions are true. Unlike base subsetting with [, rows where the condition evaluates to NA are dropped.

... <data-masking> Expressions that return a logical value, and are defined in terms of the variables in .data. If multiple expressions are included, they are combined with the & operator. Only rows for which all conditions evaluate to TRUE are kept.
.preserve Relevant when the .data input is grouped. If .preserve = FALSE (the default), the grouping structure is recalculated based on the resulting data, otherwise the grouping is kept as is.

# Countries with a life expectancy longer than 82 years
world6 = filter(world, lifeExp > 82)
# filter based on vector
filter(diamonds, cut %in% c('Ideal', 'Premium'))

Useful filter functions:

==, >, >= etc
&, |, !, xor()
is.na()
between(), near()

between(x, left, right) is a shortcut for x >= left & x <= right.

Ex. x[between(x, -1, 1)]

select() subsets by column, slice subsets by rows. select() selects columns by name or position.

world1 = dplyr::select(world, name_long, pop)
names(world1)
#> [1] "name_long" "pop"       "geom"

select() also allows subsetting of a range of columns with the help of the : operator:

# all columns between name_long and pop (inclusive)
world2 = dplyr::select(world, name_long:pop)

select() can be used to reorder/drop variables.

select(df, year, var, state)   # reverses the columns
select(df, -state) # drop column by name 'state'

If you only know you want var in the front and don’t care about the order of the rest, you can do [move one variable in the front]

df %>%
  select(var, everything())

all_of(vars) is used together with select for strict selection. If any of the variables in the character vector is missing, an error is thrown.

vars A vector of character names or numeric locations.

any_of(vars) doesn’t check for missing variables. It is especially useful with negative selections, when you would like to make sure a variable is removed.

# select columns may or may not exist
the_country %>% select(any_of(c("isoa2", "countryCode")) )
# remove columns may or may not exist
the_country %>% select(-any_of(c("isoa2", "countryCode")) )

This flexibility is useful because it won’t return an error if the variable is not found.

Selection with conditions

Functions work together with select to choose cols matching certain conditions: starts_with(), ends_with(), contains(). These are selection helpers which match variables according to a given pattern.

> iris %>% select(starts_with("Sepal"))
#> # A tibble: 150 x 2
#>   Sepal.Length Sepal.Width
#>          <dbl>       <dbl>
#> 1          5.1         3.5
#> 2          4.9         3  
#> 3          4.7         3.2
#> 4          4.6         3.1
#> # i 146 more rows

> iris %>% select(ends_with("Width"))
#> # A tibble: 150 x 2
#>   Sepal.Width Petal.Width
#>         <dbl>       <dbl>
#> 1         3.5         0.2
#> 2         3           0.2
#> 3         3.2         0.2
#> 4         3.1         0.2
#> # i 146 more rows

6.1.7 Indirection

Indirection is when you want to get the data-variable from an env-variable instead of directly typing the data-variable’s name. There are two main cases:

When you have the data-variable in a function argument, you need to embrace the argument by surrounding it in doubled braces, like filter(df, {{ var }}).

var_summary <- function(data, var) {
data %>%
  summarise(n = n(), min = min({{ var }}), max = max({{ var }}))
}
mtcars %>% 
  group_by(cyl) %>% 
  var_summary(mpg)

When you have an env-variable that is a character vector, you need to index into the .data pronoun with [[.
```
for (var in names(mtcars)) {
  mtcars %>% count(.data[[var]]) %>% print()
}
```
Note that .data is not a data frame; it’s a special construct, a pronoun, that allows you to access the current variables either directly, with .data$x or indirectly with.data[[var]].

ref: Programming with dplyr: Indirection

6.1.8 Dynamic Selection

A dynamic subset of variables when using select

dynamic_var <- 'state'
df %>% select(year, var, eval(dynamic_var)) # dynamic_var will be parsed as state

# --- dplyr version 0.7+---
multipetal <- function(df, n) {
    varname <- paste("petal", n , sep=".")
    mutate(df, !!varname := Petal.Width * n)
}
# !! unquote
# using := to dynamically assign/change parameter names

variables in the left hand side

plot_v <- "maize"
plot_data <- plot_data %>% 
    select(year, plot_v, tmx, pre, rad) %>% 
    mutate(!!plot_v := log(eval(parse(text = plot_v)) )) %>% 
    group_by(year) %>% 
    summarise(across(everything(), ~mean(.x, na.rm = TRUE) ) )
plot_data

Use !! to unquote a single argument in a function call. !! takes a single expression, evaluates it, and inlines the result in the AST.

x <- expr(-1)
expr(f(!!x, y))
#> f(-1, y)

rlang::sym(x) take a string as input and turn it into symbols

string x -> expression expr(x) -> evaluate!!expr(x)

!!! the behaviour of !!! is known as “spatting” in Ruby, Go, PHP, and Julia. It is closely related to *args (star-args) and **kwarg (star-star-kwargs) in Python, which are sometimes called argument unpacking.

:= rather than interpreting var literally (pronounced colon-equals), we want to use the value stored in the variable called var.

tibble::tibble(!!var := val)
#> # A tibble: 3 x 1
#>       x
#>   <dbl>
#> 1     4
#> 2     3
#> 3     9

Note the use of := (pronounced colon-equals) rather than =. Unfortunately we need this new operation because R’s grammar does not allow expressions as argument names:

:= is like a vestigial organ: it’s recognized by R’s parser, but it doesn’t have any code associated with it. It looks like an = but allows expressions on either side, making it a more flexible alternative to =. It is used in data.table for similar reasons.

ref:

Programming with dplyr: Name injection

SE-versions of dplyr verbs

https://dplyr.tidyverse.org/reference/se-deprecated.html

dplyr used to offer twin versions of each verb suffixed with an underscore. These versions had standard evaluation (SE) semantics: rather than taking arguments by code, like NSE verbs, they took arguments by value. Their purpose was to make it possible to program with dplyr.

However, dplyr now uses tidy evaluation semantics. NSE verbs still capture their arguments, but you can now unquote parts of these arguments. This offers full programmability with NSE verbs. Thus, the underscored versions are now superfluous.

gcm_name <- "CanESM5"
mergeData %>% filter(get(gcm_name)<35 )
mergeData %>% filter_(sprintf( "%s<35", gcm_name ) ) # SE version
mergeData %>% select_(gcm_name) # SE version

select_(.data, .dots=list() )

.data A data frame.
.dots, ... Pair/values of expressions coercible to lazy objects.

vars <- list(list('cyl', 'mpg'), list('vs', 'disp'))
for (v in vars) {
  print(mtcars %>% select_(.dots = v) %>% head)
}
                  cyl  mpg
Mazda RX4           6 21.0
Mazda RX4 Wag       6 21.0
Datsun 710          4 22.8
Hornet 4 Drive      6 21.4
Hornet Sportabout   8 18.7
Valiant             6 18.1
                  vs disp
Mazda RX4          0  160
Mazda RX4 Wag      0  160
Datsun 710         1  108
Hornet 4 Drive     1  258
Hornet Sportabout  0  360
Valiant            1  225

Let’s make something more practical. For each list of variable arguments, we want to group using the first variable and then summarise the grouped data frame by calculating the mean of the second variable. Here, dynamic argument construction really comes into account, because we programmatically construct the arguments of summarise_(), e.g. mean_mpg = mean(mpg) using string concatenation and setNames():

summarise_vars <- list(list('cyl', 'mpg'), list('vs', 'disp'))

for (v in summarise_vars) {
  group_var <- v[1]   # group by this variable
  summ <- paste0('mean(', v[2], ')')  # construct summary method, e.g. mean(mpg)
  summ_name <- paste0('mean_', v[2])  # construct summary variable name, e.g. mean_mpg

  print(paste('grouping by', group_var, 'and summarising', summ))

  df_summ <- mtcars %>%
    group_by_(.dots = group_var) %>%
    summarise_(.dots = setNames(summ, summ_name))

  print(df_summ)
}

# output
[1] "grouping by cyl and summarising mean(mpg)"
# A tibble: 3 × 2
    cyl mean_mpg

1     4 26.66364
2     6 19.74286
3     8 15.10000
[1] "grouping by vs and summarising mean(disp)"
# A tibble: 2 × 2
     vs mean_disp

1     0  307.1500
2     1  132.4571

# To refer to column names that are stored as strings, use the `.data` pronoun:
vars <- c("mass", "height")
cond <- c(80, 150)
starwars %>%
  filter(
    .data[[vars[[1]]]] > cond[[1]],
    .data[[vars[[2]]]] > cond[[2]]
  )

6.1.9 Merge

left_join(x, y, by = NULL, suffix = c(".x", ".y"), ...)

x,y tbls to join
by a character vector of variables to join by. If NULL, the default, *_join() will do a natural join, using all variables with common colnames across the two tables.

To join by the same variables on x and y, use by = c('ID','year'), note that it is a character vector, can’t use column names directly.

To join by different variables on x and y, use a named vector. For example, by = c("a" = "b") will match x.a to y.b.
suffix If there are non-joined duplicate variables in x and y, these suffixes will be added to the output to disambiguate them. Should be a character vector of length 2.

merge(x, y, by = NULL, by.x = NULL, by.y = NULL, all = FALSE, ...) is the base R function for joining two data frames.

all = FALSE defaults to FALSE, it performs an inner join, retaining only the rows with matching keys.

If TRUE, it performs a full outer join, retaining all rows from both x and y.