6.1 Basic operations on tibbles
6.1.1 Check Unique Values
n_distinct(x)
This is a faster and more concise equivalent of length(unique(x))
, # of unique values in x
.
distinct(df, ..., .keep_all=FALSE)
select distinct/unique rows, remove duplicate rows.
df
: table...
: variables to use when determining uniqueness. If omitted, will use all variables.If there are multiple rows for a given combination of inputs, only the first row will be preserved.
If two variables are provided, then unique combinations of the two are used as key and the first row of each key is preserved.
.keep_all
: IfTRUE
, keep all variables indf
dplyr::setdiff(x, y)
element that is in x
but not in y
.
x
andy
are supposed to have the same structure, i.e., same columns if for data frames.
6.1.2 Column Names
rename()
replaces an old name with a new one.
magrittr
package provides a series of aliases which can be more pleasant to use when composing chains using the %>%
operator.
Alias | Cmd |
---|---|
set_colnames |
colnames<- |
set_rownames |
rownames<- |
set_names |
names<- |
set_names()
changes all column names at once, and requires a character vector with a name matching each column.
6.1.3 Column Operations
Create New Columns
add_column(df, ..., .after=NULL, .before=NULL)
add new column after the last column.
df
: Data frame to append to;...
: Name-value pairs to insert;.before, .after
: One-based column index or column name where to add the new columns, default: after last column
data %>% mutate(column = .[[2]] - .[[1]])
subset by column positions.
Here the dot notation in
.[[2]]
refers todata
, the variable you pipe intomutate
. Dot is extra useful here because it allows you to usedata
multiple times. We use it twice in this example.Alternatively,
data %>% mutate(column = unlist(pick(2) - pick(1)))
unlist
here transform the list generated frompick(2)-pick(1)
to a vector.
6.1.4 mutate
mutate()
adds new columns at the penultimate (second last) position in the sf
object (the last one is reserved for the geometry):
mutate_at(.tbl, .vars, .funs)
applies a function to given columns:
.tbl
A tbl object;.vars
A list of columns generated byvars()
, a character vector of column names, a numeric vector of column positions, or NULL..fun
A function fun, a quosure style lambda~ fun(.)
or a list of either form.
mtcars <- mtcars %>%
mutate_at(c("hp_wt", "mpg_wt"), log)
# note that `across` use together with `mutate`
mtcars <- mtcars %>%
mutate(across(c("hp_wt", "mpg_wt"), log))
## factor as numeric, except for `ISO_C3`
agg_dummy %>%
mutate_at(vars(-ISO_C3), ~as.numeric(levels(.)[.]))
across(.cols, .fnc, .names=NULL)
apply the same transformation to multiple columns, allowing you to use select()
semantics inside in “data-masking” functions like summarise()
and mutate()
.
.cols
Columns to transform. You can NOT select grouping columns because they are already automatically handled by the verb.can specify start and end columns using
:
..fnc
Functions to apply to each of the selected columns. Possible values are:A function, e.g.
mean
.A purrr-style lambda, e.g.
~ mean(.x, na.rm = TRUE)
A named list of functions or lambdas, e.g.
list(mean = mean, n_miss = ~ sum(is.na(.x))
. Each function is applied to each column, and the output is named by combining the function name and the column name using the glue specification in.names
.
mutate()
and mutate_at()
outputs a new data frame, it does not alter the given data frame. We need to reassign the data frame to be the output of the pipe.
mutate_all(.tbl, .funs, ...)
is equivalent to apply(x, 2, FUN)
. But apply()
works poor when column types are not unanimous. E.g., when a tibble has character
and numeric
columns, apply()
tends to coerce numeric
to character
, and won’t return the result as you expext.
6.1.5 Concatenate rows into a tibble
add_row(.data, ..., .before = NULL, .after = NULL)
add one or more rows of data to an existing data frame,
convenient in the way that you can just specify each column with their values.
especially convenient when you just want to add one row.
...
<dynamic-dots
> Name-value pairs, passed on totibble()
.- Values can be defined only for columns that already exist in
.data
and; - unset columns will get an
NA
value.
- Values can be defined only for columns that already exist in
.before
,.after
specify the position where you want to add the new row/rows.
# add_row ---------------------------------
df <- tibble(x = 1:3, y = 3:1)
df %>% add_row(x = 4, y = 0)
bind_rows(..., .id = NULL)
This is an efficient implementation of the common pattern of do.call(rbind, dfs)
. Match by column names.
The output of bind_rows()
will contain a column if that column appears in any of the inputs. rbind
will throw errors if columns do not match.
...
Data frames to combine.Each argument can either be a data frame, a list that could be a data frame, or a list of data frames.
When row-binding, columns are matched by name, and any missing columns will be filled with NA. So no column is dropped, which is safe.
When column-binding (
bind_cols(df1, df2, ...)
), rows are matched by position, so all data frames must have the same number of rows. To match by value, not position, see mutate-joins.Need to be careful when you use
bind_cols
, make sure rows are in the same order in the tables you want to join.bind_cols
is equivalent tocbind
: match by position. Recommend to useleft_join
, which is safer.bind_rows
is safer thanrbind
:bind_rows
find matched col names
.id
Data frame identifier.- When
.id
is supplied, a new column of identifiers is created to link each row to its original data frame. The labels are taken from the named arguments tobind_rows()
. When a list of data frames is supplied, the labels are taken from the names of the list. If no names are found a numeric sequence is used instead.
- When
res = NULL
for (i in tibbleList)
res = bind_rows(res, i)
# or, equivalently,
bind_rows(tibbleList) # combine all tibbles in the list
6.1.6 Data Subsetting
pull(df, var)
pull out a single variable and return a vector. Similar to $
, but works well with %>%
.
var
: A variable specified as:
- a literal variable name
- a positive integer, giving the position counting from the left
- a negative integer, giving the position counting from the right.
filter(.data, ..., .preserve=FALSE)
chooses rows/cases where conditions are true. Unlike base subsetting with [
, rows where the condition evaluates to NA
are dropped.
...
<data-masking
> Expressions that return a logical value, and are defined in terms of the variables in.data
. If multiple expressions are included, they are combined with the&
operator. Only rows for which all conditions evaluate toTRUE
are kept..preserve
Relevant when the.data
input is grouped. If.preserve = FALSE
(the default), the grouping structure is recalculated based on the resulting data, otherwise the grouping is kept as is.
# Countries with a life expectancy longer than 82 years
world6 = filter(world, lifeExp > 82)
# filter based on vector
filter(diamonds, cut %in% c('Ideal', 'Premium'))
Useful filter functions:
==
,>
,>=
etc&
,|
,!
,xor()
is.na()
between()
,near()
between(x, left, right)
is a shortcut for x >= left & x <= right
.
Ex. x[between(x, -1, 1)]
select()
subsets by column, slice
subsets by rows. select()
selects columns by name or position.
select()
also allows subsetting of a range of columns with the help of the :
operator:
select()
can be used to reorder/drop variables.
select(df, year, var, state) # reverses the columns
select(df, -state) # drop column by name 'state'
If you only know you want var
in the front and don’t care about the order of the rest, you can do [move one variable in the front]
all_of(vars)
is used together with select
for strict selection. If any of the variables in the character vector is missing, an error is thrown.
vars
A vector of character names or numeric locations.
any_of(vars)
doesn’t check for missing variables. It is especially useful with negative selections, when you would like to make sure a variable is removed.
# select columns may or may not exist
the_country %>% select(any_of(c("isoa2", "countryCode")) )
# remove columns may or may not exist
the_country %>% select(-any_of(c("isoa2", "countryCode")) )
This flexibility is useful because it won’t return an error if the variable is not found.
Selection with conditions
Functions work together with select
to choose cols matching certain conditions: starts_with()
, ends_with()
, contains()
. These are selection helpers which match variables according to a given pattern.
> iris %>% select(starts_with("Sepal"))
#> # A tibble: 150 x 2
#> Sepal.Length Sepal.Width
#> <dbl> <dbl>
#> 1 5.1 3.5
#> 2 4.9 3
#> 3 4.7 3.2
#> 4 4.6 3.1
#> # i 146 more rows
> iris %>% select(ends_with("Width"))
#> # A tibble: 150 x 2
#> Sepal.Width Petal.Width
#> <dbl> <dbl>
#> 1 3.5 0.2
#> 2 3 0.2
#> 3 3.2 0.2
#> 4 3.1 0.2
#> # i 146 more rows
6.1.7 Dynamic Selection
A dynamic subset of variables when using select
dynamic_var <- 'state'
df %>% select(year, var, eval(dynamic_var)) # dynamic_var will be parsed as state
# --- dplyr version 0.7+---
multipetal <- function(df, n) {
varname <- paste("petal", n , sep=".")
mutate(df, !!varname := Petal.Width * n)
}
# !! unquote
# using := to dynamically assign/change parameter names
variables in the left hand side
plot_v <- "maize"
plot_data <- plot_data %>%
select(year, plot_v, tmx, pre, rad) %>%
mutate(!!plot_v := log(eval(parse(text = plot_v)) )) %>%
group_by(year) %>%
summarise(across(everything(), ~mean(.x, na.rm = TRUE) ) )
plot_data
Use !!
to unquote a single argument in a function call. !!
takes a single expression, evaluates it, and inlines the result in the AST.
rlang::sym(x)
take a string as input and turn it into symbols
string x
-> expression expr(x)
-> evaluate!!expr(x)
!!!
the behaviour of !!!
is known as “spatting” in Ruby, Go, PHP, and Julia. It is closely related to *args
(star-args) and **kwarg
(star-star-kwargs) in Python, which are sometimes called argument unpacking.
:=
rather than interpreting var
literally (pronounced colon-equals), we want to use the value stored in the variable called var
.
Note the use of :=
(pronounced colon-equals) rather than =
. Unfortunately we need this new operation because R’s grammar does not allow expressions as argument names:
:=
is like a vestigial organ: it’s recognised by R’s parser, but it doesn’t have any code associated with it. It looks like an =
but allows expressions on either side, making it a more flexible alternative to =
. It is used in data.table for similar reasons.
SE-versions of dplyr verbs
https://dplyr.tidyverse.org/reference/se-deprecated.html
dplyr used to offer twin versions of each verb suffixed with an underscore. These versions had standard evaluation (SE) semantics: rather than taking arguments by code, like NSE verbs, they took arguments by value. Their purpose was to make it possible to program with dplyr.
However, dplyr now uses tidy evaluation semantics. NSE verbs still capture their arguments, but you can now unquote parts of these arguments. This offers full programmability with NSE verbs. Thus, the underscored versions are now superfluous.
gcm_name <- "CanESM5"
mergeData %>% filter(get(gcm_name)<35 )
mergeData %>% filter_(sprintf( "%s<35", gcm_name ) ) # SE version
mergeData %>% select_(gcm_name) # SE version
select_(.data, .dots=list() )
.data
A data frame..dots
,...
Pair/values of expressions coercible to lazy objects.
vars <- list(list('cyl', 'mpg'), list('vs', 'disp'))
for (v in vars) {
print(mtcars %>% select_(.dots = v) %>% head)
}
cyl mpg
Mazda RX4 6 21.0
Mazda RX4 Wag 6 21.0
Datsun 710 4 22.8
Hornet 4 Drive 6 21.4
Hornet Sportabout 8 18.7
Valiant 6 18.1
vs disp
Mazda RX4 0 160
Mazda RX4 Wag 0 160
Datsun 710 1 108
Hornet 4 Drive 1 258
Hornet Sportabout 0 360
Valiant 1 225
Let’s make something more practical. For each list of variable arguments, we want to group using the first variable and then summarise the grouped data frame by calculating the mean of the second variable. Here, dynamic argument construction really comes into account, because we programmatically construct the arguments of summarise_()
, e.g. mean_mpg = mean(mpg)
using string concatenation and setNames()
:
summarise_vars <- list(list('cyl', 'mpg'), list('vs', 'disp'))
for (v in summarise_vars) {
group_var <- v[1] # group by this variable
summ <- paste0('mean(', v[2], ')') # construct summary method, e.g. mean(mpg)
summ_name <- paste0('mean_', v[2]) # construct summary variable name, e.g. mean_mpg
print(paste('grouping by', group_var, 'and summarising', summ))
df_summ <- mtcars %>%
group_by_(.dots = group_var) %>%
summarise_(.dots = setNames(summ, summ_name))
print(df_summ)
}
# output
[1] "grouping by cyl and summarising mean(mpg)"
# A tibble: 3 × 2
cyl mean_mpg
1 4 26.66364
2 6 19.74286
3 8 15.10000
[1] "grouping by vs and summarising mean(disp)"
# A tibble: 2 × 2
vs mean_disp
1 0 307.1500
2 1 132.4571
6.1.8 Merge
left_join(x, y, by = NULL, suffix = c(".x", ".y"), ...)
x,y
tbls to joinby
a character vector of variables to join by. IfNULL
, the default,*_join()
will do a natural join, using all variables with common colnames across the two tables.To join by the same variables on
x
andy
, useby = c('ID','year')
, note that it is a character vector, can’t use column names directly.To join by different variables on
x
andy
, use a named vector. For example,by = c("a" = "b")
will matchx.a
toy.b
.suffix
If there are non-joined duplicate variables in x and y, these suffixes will be added to the output to disambiguate them. Should be a character vector of length 2.
merge(x, y, by = NULL, by.x = NULL, by.y = NULL, all = FALSE, ...)
is the base R function for joining two data frames.
all = FALSE
defaults toFALSE
, it performs an inner join, retaining only the rows with matching keys.If
TRUE
, it performs a full outer join, retaining all rows from bothx
andy
.