Chapter 9 Tidyverse

The tidyverse is a collection of packages for data analysis. The tidyverse package itself is designed to make it easy to install and load multiple tidyverse packages in a single step. The following packages are included in the core tidyverse: ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, forcats, lubridate.

The tidyverse also includes many other packages with more specialized usage. They are not loaded automatically with library(tidyverse), so you’ll need to load each one with its own call to library().

Resolve conflict commands

When there are conflicts between packages, R gives precedence to the most recently loaded package by default. This can make it hard to detect conflicts, particularly when introduced by an update to an existing package.

To address this issue, you can either specify the namespace when calling a function (e.g., dplyr::lag()), or use the conflicted package.

lag <- dplyr::lag           # conflict with stats::lag
between <- dplyr::between
select <- dplyr::select
count <- plyr::count        # conflict with dplyr::count
summarise <- dplyr::summarise
rename <- dplyr::rename
TeX <- latex2exp::TeX
margin <- ggplot2::margin
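
To see why the choice of namespace matters, the sketch below calls both versions of lag() on the same toy input. The vector x and the result names are made up for illustration, and the example assumes dplyr is installed.

```r
x <- 1:5

# dplyr::lag() shifts the values of a vector and pads with NA
lagged_vec <- dplyr::lag(x)
lagged_vec

# stats::lag() leaves the values alone and shifts the time index of a ts object
shifted_ts <- stats::lag(ts(x), k = 1)
stats::tsp(shifted_ts)
```

The two functions share a name but do different things, so a silent switch from one to the other can change results without raising an error.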

The conflicted package throws an error whenever you call a function whose name is defined in more than one attached package, and forces you to choose explicitly which one to use.

# Install conflicted from CRAN
# install.packages("conflicted")
# or the development version from GitHub
# devtools::install_github("r-lib/conflicted")

# Use example
library(conflicted)
library(dplyr)
filter(mtcars, cyl == 8)
#> Error:
#> ! [conflicted] filter found in 2 packages.
#> Either pick the one you want with `::`:
#> • dplyr::filter
#> • stats::filter
#> Or declare a preference with `conflicts_prefer()`:
#> • `conflicts_prefer(dplyr::filter)`
#> • `conflicts_prefer(stats::filter)`

Declare a session-wide preference with conflicts_prefer():

# conflicts_prefer() is the quickest way to declare a preference
conflicts_prefer(dplyr::filter)
#> [conflicted] Will prefer dplyr::filter over any other package.

# you can also use `conflict_prefer()`, which provides more precise control (e.g. via its `losers` argument)
conflict_prefer("filter", "dplyr")

filter(mtcars, am & cyl == 8)
#>                 mpg cyl disp  hp drat   wt qsec vs am gear carb
#> Ford Pantera L 15.8   8  351 264 4.22 3.17 14.5  0  1    5    4
#> Maserati Bora  15.0   8  301 335 3.54 3.57 14.6  0  1    5    8

Report any conflicts in the current session with conflict_scout():

conflict_scout()
#> 2 conflicts
#> • `filter()`: dplyr
#> • `lag()`: dplyr and stats

tibble Package

Create a tibble the same way you would create a data.frame; the only difference is that a tibble has no row names.

tibble(x = 1:5, y = 1, z = x ^ 2 + y)

tibble() does much less than data.frame(): it never changes the type of the inputs (e.g. it never converts strings to factors!), it never changes the names of variables, it only recycles inputs of length 1, and it never creates row.names().
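
A minimal sketch of two of these differences, assuming the tibble package is installed. The column names and the object names df and tb_error are made up for illustration.

```r
library(tibble)

# data.frame() repairs non-syntactic names; tibble() keeps them as given
names(data.frame(`my var` = 1))   # "my.var"
names(tibble(`my var` = 1))       # "my var"

# data.frame() recycles the length-2 vector to length 4 ...
df <- data.frame(x = 1:4, y = 1:2)
nrow(df)

# ... whereas tibble() only recycles length-1 inputs and raises an error here
tb_error <- tryCatch(
  tibble(x = 1:4, y = 1:2),
  error = function(e) conditionMessage(e)
)
tb_error
```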

as_tibble() vs tibble():

as_tibble() turns an existing object, such as a data frame or matrix, into a so-called tibble, a data frame with class tbl_df.
This is in contrast with tibble(), which builds a tibble from individual columns.

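If you pass a whole data frame to tibble() as a named argument (e.g. tibble(df = mtcars)), the result is a one-column tibble in which the column contains the data frame. An unnamed data frame is instead spliced into separate columns.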
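
The difference can be checked by counting columns. This is a sketch that assumes tibble 3.0 or later; the small data frame df and the column name d are made up for illustration.

```r
library(tibble)

df <- data.frame(x = 1:3, y = c("a", "b", "c"))

# as_tibble() converts the data frame column by column
ncol(as_tibble(df))      # 2

# a named data frame argument becomes a single data frame column
ncol(tibble(d = df))     # 1

# an unnamed data frame argument is spliced into its columns
ncol(tibble(df))         # 2
```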

Tibble columns are versatile: they can be lists, matrices, tibbles, etc.

tibble(
  a = list(
    c = "three", 
    d = list(4:5)
    )
)
#> # A tibble: 2 × 1
#>   a        
#>   <named list>
#> 1 <chr [1]>   
#> 2 <list [1]>

Print tibbles

tbl_df %>% print(n = Inf) prints all rows. Calling print() explicitly is useful because it lets you set arguments such as n and width.

n Number of rows to print. n = Inf prints all rows.
width Width of the text output to generate. Defaults to NULL, which means the width set in options() is used. width = Inf prints all columns.

Use ?print.tbl_df to show help page.

Alternatively, use tbl_df %>% data.frame() to print the whole table. A data frame is printed with up to 7 significant digits (the digits option), whereas a tibble shows only 3 significant digits by default (the pillar.sigfig option).

print(as_tibble(mtcars), n = 3) first converts the data frame to a tibble, then prints only the first 3 rows.
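
The printing arguments can be combined as in the sketch below, which assumes the tibble package is installed. The names tb and old are made up for illustration, and pillar.sigfig is the pillar option that controls how many significant digits a tibble shows.

```r
library(tibble)

tb <- as_tibble(mtcars)

print(tb, n = 3)                # first 3 rows, as many columns as fit
print(tb, n = 3, width = Inf)   # first 3 rows, all columns
print(tb, n = Inf)              # all rows

# Temporarily show more significant digits, then restore the old setting
old <- options(pillar.sigfig = 6)
print(tb, n = 3)
options(old)
```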

The data.table package has nice table printing settings: you can preview the head and the tail at the same time.

The regular print.tbl shows only the head of the table (the first 10 rows by default).
print.data.table gives you a feel for the data structure without calling head() and tail() separately. See below.
Use ?print.data.table to show the help page, which lists the printing options for a data.table.

library(data.table)
DT <- tibble(a = rnorm(1000), b = rnorm(1000)) %>% as.data.table()
DT %>% 
  print(topn = 4)

                a            b
            <num>        <num>
   1:  1.32712276 -0.778009241
   2:  1.71029980  1.507805184
   3: -0.41397245  0.816250538
   4:  0.51490112  0.499563882
  ---                         
 997: -0.06823802  0.541311448
 998: -1.48304667 -2.293768686
 999:  0.94170427  1.363411322
1000: -0.45273759 -0.006411937

print.data.table(x, topn = 5)

Options	Function
`topn = 5`	Defaults to 5. The number of rows printed from the beginning and from the end of the table when it is truncated.
`nrows = 100`	Defaults to 100. Tables with more rows than this are truncated to the first and last `topn` rows. It acts as an upper bound and is rarely changed in practice. Roughly analogous to `print.tbl(n = 100)`.
`class = TRUE`	Whether to print column types. Setting it to `TRUE` is recommended.
`row.names = TRUE`	Whether to print row names.

The data.table package is used in fields such as finance and genomics, and is especially useful when working with large data sets (for example, 1GB to 100GB in RAM).

data.table Cheatsheet: https://www.datacamp.com/cheat-sheet/the-datatable-r-package-cheat-sheet

Data Frame and Vector Conversion

reframe() can return an arbitrary number of rows per group, whereas summarise() reduces each group to a single row and mutate() returns the same number of rows as the input.

reframe() always returns an ungrouped data frame.

reframe() is theoretically connected to two functions in tibble, tibble::enframe() and tibble::deframe():

enframe(): vector → data frame
deframe(): data frame → vector
reframe(): data frame → data frame, with arbitrary number of rows per group.
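
The row counts of the three dplyr verbs can be compared on a toy data set. This is a sketch that assumes dplyr 1.1.0 or later (the release that introduced reframe()); the data and object names are made up for illustration.

```r
library(dplyr)

# Hypothetical toy data: group "a" has 2 rows, group "b" has 3
df <- tibble::tibble(g = c("a", "a", "b", "b", "b"), x = c(1, 2, 3, 4, 5))

one_per_group  <- df |> group_by(g) |> summarise(mean_x = mean(x))
same_as_input  <- df |> group_by(g) |> mutate(mean_x = mean(x))
many_per_group <- df |> group_by(g) |> reframe(q = quantile(x, c(0.25, 0.75)))

nrow(one_per_group)             # 2: one row per group
nrow(same_as_input)             # 5: same as the input
nrow(many_per_group)            # 4: two quantiles for each of the two groups
is_grouped_df(many_per_group)   # FALSE: reframe() drops the grouping
```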

enframe() and deframe() convert vectors to tibbles and vice versa.

Example Usage:

enframe(1:3)
#> # A tibble: 3 × 2
#>    name value
#>   <int> <int>
#> 1     1     1
#> 2     2     2
#> 3     3     3
enframe(c(a = 5, b = 7))
#> # A tibble: 2 × 2
#>   name  value
#>   <chr> <dbl>
#> 1 a         5
#> 2 b         7
enframe(list(one = 1, two = 2:3, three = 4:6))
#> # A tibble: 3 × 2
#>   name  value    
#>   <chr> <list>   
#> 1 one   <dbl [1]>
#> 2 two   <int [2]>
#> 3 three <int [3]>
deframe(enframe(3:1))
#> 1 2 3 
#> 3 2 1 
deframe(tibble(a = 1:3))
#> [1] 1 2 3
deframe(tibble(a = as.list(1:3)))
#> [[1]]
#> [1] 1
#> 
#> [[2]]
#> [1] 2
#> 
#> [[3]]
#> [1] 3

Concatenate list elements into a table

Use magrittr’s pipe operator
```
myList %>% do.call("rbind", .)
```
But using the new base/native pipe (|>) leads to errors performing the same operation:
```
myList |> do.call("rbind", .)
#> Error in do.call(myList, "rbind", .) : 
#>  second argument must be a list
```
The error happens because |> always inserts the left-hand side as the first argument and does not support the dot placeholder. A workaround is to use named arguments, so that myList is matched to the remaining args parameter:
```
myList |> do.call(what = "rbind")
```

Use bind_rows from dplyr or rbindlist from data.table:

library(dplyr)
myList |> bind_rows()

library(data.table)
myList |> rbindlist()

Use reduce from purrr
```
library(purrr)
myList |> reduce(rbind)
```
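
The approaches above can be compared on a small list. In this sketch myList is a made-up list of two data frames, the result names are hypothetical, and dplyr is assumed to be installed. The `_` placeholder line requires R 4.2 or later and only works when passed to a named argument.

```r
# Hypothetical example list: two small data frames with the same columns
myList <- list(
  data.frame(id = 1:2, value = c("a", "b")),
  data.frame(id = 3L, value = "c")
)

# Naming `what` lets the piped list fall through to `args`
by_name <- myList |> do.call(what = "rbind")

# R >= 4.2 only: the `_` placeholder must be passed to a named argument
by_placeholder <- myList |> do.call("rbind", args = _)

# dplyr alternative
by_bind_rows <- myList |> dplyr::bind_rows()

nrow(by_name)                        # 3
identical(by_name, by_placeholder)   # TRUE
```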

One row/column tibble

as_tibble_row(x) and as_tibble_col(x, column_name="value") convert a vector to a one-row or a one-column tibble, respectively.

as_tibble(data, rownames="new_col_name") converts an object such as a data frame to a tibble. It is flexible about the format of the input data, which can be one of a range of classes.

data A data frame, list, matrix, or other object that could reasonably be coerced to a tibble.
rownames The name of a new column. Existing row names are transferred into this column. If NULL (the default), the row names are removed.

rownames_to_column(.data, var = "new colname") converts the row names to a column, and column_to_rownames(.data, var = "col to use as rownames") uses one column as the row names.

.data needs to be a data frame; these functions are strict about the input data type.
var
- in rownames_to_column: new column name for original rownames in the data.frame, or
- in column_to_rownames: the column to use as row names; the result is converted from a tibble to a data frame.
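
A short round trip through these helpers, assuming the tibble package is installed. The named vector x, the column name "model", and the object names are made up for illustration.

```r
library(tibble)

x <- c(a = 1, b = 2, c = 3)

one_row <- as_tibble_row(x)                          # 1 row, columns a, b, c
one_col <- as_tibble_col(x, column_name = "value")   # 3 rows, column "value"

# Move the row names of mtcars into a column while converting to a tibble
cars_tbl <- as_tibble(mtcars, rownames = "model")

# Back to a data frame that uses the "model" column as row names
cars_df <- column_to_rownames(cars_tbl, var = "model")

# And once more from row names to a regular column
cars_back <- rownames_to_column(cars_df, var = "model")

dim(one_row)
dim(one_col)
dim(cars_tbl)
```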