5.8 Data Frame Operations
Basic operations:
colSums(x)returns sum for each columnrowSums(x)returns sum for each rowsetNames()updates the column names without having to write another replacement function.
magrittr package provides a series of aliases which can be more pleasant to use when composing chains using the %>%operator.
| Alias | Cmd |
|---|---|
set_colnames |
colnames<- |
set_rownames |
rownames<- |
set_names |
names<- |
subset(x, subset, select, drop = FALSE, …) Subsetting Vectors, Matrices And Data Frames
xobject to be subsetted.subsetlogical expression indicating elements or rows to keep: missing values are taken as false.selectcolumns to select from a data frame.
Column/Row-wise Operations
sweep(data, MARGIN, STATS, FUN='-') Return an array obtained from an input array by sweeping out a summary statistic; useful in standardizing data, eg., center or scale columns.
data: matrix/dataframe
MARGIN: 1 row-wise, 2 column-wise..
STATS: a vector with the same length as row [MARGIN=1] / column [MARGIN=2].
FUN: the function to be used to carry out the sweep.
Example:
sweep(z, 2, colMeans(z),'-')substract column mean from each column;
Alternatively, can use mutate_at. You can perform the operation on selected cols without changing the data structure. By contrast, with sweep, you need to subset the relevant cols, apply the operation, and concatenate them back to the remainder of the columns afterwards.
Takeaway: Use sweep if the operation is to all cols; use mutate_at if to selected cols.
center_col <- function(data, cols){
## Mean center columns in a table
# @data: table or data frame
# @col: a vector of selected columns to center
# @return A data frame with the selected columns mean-centered
# (i.e., each value minus its column mean).
data %>%
mutate_at(cols, ~.-mean(., na.rm=TRUE))
}Remove duplicate columns, regardless of column names:
5.8.1 Process NA values
NaN not a number. 0/0 is an example of a calculation that will produce a NaN. NaNs print as NaN, but generally act like NAs. Use is.nan to check if NaN.
Inf is infinite numbers. is.infinite find infinite numbers (Inf, -Inf).
na.omit(x) remove NA values in x;
xcould be vectors, matrices, and data frames;- if
xis a data frame, remove rows with NA values;
- if
Find NA values
complete.cases(x) return a logical vector indicating which cases/rows are complete.
purrr::negate() works similar to base::Negate()
- R is case-sensitive.
(!is.na(x)) %>% colSums() %>% sort() returns a vector of the number of non-NA values per column, column names as vector name.
When you have a long list of columns, vector is hard to read, use as_tibble_row() %>% t() to convert to a tibble column.
# when you have a long list of columns, you can convert to a tibble column for best visualization.
miss_per_col <- data %>%
filter(date > ymd("2013-01-01")) %>%
is.na() %>%
colSums()
miss_per_col <- miss_per_col %>% as_tibble_row() %>% t()
miss_per_col %>% dim()
miss_per_col[1:40, ] %>% t() %>% t()
miss_per_col[41:77, ] %>% t() %>% t()(!is.na(x)) %>% rowSums() %>% sort() calculate the number of non-NA values per row
tidyr::drop_na(x, any_of(vars)) allow you to specify which columns you want to eliminate NA values from; it doesn’t have to be the whole columns;
xmust be a data frame.
which(is.na(data)) returns positions of omitted missing values
A custom function for handling common missing/invalid values:
cast_na <- function(x){
# Remove records with nonvalid values, such as NA, Inf, NaN
# remain x when mask==TRUE
if (is.null(dim(x))){
# `x` is a vector
mask <- !(x %in% c(NA, NaN, Inf, -Inf) )
x <- x[mask]
}
else {
# `x` is a data.frame
mask <- apply(x, 1, function(x) sum(x %in% c(NA, NaN, Inf, -Inf)))
mask <- (mask == 0)
x <- x[mask,]}
return (x)
}Keep rows with least NA’s for duplicated rows
Fill missing values
Forward/Backward filling
tidyr::fill(data, ..., .direction = c("down", "up", "downup", "updown")) Fill missing values in selected columns using the next or previous entry. This is useful in the common output format where values are not repeated, and are only recorded when they change.
...Columns to fill..directionDirection in which to fill missing values. Default: “down”.
Alternatively, you can use zoo::na.locf
fromLast defaults to FALSE, carry forward. If set to TRUE, carry backward.
Replace with specific values
dplyr::na_if(x, y) that replaces any values in x that are equal to y with NA. It is useful if you want to convert an annoying value to NA.
> na_if(1:5, 5:1)
# [1] 1 2 NA 4 5
> y <- c("abc", "def", "", "ghi")
> na_if(y, "")
# [1] "abc" "def" NA "ghi"tidyr::replace_na(data, replace) Replace NAs with specified values
replaceIf
datais a data frame,replacetakes a list of values, with one value for each column that hasNAvalues to be replaced.If
datais a vector,replacetakes a single value. This single value replaces all of theNAvalues in the vector.