5.5 *apply() Family
tapply(X, INDEX, FUN, ...) breaks X into groups based on INDEX, applies FUN to each group (subset), and returns the results in a convenient form.
It returns a vector when FUN returns a single atomic value, with length determined by the number of groups defined by INDEX; if FUN returns more than one value, tapply returns a list.
tapply(X, INDEX, FUN = NULL)
Arguments:
- X: an object, usually a vector
- INDEX: a list of one or more factors, each of the same length as X. The elements are coerced to factors by as.factor.
- FUN: the function applied to each group of X

apply(x, MARGIN, FUN)
- x: an array or matrix
- MARGIN: a value of 1, 2, or both, defining where to apply the function:
  - MARGIN=1: the manipulation is performed per row
  - MARGIN=2: the manipulation is performed per column
  - MARGIN=c(1,2): the manipulation is performed on rows and columns, i.e. on each element

lapply() takes a list, vector or data frame as input and returns a list.
For a data frame df, lapply(df, FUN) applies FUN to each column, much like apply(df, MARGIN=2, FUN); the difference is that lapply() always returns a list and does not first coerce df to a matrix.
sapply() takes a list, vector or data frame as input and returns a vector or matrix. It does the same job as lapply(), which it calls internally, and then simplifies the result to a vector or matrix where possible. Its advantage is therefore a more convenient output rather than speed; when the results cannot be simplified, it returns a list.
sapply(X, FUN, ...)
- X: a vector (atomic or list) or another object
- FUN: the function to be applied to each element of X. In the case of functions like +, %*%, the function name must be backquoted or quoted.
- ...: optional arguments to FUN.
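The differences between apply(), lapply() and sapply() described above can be seen in a small sketch. The matrix m and the data frame df are made-up illustration data, not objects from the surrounding text.

```r
# Hypothetical illustration data
m  <- matrix(1:6, nrow = 2)                 # 2 rows, 3 columns
df <- data.frame(a = 1:3, b = c(2, 4, 6))

# apply(): MARGIN selects rows (1) or columns (2)
row_sums <- apply(m, 1, sum)
col_sums <- apply(m, 2, sum)

# lapply(): one result per column, always returned as a list
means_list <- lapply(df, mean)

# sapply(): same computation, simplified to a named numeric vector
means_vec <- sapply(df, mean)

print(row_sums)
print(col_sums)
str(means_list)
print(means_vec)
```

Here means_list and means_vec hold the same numbers; only the container differs.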
A summary of the differences among the *apply() functions.
| Function | Arguments | Objective | Input | Output |
|---|---|---|---|---|
| apply | apply(x, MARGIN, FUN) | Apply a function to the rows or columns or both, MARGIN=1 on rows, MARGIN=2 on cols | Data frame or matrix | vector, list, array |
| lapply | lapply(X, FUN, ...) | Apply a function to all the elements of the input | List, vector or data frame | list |
| sapply | sapply(X, FUN, ...) | Apply a function to all the elements of the input | List, vector or data frame | vector or matrix |
| mapply | mapply(FUN, ..., MoreArgs = NULL, SIMPLIFY = TRUE) | mapply is a multivariate version of sapply. mapply applies FUN to the first elements of each ... argument, the second elements, the third elements, and so on. Arguments are recycled if necessary. | Multiple list or multiple vector arguments | vector or matrix (list if the result cannot be simplified) |
| tapply | tapply(X, INDEX, FUN = NULL) | Applies a function or operation on subsets of the vector broken down by a given factor variable. Similar to group_by and summarize. | Vector, grouped by one or more factors | vector or list |
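As a sketch of the tapply row, the following groups a numeric vector by a factor. The vectors heights and group are invented for illustration only.

```r
# Hypothetical data: four measurements in two groups
heights <- c(150, 160, 170, 180)
group   <- factor(c("a", "b", "a", "b"))

# FUN returns one value per group, so the result is a named vector
group_means <- tapply(heights, group, mean)

# FUN returns two values per group, so the result is list-like
group_ranges <- tapply(heights, group, range)

print(group_means)
print(group_ranges)
```

The result of the first call is indexed by the factor levels, so group_means["a"] retrieves the mean of group a.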
A useful application is to combine lapply() or sapply() with subsetting:
x <- list(1:3, 4:9, 10:12)
sapply(x, "[", 2)
#> [1] 2 5 11
# equivalent to
sapply(x, function(x) x[2])
#> [1] 2 5 11

mapply(FUN, ..., MoreArgs = NULL, SIMPLIFY = TRUE, USE.NAMES = TRUE)
- FUN: the function to apply.
- ...: arguments to vectorize over (vectors or lists of strictly positive length, or all of zero length).
- MoreArgs: a list of other arguments to FUN that do not need to be vectorized.
- SIMPLIFY: logical or character string; attempt to reduce the result to a vector, matrix or higher dimensional array. It behaves like the simplify argument of sapply (which, for sapply, must be named and not abbreviated). The default value, TRUE, returns a vector or matrix if appropriate, whereas if SIMPLIFY = "array" the result may be an array of "rank" (=length(dim(.))) one higher than the result of FUN(X[[i]]).
- USE.NAMES: logical; use names if the first ... argument has names, or if it is a character vector, use that character vector as the names.
mapply calls FUN for the values of ... (re-cycled to the length of the longest, unless any have length zero), followed by the arguments given in MoreArgs. The arguments in the call will be named if ... or MoreArgs are named.
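A minimal sketch of how SIMPLIFY and USE.NAMES affect the result; the anonymous functions and input vectors are made up for illustration.

```r
# SIMPLIFY = TRUE (default): one value per call collapses to a vector
sums <- mapply(function(x, y) x + y, 1:3, 4:6)

# SIMPLIFY = FALSE: the same results are kept as a list
sums_list <- mapply(function(x, y) x + y, 1:3, 4:6, SIMPLIFY = FALSE)

# USE.NAMES = TRUE (default): names of the first ... argument are reused
labelled <- mapply(function(x, n) paste(rep(x, n), collapse = ""),
                   c(a = "A", b = "B"), 2:1)

print(sums)
str(sums_list)
print(labelled)
```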
mapply(rep, times = 1:4, x = 4:1)
# [[1]]
# [1] 4
#
# [[2]]
# [1] 3 3
#
# [[3]]
# [1] 2 2 2
#
# [[4]]
# [1] 1 1 1 1
mapply(rep, times = 1:4, MoreArgs = list(x = 42))
# [[1]]
# [1] 42
#
# [[2]]
# [1] 42 42
#
# [[3]]
# [1] 42 42 42
#
# [[4]]
# [1] 42 42 42 42

clusterMap(cl = NULL, fun, ..., MoreArgs = NULL, RECYCLE = TRUE, SIMPLIFY = FALSE, USE.NAMES = TRUE, .scheduling = c("static", "dynamic")) is the parallel version of mapply.
clusterMap is very useful for iterating over more than one variable. Arguments that are not iterated over (everything except int1 and int2 in the code that follows) are passed through the MoreArgs option. clusterMap returns the results in a list by default; they can be combined using do.call('rbind', result).
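A self-contained sketch of this pattern, assuming a local two-worker cluster. The function scale_pair and the argument named factor are hypothetical stand-ins, chosen only so the example runs without external data.

```r
library(parallel)

# Hypothetical worker function: int1 and int2 vary per call, factor is fixed
scale_pair <- function(int1, int2, factor) {
  data.frame(int1 = int1, int2 = int2, value = (int1 + int2) * factor)
}

cl <- makeCluster(2)   # two local worker processes
result <- clusterMap(cl, scale_pair,
                     int1 = 1:3, int2 = c(1, 0, 0),
                     MoreArgs = list(factor = 10))
stopCluster(cl)

# clusterMap returns a list (SIMPLIFY = FALSE); bind it into one data frame
df <- do.call("rbind", result)
print(df)
```

The snippet that follows applies the same pattern to function1 and the data frames df1 and df2.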
library(parallel)
cluster <- makeCluster(detectCores())
clusterEvalQ(cluster, library(xts))
# function1, df1 and df2 are assumed to be defined elsewhere
result <- clusterMap(cluster, function1, int1=1:8, int2=c(1, rep(0, 7)),
                     MoreArgs=list(df1=df1, df2=df2, char1="someString"))
df <- do.call('rbind', result)
stopCluster(cluster)

parApply(cl = NULL, X, MARGIN, FUN, ..., chunk.size = NULL)
parApply is the parallel version of apply, while clusterApply applies a function to a list of arguments. No vectorization is involved.
clusterApply(cl = NULL, x, fun, ...)
clusterApply calls fun on the first node with arguments x[[1]] and ..., on the second node with x[[2]] and ..., and so on, recycling nodes as needed. clusterApply only vectorizes over x.
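A minimal sketch of both calls on a local two-worker cluster. The anonymous function and the matrix m are invented for illustration.

```r
library(parallel)

cl <- makeCluster(2)   # two local worker processes

# clusterApply: x = 1:4 is split across the nodes; power = 2 is passed
# unchanged to every call through ...
powers <- clusterApply(cl, 1:4, function(i, power) i^power, power = 2)

# parApply: parallel counterpart of apply(m, 1, sum)
m <- matrix(1:6, nrow = 2)
row_sums <- parApply(cl, m, 1, sum)

stopCluster(cl)

print(unlist(powers))
print(row_sums)
```

With four elements and two nodes, the nodes are reused for the third and fourth elements.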
5.5.1 Parallel version of apply function
parallel::parLapply performs the calculations in parallel, possibly on several nodes.
parLapply(cl = NULL, X, fun, ...)
- cl: a cluster object
- X: a vector (atomic or list) for parLapply and parSapply, an array for parApply.
- fun: function or character string naming a function.
- ...: additional arguments to pass to fun: beware of partial matching to earlier arguments, e.g. parSapply(cl, var1, FUN=myfunction, var2=var2, var3=var3, var4=var4)
Note that:
- Several types of communication can be used, including PSOCK and MPI.
- For parLapply, the worker processes must be prepared with any required packages using clusterEvalQ or clusterCall.
- For parLapply, large data sets can be exported to workers with clusterExport.
- Best practice: test interactively with lapply serially, and with mclapply or parLapply (PSOCK) in parallel.
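The notes above can be sketched as follows, assuming a local two-worker PSOCK cluster. The variable offset and the function add_offset are hypothetical; stats stands in for whatever package the workers actually need.

```r
library(parallel)

offset <- 10                              # hypothetical data on the master
add_offset <- function(v) v + offset      # relies on offset being visible

# Step 1: test serially with lapply
serial <- lapply(1:3, add_offset)

# Step 2: run the same call in parallel
cl <- makeCluster(2)                      # PSOCK is the default type
clusterEvalQ(cl, library(stats))          # load required packages on workers
clusterExport(cl, "offset", envir = environment())  # copy offset to workers
parallel_res <- parLapply(cl, 1:3, add_offset)
stopCluster(cl)

print(identical(serial, parallel_res))
```

Without the clusterExport call, each worker would look for offset in its own global environment and fail to find it when add_offset is defined at top level.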
## Sample codes for using cluster
library(parallel)
# initialize a cluster
cl <- makeCluster(4, type='PSOCK')
clusterEvalQ(cl, library(raster))
cldCON <- parSapply(cl, as.list(cldNC), simulation_mask,
tgt=con_shape[shape_dict[con_name],])
# stop a cluster
stopCluster(cl)

mclapply(X, FUN, ...) is a parallelized version of lapply; it returns a list of the same length as X, each element of which is the result of applying FUN to the corresponding element of X.
parallel::mcmapply(FUN, ..., MoreArgs = NULL, SIMPLIFY = TRUE, USE.NAMES = TRUE, mc.preschedule = TRUE, mc.set.seed = TRUE, mc.silent = FALSE, mc.cores = getOption("mc.cores", 2L), mc.cleanup = TRUE, affinity.list = NULL) is a parallelized version of mapply.
mc*apply takes an argument, mc.cores. Its default is getOption("mc.cores", 2L), so mc*apply uses two cores unless the mc.cores option is set; pass mc.cores = detectCores() to use all cores available.
- If you don't want to use all cores (either because you're on a shared system or you just want to save processing power for other purposes), you can set this to a value lower than the number of cores you have.
- Setting it to 1 disables parallel processing, and setting it higher than the number of available cores is unlikely to help, because the extra processes compete for the same cores.
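A short sketch of setting mc.cores explicitly. The choice of two cores is arbitrary, and the Windows fallback reflects the fact that forking is unavailable there, so mc.cores must be 1.

```r
library(parallel)

# Forking is not available on Windows, where mc.cores must be 1
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# Parallel counterpart of lapply(1:4, function(i) i^2)
squares <- mclapply(1:4, function(i) i^2, mc.cores = n_cores)

# Parallel counterpart of mapply(function(a, b) a + b, 1:3, 4:6)
sums <- mcmapply(function(a, b) a + b, 1:3, 4:6, mc.cores = n_cores)

print(unlist(squares))
print(sums)
```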
stackApply(r, indices=num_years, fun=mean, na.rm=T, ...)
Apply a function on subsets of a RasterStack or RasterBrick. The layers to be combined are indicated with the vector indices. The function used should return a single value, and the number of layers in the output Raster* equals the number of unique values in indices. For example, if you have a RasterStack with 6 layers, you can use indices=c(1,1,1,2,2,2) and fun=sum. This will return a RasterBrick with two layers.
- x: Raster* object
- indices: integer. Vector of length nlayers(x)
library(raster)
# r is assumed to be an existing Raster* object with 55 * 12 layers
num_years <- rep(1:55, each=12) # 55 years, 12 layers per year
r_year <- stackApply(r, indices=num_years, fun=mean, na.rm=TRUE)
# A parallel version
beginCluster(4)
r_year <- clusterR(r, stackApply,
                   args=list(indices=num_years, fun=mean, na.rm=TRUE))
endCluster()

snow package for parallel computing
snow::clusterMap(cl, fun, ..., MoreArgs=NULL, RECYCLE=TRUE)
clusterMap is a multi-argument version of clusterApply, analogous to mapply. If RECYCLE is true, shorter arguments are recycled; otherwise, the result length is the length of the shortest argument. Cluster nodes are recycled if the length of the result is greater than the number of nodes.
clusterApply(cl, x, fun, ...) calls fun on the first cluster node with arguments x[[1]] and ..., on the second node with x[[2]] and ..., and so on. If the length of x is greater than the number of nodes in the cluster, then cluster nodes are recycled. A list of the results is returned; the length of the result list will equal the length of x.
clusterCall(cl, fun, ...) calls a function fun with identical arguments ... on each node in the cluster cl and returns a list of the results.
clusterEvalQ(cl, expr) evaluates a literal expression on each cluster node. It is a cluster version of evalq, and is a convenience function defined in terms of clusterCall.
clusterExport(cl, list, envir = .GlobalEnv) assigns the values on the master of the variables named in list to variables of the same names in the global environments of each node. The environment on the master from which variables are exported defaults to the global environment.
library(snow)
cl <- makeSOCKcluster(c("localhost","localhost"))
clusterApply(cl, 1:2, get("+"), 3)
clusterEvalQ(cl, library(boot)) # load packages needed
x <- 1
clusterExport(cl, "x") # copy x to the global environment of each node
clusterCall(cl, function(y) x + y, 2)
stopCluster(cl) # shut down the worker processes