Home Summarizing multiple columns with dplyr?

# Summarizing multiple columns with dplyr?

Daniel
1#
Daniel Published in 2014-02-08 10:27:10Z
 This question already has an answer here: Aggregate / summarize multiple variables per group (i.e. sum, mean, etc) 4 answers Can dplyr summarise over several variables without listing each one? [duplicate] 2 answers I'm struggling a bit with the dplyr-syntax. I have a data frame with different variables and one grouping variable. Now I want to calculate the mean for each column within each group, using dplyr in R. df <- data.frame(a=sample(1:5, 10, replace=T), b=sample(1:5, 10, replace=T), c=sample(1:5, 10, replace=T), d=sample(1:5, 10, replace=T), grp=sample(1:3, 10, replace=T)) df %>% group_by(grp) %>% summarise(mean(a))  This gives me the mean for column "a" for each group indicated by "grp". My question is: is it possible to get the means for each column within each group at once? Or do I have to repeat df %>% group_by(grp) %>% summarise(mean(a)) for each column? What I would like to have is something like df %>% group_by(grp) %>% summarise(mean(a:d)) # "mean(a:d)" does not work 
Artem Klevtsov
2#
Artem Klevtsov Reply to 2017-02-24 06:43:07Z
 dplyr 0.2 contains summarise_each for this aim: df %>% group_by(grp) %>% summarise_each(funs(mean)) #> Source: local data frame [3 x 5] #> #> grp a b c d #> (int) (dbl) (dbl) (dbl) (dbl) #> 1 1 3.000000 2.666667 2.666667 3.333333 #> 2 2 2.666667 2.666667 2.500000 2.833333 #> 3 3 4.000000 1.000000 4.000000 3.000000  Alternatively, the purrr package provides the same functionality: df %>% slice_rows("grp") %>% dmap(mean) #> Source: local data frame [3 x 5] #> #> grp a b c d #> (int) (dbl) (dbl) (dbl) (dbl) #> 1 1 3.000000 2.666667 2.666667 3.333333 #> 2 2 2.666667 2.666667 2.500000 2.833333 #> 3 3 4.000000 1.000000 4.000000 3.000000  Also don't forget about data.table: setDT(df)[, lapply(.SD, mean), by = grp] #> grp a b c d #> 1: 3 3.714286 3.714286 2.428571 2.428571 #> 2: 1 1.000000 4.000000 5.000000 2.000000 #> 3: 2 4.000000 4.500000 3.000000 3.000000 
Keiku
3#
 We can summarize by using summarize_at, summarize_all and summarize_if on dplyr 0.7.4. We can set the multiple columns and functions by using vars and funs argument as below code. The left-hand side of funs formula is assigned to suffix of summarized vars. In the dplyr 0.7.4, summarise_each(and mutate_each) is already deprecated, so we cannot use these functions. options(scipen = 100, dplyr.width = Inf, dplyr.print_max = Inf) library(dplyr) packageVersion("dplyr") # [1] ‘0.7.4’ set.seed(123) df <- data_frame( a = sample(1:5, 10, replace=T), b = sample(1:5, 10, replace=T), c = sample(1:5, 10, replace=T), d = sample(1:5, 10, replace=T), grp = as.character(sample(1:3, 10, replace=T)) # For convenience, specify character type ) df %>% group_by(grp) %>% summarise_each(.vars = letters[1:4], .funs = c(mean="mean")) # summarise_each() is deprecated. # Use summarise_all(), summarise_at() or summarise_if() instead. # To map funs over a selection of variables, use summarise_at() # Error: Strings must match column names. Unknown columns: mean  You should change to the following code. The following codes all have the same result. # summarise_at df %>% group_by(grp) %>% summarise_at(.vars = letters[1:4], .funs = c(mean="mean")) df %>% group_by(grp) %>% summarise_at(.vars = names(.)[1:4], .funs = c(mean="mean")) df %>% group_by(grp) %>% summarise_at(.vars = vars(a,b,c,d), .funs = c(mean="mean")) # summarise_all df %>% group_by(grp) %>% summarise_all(.funs = c(mean="mean")) # summarise_if df %>% group_by(grp) %>% summarise_if(.predicate = function(x) is.numeric(x), .funs = funs(mean="mean")) # A tibble: 3 x 5 # grp a_mean b_mean c_mean d_mean # # 1 1 2.80 3.00 3.6 3.00 # 2 2 4.25 2.75 4.0 3.75 # 3 3 3.00 5.00 1.0 2.00  You can also have multiple functions. df %>% group_by(grp) %>% summarise_at(.vars = letters[1:2], .funs = c(Mean="mean", Sd="sd")) # A tibble: 3 x 5 # grp a_Mean b_Mean a_Sd b_Sd # # 1 1 2.80 3.00 1.4832397 1.870829 # 2 2 4.25 2.75 0.9574271 1.258306 # 3 3 3.00 5.00 NA NA 
David Arenburg
4#
David Arenburg Reply to 2016-01-21 11:02:18Z
 You can simply pass more arguments to summarise: df %>% group_by(grp) %>% summarise(mean(a), mean(b), mean(c), mean(d))  Source: local data frame [3 x 5]  grp mean(a) mean(b) mean(c) mean(d) 1 1 2.500000 3.500000 2.000000 3.0 2 2 3.800000 3.200000 3.200000 2.8 3 3 3.666667 3.333333 2.333333 3.0 
Steven Matz
5#
Steven Matz Reply to 2015-12-01 23:06:25Z
 For completeness: with dplyr v0.2 ddply with colwise will also do this: > ddply(df, .(grp), colwise(mean)) grp a b c d 1 1 4.333333 4.00 1.000000 2.000000 2 2 2.000000 2.75 2.750000 2.750000 3 3 3.000000 4.00 4.333333 3.666667  but it is slower, at least in this case: > microbenchmark(ddply(df, .(grp), colwise(mean)), df %>% group_by(grp) %>% summarise_each(funs(mean))) Unit: milliseconds expr min lq mean ddply(df, .(grp), colwise(mean)) 3.278002 3.331744 3.533835 df %>% group_by(grp) %>% summarise_each(funs(mean)) 1.001789 1.031528 1.109337 median uq max neval 3.353633 3.378089 7.592209 100 1.121954 1.133428 2.292216 100 
Matt Dancho
6#
Matt Dancho Reply to 2017-03-06 21:37:46Z
 All the examples are great, but I figure I'd add one more to show how working in a "tidy" format simplifies things. Right now the data frame is in "wide" format meaning the variables "a" through "d" are represented in columns. To get to a "tidy" (or long) format, you can use gather() from the tidyr package which shifts the variables in columns "a" through "d" into rows. Then you use the group_by() and summarize() functions to get the mean of each group. If you want to present the data in a wide format, just tack on an additional call to the spread() function. library(tidyverse) # Create reproducible df set.seed(101) df <- tibble(a = sample(1:5, 10, replace=T), b = sample(1:5, 10, replace=T), c = sample(1:5, 10, replace=T), d = sample(1:5, 10, replace=T), grp = sample(1:3, 10, replace=T)) # Convert to tidy format using gather df %>% gather(key = variable, value = value, a:d) %>% group_by(grp, variable) %>% summarize(mean = mean(value)) %>% spread(variable, mean) #> Source: local data frame [3 x 5] #> Groups: grp [3] #> #> grp a b c d #> * #> 1 1 3.000000 3.5 3.250000 3.250000 #> 2 2 1.666667 4.0 4.666667 2.666667 #> 3 3 3.333333 3.0 2.333333 2.333333 
 You need to login account before you can post.
Processed in 0.408877 second(s) , Gzip On .