Combining sapply() with group_by()


Hello everyone, I am currently trying to combine the function sapply() with group_by(). Basically, I just want to compute simple descriptive summary statistics (mean, median, min, max, etc.) for each column/variable in the data. Before applying it in R Shiny, I did it first in the R console, and it looks like this:

But now I want to separate it by the group Product (A and B), so that it looks something like this:
(These are just an example with V1 and V2 as the columns/variables, grouped by Placebo and Extract)

In SAS, with PROC MEANS, it looks like this:

PROC MEANS N MIN Q1 MEDIAN Q3 MEAN STD MAXDEC=3 DATA=bref.neudaten;
  VAR Weight_V1 Weight_V2 Weight_V3;
  BY Product;
RUN;

These are both examples, one in Excel and one in SAS. Can anyone help me implement this in R? Thank you!

Create a reprex (see the FAQ: How to do a minimal reproducible example (reprex) for beginners) and it will be much easier to help you.

Sorry for the late reply and for the confusing request. Here is the reprex and hopefully it helps:

library(dplyr)

DF <- data.frame(Product=c("A","A","B","B"), 
                 Weight_V1=c(55,62,51,44), 
                 Weight_V2=c(65,67,71,82), 
                 Weight_V3=c(24,53,53,46), 
                 Body_Fat=c(54,23,42,12))

#Summary for all variables without group_by(), it works fine
sapply(DF[-1], function(x) summary(x))

#Summary for all variables with Product as the CLASS variable, but I am not quite sure how
sapply(DF[-1] %>% group_by("Product"), function(x) summary(x))

I want a summary of all the variables, split by Product A and B, just like what I did in SAS above. Thanks for your help!

This is the kind of thing tapply is more suited to

tapply(DF[-1], DF$Product, summary)

$A
Weight_V1       Weight_V2      Weight_V3        Body_Fat    
Min.   :55.00   Min.   :65.0   Min.   :24.00   Min.   :23.00  
1st Qu.:56.75   1st Qu.:65.5   1st Qu.:31.25   1st Qu.:30.75  
Median :58.50   Median :66.0   Median :38.50   Median :38.50  
Mean   :58.50   Mean   :66.0   Mean   :38.50   Mean   :38.50  
3rd Qu.:60.25   3rd Qu.:66.5   3rd Qu.:45.75   3rd Qu.:46.25  
Max.   :62.00   Max.   :67.0   Max.   :53.00   Max.   :54.00  

$B
Weight_V1       Weight_V2       Weight_V3        Body_Fat   
Min.   :44.00   Min.   :71.00   Min.   :46.00   Min.   :12.0  
1st Qu.:45.75   1st Qu.:73.75   1st Qu.:47.75   1st Qu.:19.5  
Median :47.50   Median :76.50   Median :49.50   Median :27.0  
Mean   :47.50   Mean   :76.50   Mean   :49.50   Mean   :27.0  
3rd Qu.:49.25   3rd Qu.:79.25   3rd Qu.:51.25   3rd Qu.:34.5  
Max.   :51.00   Max.   :82.00   Max.   :53.00   Max.   :42.0

If you're married to using group_by / tidyverse, it will work better with summarize and could be done like this:

DF %>% 
  group_by(Product) %>% 
  group_map(~summarize(.x, across(everything(), summary)))

[[1]]
# A tibble: 6 x 4
  Weight_V1 Weight_V2 Weight_V3 Body_Fat
  <table>   <table>   <table>   <table> 
1 55.00     65.0      24.00     23.00   
2 56.75     65.5      31.25     30.75   
3 58.50     66.0      38.50     38.50   
4 58.50     66.0      38.50     38.50   
5 60.25     66.5      45.75     46.25   
6 62.00     67.0      53.00     54.00   

[[2]]
# A tibble: 6 x 4
  Weight_V1 Weight_V2 Weight_V3 Body_Fat
  <table>   <table>   <table>   <table> 
1 44.00     71.00     46.00     12.0    
2 45.75     73.75     47.75     19.5    
3 47.50     76.50     49.50     27.0    
4 47.50     76.50     49.50     27.0    
5 49.25     79.25     51.25     34.5    
6 51.00     82.00     53.00     42.0  
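One thing to note about the output above is that the statistic labels (Min., Mean, ...) and the group names are lost. A possible variation, sketched here with group_modify() from dplyr and the DF from the reprex, keeps both in a single data frame. The `stat` column name and the `labelled` object are made up for illustration, and this is only one way to do it. Also, as far as I know, newer dplyr releases (1.1.0 onward) warn when summarize() returns more than one row per group and point to reframe() instead, so the summarize() call above may emit a deprecation warning there.

```r
library(dplyr)

DF <- data.frame(Product   = c("A", "A", "B", "B"),
                 Weight_V1 = c(55, 62, 51, 44),
                 Weight_V2 = c(65, 67, 71, 82),
                 Weight_V3 = c(24, 53, 53, 46),
                 Body_Fat  = c(54, 23, 42, 12))

# For each group, build a 6-row data frame: one label column plus
# one numeric column per variable holding the summary() values.
labelled <- DF %>%
  group_by(Product) %>%
  group_modify(~ data.frame(
    stat = names(summary(.x[[1]])),                     # "Min.", "1st Qu.", ...
    lapply(.x, function(col) as.numeric(summary(col)))  # values for every column
  ))

labelled
```

The result has one row per Product/statistic combination, which is usually easier to display in a Shiny table than a list of tibbles.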

This is the answer that I was looking for! Thank you so much! I have one question: could you explain why there must be the symbol "~" before the function summarize? I still don't quite understand it.

The ~ shorthand was introduced in, I think, purrr to facilitate concise function statements, similar to "lambda" functions in Python. The following two incantations are equivalent, the first using the lambda notation, the other the more traditional R anonymous function.

library(purrr)

map_dbl(1:5, ~.x + 10)               # formula ("lambda") shorthand
map_dbl(1:5, function(x) { x + 10 }) # traditional anonymous function

The equivalent for group_map would look like

DF %>% 
  group_by(Product) %>% 
  group_map(function(x, y) {
    summarize(x, across(everything(), summary))
  })

It needs 2 arguments, here x and y: x for the rows of the current group, and y for the group key (at least, I think that's the reason)

Edit
An extra note is that .x and .y are automatically recognized as the first and second inputs in the "lambda" incantation.
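To make the roles of the two arguments a bit more concrete, here is a small sketch (the object names `with_function` and `with_formula` are just for illustration, and the data is a cut-down version of the reprex). It uses the group key to label each result, once with an explicit function and once with the ~ shorthand.

```r
library(dplyr)

DF <- data.frame(Product   = c("A", "A", "B", "B"),
                 Weight_V1 = c(55, 62, 51, 44))

# Explicit anonymous function: x holds the rows of one group,
# y is a one-row tibble holding that group's key (here: Product).
with_function <- DF %>%
  group_by(Product) %>%
  group_map(function(x, y) paste(y$Product, "has", nrow(x), "rows"))

# Same thing with the formula shorthand: .x and .y play the roles of x and y.
with_formula <- DF %>%
  group_by(Product) %>%
  group_map(~ paste(.y$Product, "has", nrow(.x), "rows"))

unlist(with_function)
unlist(with_formula)
```

Both calls should return a list with one element per group, so the two notations are interchangeable here.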


Thank you for the explanation, that's really helpful! I have one problem with the function tapply(). I tried to use the iris data and want to separate the summary by the column "Species". Here is the code:

data <- iris
tapply(data[-5],data$Species,summary)

but I get this error:

Error in tapply(data[-5], data$Species, summary) :
arguments must have same length

Could you help me with this one? I have tried to check the length of each variable, all have 150.

One other thing: I have tried using group_by/tidyverse, but it keeps giving me this error:

Error in across(everything(), summary) : could not find function "across"

I have installed the dplyr package, but it still keeps showing me this error.
Thank you in advance!

Hmm, not sure right now why tapply throws that error, but as always there are many ways to get the same result

by()

This will create a list of class by, but you can extract the element in the usual way

by(iris[-5], iris[5], summary) #[1:3] # add the subset to extract each list element

split()

Split is kind of like group_by, we can split by the factor and then use lapply to loop over each subset

lapply(split(iris[-5], iris[5]), summary)
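As a variation on the split() approach, and closer to the sapply() call from the original question, you can run sapply(..., summary) inside each subset. This is only a sketch (the name `per_group` is made up); it returns a list of plain numeric matrices with the statistic names as row names, rather than the formatted text tables that summary() gives on a whole data frame.

```r
# Split the numeric columns by species, then summarise every column
# of each subset with sapply(), as in the ungrouped version.
per_group <- lapply(split(iris[-5], iris$Species),
                    function(d) sapply(d, summary))

per_group$setosa                          # 6 x 4 numeric matrix for one group
per_group$setosa["Mean", "Sepal.Length"]  # pick out a single statistic
```

Because the values stay numeric, they can be rounded or passed on to other code without parsing strings.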

across()

As for across, it is exported from dplyr, so you have to use library(dplyr) before you can use it. If you have attached dplyr, maybe you have an older version? Try updating it.
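Coming back to the tapply error: a likely explanation (assuming an R version where tapply() simply compares length(X) with the length of the index, which is what the error message suggests; newer R releases may handle data frames differently) is that the length of a data frame is its number of columns, not rows. In the DF example both happen to be 4, so the check passes by coincidence; with iris it is 4 against 150. The sketch below only inspects the lengths and does not call tapply() itself.

```r
DF <- data.frame(Product   = c("A", "A", "B", "B"),
                 Weight_V1 = c(55, 62, 51, 44),
                 Weight_V2 = c(65, 67, 71, 82),
                 Weight_V3 = c(24, 53, 53, 46),
                 Body_Fat  = c(54, 23, 42, 12))

length(DF[-1])        # 4 columns
length(DF$Product)    # 4 rows, so the lengths match by coincidence

length(iris[-5])      # 4 columns
length(iris$Species)  # 150 rows, so the lengths differ
```

If that is indeed the cause, by() or split() as shown above are the safer choice for data frames.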

Now it works fine, I just updated to the newest version. Thanks a lot! Do you know of any way to pass different functions to FUN instead of using summary()? For example, I have tried:

by(iris[-5],iris[5],summarise_all(list(mean=mean,median=median,q1=quantile(.,probs=0.25),q3=quantile(.,probs=0.75))))

but it doesn't work. Do you have any idea? Sorry for asking a lot of questions.

First of all, it feels like I might be doing your homework for you? If this is an assignment (which is fine - this community is here to help) please read this

  1. Yes you can use any function you like in FUN
  2. Doing it as you are, is using an "anonymous" function - you might find it easier to visualize if you pull the function out and create a standalone function.
  3. Some functions (e.g. summary) are quite happy to work on a data frame. Others, like mean and median, only work on a vector, so you can't say, for example, mean(iris[1:4]).
    • To do what you are doing you would have to loop over each column, then loop over each function, while looping over each group of the data: doable but troublesome.
  4. Think about what I said earlier: tidyverse it will work better with summarize
    • While the tidyverse functions will often play fine with base R, they are really designed to work with other tidyverse functions, and will perform far better doing so.
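To illustrate points 2 and 4 together, here is a rough sketch of the tidyverse route. It assumes dplyr 1.0.0 or later (for across()), and the helper names `q1`, `q3` and `stats_by_species` are made up for this example. The quantile calls are pulled out into standalone functions and then passed to across() as a named list.

```r
library(dplyr)

# Standalone helpers (hypothetical names) so every list entry is a plain function
q1 <- function(x) quantile(x, probs = 0.25, names = FALSE)
q3 <- function(x) quantile(x, probs = 0.75, names = FALSE)

stats_by_species <- iris %>%
  group_by(Species) %>%
  summarise(across(everything(),
                   list(mean = mean, median = median, q1 = q1, q3 = q3)))

# One row per species; columns are named like Sepal.Length_mean, Sepal.Length_q1, ...
stats_by_species
```

Each function returns a single value per column and group, so the result is one row per species, much like the PROC MEANS output with a BY variable.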

Understood! Thank you for your help!