 # Combining sapply() with group_by()

Hello everyone, I am currently trying to combine the function sapply() with group_by(). Basically, I just want to compute simple descriptive summary statistics (mean, median, min, max, etc.) for each column/variable in the data. Before applying it in R Shiny, I first did it in the R console, and it looks like this:

But now I want to split it by the group Product (A and B), so that it looks something like this:
(This is just an example with V1 and V2 as the columns/variables, grouped by Placebo and Extract.)

In SAS, with PROC MEANS, it looks like this:

``````PROC MEANS N MIN Q1 MEDIAN Q3 MEAN STD MAXDEC=3 DATA=bref.neudaten;
VAR Weight_V1 Weight_V2 Weight_V3;
BY Product;
RUN;
``````

These are examples in Excel and in SAS. Can anyone help me implement this in R? Thank you!

Create a reprex (FAQ: How to do a minimal reproducible example (reprex) for beginners) and it will be much easier to help you.

Sorry for the late reply and for the confusing request. Here is the reprex; hopefully it helps:

``````library(dplyr)

DF <- data.frame(Product=c("A","A","B","B"),
Weight_V1=c(55,62,51,44),
Weight_V2=c(65,67,71,82),
Weight_V3=c(24,53,53,46),
Body_Fat=c(54,23,42,12))

#Summary for all variables without group_by(), it works fine
sapply(DF[-1], function(x) summary(x))

#Summary for all variable with Product as CLASS, But I am not quite sure how
sapply(DF[-1] %>% group_by("Product"), function(x) summary(x))

``````

I want to have summary for all the variables classified by the Product A and B, just like what I did in SAS above. Thanks for your help!

This is the kind of thing `tapply` is more suited to

``````tapply(DF[-1], DF$Product, summary)

$A
Weight_V1       Weight_V2      Weight_V3        Body_Fat
Min.   :55.00   Min.   :65.0   Min.   :24.00   Min.   :23.00
1st Qu.:56.75   1st Qu.:65.5   1st Qu.:31.25   1st Qu.:30.75
Median :58.50   Median :66.0   Median :38.50   Median :38.50
Mean   :58.50   Mean   :66.0   Mean   :38.50   Mean   :38.50
3rd Qu.:60.25   3rd Qu.:66.5   3rd Qu.:45.75   3rd Qu.:46.25
Max.   :62.00   Max.   :67.0   Max.   :53.00   Max.   :54.00

$B
Weight_V1       Weight_V2       Weight_V3        Body_Fat
Min.   :44.00   Min.   :71.00   Min.   :46.00   Min.   :12.0
1st Qu.:45.75   1st Qu.:73.75   1st Qu.:47.75   1st Qu.:19.5
Median :47.50   Median :76.50   Median :49.50   Median :27.0
Mean   :47.50   Mean   :76.50   Mean   :49.50   Mean   :27.0
3rd Qu.:49.25   3rd Qu.:79.25   3rd Qu.:51.25   3rd Qu.:34.5
Max.   :51.00   Max.   :82.00   Max.   :53.00   Max.   :42.0
``````
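If the matrix layout that `sapply` produces is preferred, one option is to split the data by group first and then run the same `sapply` call on each piece. This is a minimal base R sketch using the reprex data; the name `summary_by_product` is only for illustration.

```r
# Reprex data from the question
DF <- data.frame(Product   = c("A", "A", "B", "B"),
                 Weight_V1 = c(55, 62, 51, 44),
                 Weight_V2 = c(65, 67, 71, 82),
                 Weight_V3 = c(24, 53, 53, 46),
                 Body_Fat  = c(54, 23, 42, 12))

# split() cuts the numeric columns into one data frame per Product;
# sapply() then summarises every column within each piece
summary_by_product <- lapply(split(DF[-1], DF$Product),
                             function(d) sapply(d, summary))

# One matrix per group: statistics in rows, variables in columns
summary_by_product$A
summary_by_product$B
```

Each list element can be used like the ungrouped `sapply` result, for example `summary_by_product$A["Mean", "Weight_V1"]`.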

If you're married to using `group_by` / `tidyverse`, it will work better with `summarize` and could be done like this:

``````DF %>%
  group_by(Product) %>%
  group_map(~summarize(.x, across(everything(), summary)))

[[1]]
# A tibble: 6 x 4
Weight_V1 Weight_V2 Weight_V3 Body_Fat
<table>   <table>   <table>   <table>
1 55.00     65.0      24.00     23.00
2 56.75     65.5      31.25     30.75
3 58.50     66.0      38.50     38.50
4 58.50     66.0      38.50     38.50
5 60.25     66.5      45.75     46.25
6 62.00     67.0      53.00     54.00

[[2]]
# A tibble: 6 x 4
Weight_V1 Weight_V2 Weight_V3 Body_Fat
<table>   <table>   <table>   <table>
1 44.00     71.00     46.00     12.0
2 45.75     73.75     47.75     19.5
3 47.50     76.50     49.50     27.0
4 47.50     76.50     49.50     27.0
5 49.25     79.25     51.25     34.5
6 51.00     82.00     53.00     42.0
``````

This is the answer that I was looking for! Thank you so much! I have one question: could you explain why the symbol "~" is needed in front of the function summarize? I still don't quite understand it.

The `~` was introduced in, I think, `purrr` to facilitate concise function statements, similar to "lambda" functions in Python. The following two incantations are equivalent: the first uses the lambda notation, the second the more traditional R anonymous function.

``````map_dbl(1:5, ~.x + 10)
map_dbl(1:5, function(x) { x + 10 })
``````

The equivalent for `group_map` would look like

``````DF %>%
  group_by(Product) %>%
  group_map(function(x, y) {
    summarize(x, across(everything(), summary))
  })
``````

It needs 2 arguments, here `x` and `y`: `x` for the rows of the current group, and `y` for the group key (at least, I think that's the reason).

Edit
An extra note is that `.x` and `.y` are automatically recognized as the first and second inputs in the "lambda" incantation.
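To see what the two arguments hold, here is a small sketch that puts the key taken from `.y` next to a value computed from `.x`, once with the lambda notation and once with an ordinary function. It reuses two columns of the reprex data; the message text and the names `with_lambda` and `with_function` are made up for illustration.

```r
library(dplyr)

DF <- data.frame(Product   = c("A", "A", "B", "B"),
                 Weight_V1 = c(55, 62, 51, 44))

# Lambda notation: .x is the data for one group,
# .y is a one-row tibble holding that group's key
with_lambda <- DF %>%
  group_by(Product) %>%
  group_map(~ paste0("Product ", .y$Product,
                     ": mean Weight_V1 = ", mean(.x$Weight_V1)))

# The same thing written as a regular two-argument function
with_function <- DF %>%
  group_by(Product) %>%
  group_map(function(x, y) {
    paste0("Product ", y$Product,
           ": mean Weight_V1 = ", mean(x$Weight_V1))
  })

# Both forms return a list with one element per group
identical(with_lambda, with_function)
```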


Thank you for the explanation, that's really helpful! I have one problem with the function tapply(). I tried to use the iris data and want to separate the summary based on the column "Species". Here is the code:

``````data <- iris
tapply(data[-5], data$Species, summary)
``````

but I get this error:

Error in tapply(data[-5], data$Species, summary) :
arguments must have same length

Could you help me with this one? I have checked the length of each variable, and they all have 150.

One other thing: I have tried using group_by/tidyverse, but it keeps showing me this error:

Error in across(everything(), summary) : could not find function "across"

I have installed the dplyr package, but it still keeps showing me this error.

Hmm, not sure right now why `tapply` throws that error, but as always there are many ways to get the same result

`by()`

This will create a list of class `by`, but you can extract the element in the usual way

``````by(iris[-5], iris$Species, summary) #[1:3] # add the subset to extract each list element
``````

`split()`

Split is kind of like `group_by`, we can split by the factor and then use `lapply` to loop over each subset

``````lapply(split(iris[-5], iris$Species), summary)
``````
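As for the `tapply` error itself, a likely explanation is that the length of a data frame is its number of columns, not its number of rows. This assumes an R version in which `tapply` compares `length(X)` with the length of the index; the behaviour may differ between R releases. The reprex `DF` has 4 rows and 4 remaining columns, so the lengths match by coincidence, while `iris` has 4 remaining columns and 150 rows:

```r
# Reprex data from the question
DF <- data.frame(Product   = c("A", "A", "B", "B"),
                 Weight_V1 = c(55, 62, 51, 44),
                 Weight_V2 = c(65, 67, 71, 82),
                 Weight_V3 = c(24, 53, 53, 46),
                 Body_Fat  = c(54, 23, 42, 12))

# length() of a data frame counts its columns
length(DF[-1])        # number of numeric columns
length(DF$Product)    # number of group labels: the same number here

length(iris[-5])      # number of numeric columns
length(iris$Species)  # number of group labels: one per row, so far more
```

Splitting by rows with `by()` or `split()`, as above, does not rely on that coincidence.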

`across()`

As for `across`, it is exported from `dplyr`, so you have to call `library(dplyr)` before you can use it. If you have attached `dplyr`, maybe you have an older version? Try updating it.
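A quick way to check whether the installed `dplyr` provides `across()` (it was added in dplyr 1.0.0) is to look at the package's exports. This is only a sketch, and `has_across` is an illustrative name.

```r
# TRUE only if dplyr is installed and exports across()
has_across <- requireNamespace("dplyr", quietly = TRUE) &&
  "across" %in% getNamespaceExports("dplyr")

if (has_across) {
  message("dplyr ", as.character(utils::packageVersion("dplyr")),
          " provides across()")
} else {
  message("across() not found; updating dplyr with ",
          "install.packages(\"dplyr\") may help")
}
```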

Now it works fine, I just updated to the newest version. Thanks a lot! Do you know of any way to pass different functions to FUN instead of using summary()? For example, I have tried:

``````by(iris[-5],iris,summarise_all(list(mean=mean,median=median,q1=quantile(.,probs=0.25),q3=quantile(.,probs=0.75))))
``````

but it doesn't work. Do you have any idea? Sorry for asking a lot of questions.

First of all, it feels like I might be doing your homework for you? If this is an assignment (which is fine - this community is here to help) please read this

1. Yes, you can use any function you like in `FUN`.
2. Doing it as you are is using an "anonymous" function - you might find it easier to visualize if you pull the function out and create a standalone function.
3. Some functions (e.g. `summary`) are quite happy to work on a data frame. Others, like `mean` / `median` etc., will only work on a vector, so you can't say, for example, `mean(iris[1:4])`. To do what you are doing you would have to loop over each column, then loop over each function, while looping over each group of the data - doable but troublesome.
4. Think about what I said earlier: with `tidyverse` it will work better with `summarize`.
   • While the `tidyverse` functions will often play fine with base R, they are really designed to work with other `tidyverse` functions, and will perform far better doing so.
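Points 2 to 4 could be sketched as follows. The helper name `my_stats` and the choice of statistics are assumptions for illustration. The first part stays in base R with `by()`; the second uses `across()` with a named list of functions, which assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Standalone function: takes ONE numeric vector, returns named statistics
my_stats <- function(x) {
  c(mean   = mean(x),
    median = median(x),
    q1     = unname(quantile(x, probs = 0.25)),
    q3     = unname(quantile(x, probs = 0.75)))
}

# Base R: by() loops over the groups, sapply() loops over the columns
base_result <- by(iris[-5], iris$Species, function(d) sapply(d, my_stats))
base_result[["setosa"]]   # 4 statistics (rows) x 4 variables (columns)

# tidyverse: one row per Species, one column per variable/statistic pair
tidy_result <- iris %>%
  group_by(Species) %>%
  summarise(across(everything(),
                   list(mean   = mean,
                        median = median,
                        q1     = ~ quantile(.x, probs = 0.25),
                        q3     = ~ quantile(.x, probs = 0.75))))
tidy_result
```

In the tidyverse result the columns are named after the variable and the statistic, for example `Sepal.Length_mean`, so no separate loop over columns and functions is needed.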

Understand! Thank you for your help!