Hard to understand how group_by() works

library(dplyr)

set.seed(5)
testdata = tibble(x = sample(c("a", "b", "c"), size = 1000, replace = TRUE),
                  y = rnorm(1000),
                  z = rnorm(1000))
testdata %>% group_by(x) %>% dim()

Can anyone explain why the last line outputs the dimensions of the whole testdata rather than the dimensions of each of its three groups?

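group_by() doesn't change the data itself; it only attaches grouping information that group-aware functions, mainly dplyr verbs such as summarise(), mutate(), and count(), know how to use. dim() is a base R function, so it ignores the grouping and reports the dimensions of the whole tibble. Similarly, other base functions like nrow() or ncol() don't pay attention to whether group_by() has been run on the data.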

If you want to know the number of observations in each group, try using dplyr's count() instead:

testdata %>% group_by(x) %>% count()
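For what it's worth, here is a small sketch of the difference, reusing the testdata from the question. dim() reports the whole tibble, while dplyr's own helpers (n_groups(), group_size(), and summarise() with n()) respect the grouping. The variable names grouped, whole_dim, n_grp, sizes, and per_group are just for illustration.

```r
library(dplyr)

set.seed(5)
testdata <- tibble(x = sample(c("a", "b", "c"), size = 1000, replace = TRUE),
                   y = rnorm(1000),
                   z = rnorm(1000))

grouped <- testdata %>% group_by(x)

# Base R ignores the grouping: this is the size of the whole tibble
whole_dim <- dim(grouped)

# dplyr helpers read the grouping metadata
n_grp <- n_groups(grouped)    # number of groups
sizes <- group_size(grouped)  # number of rows in each group

# Same idea as count(): one row per group with its row count
per_group <- grouped %>% summarise(n = n())
```

The group sizes always add up to the total number of rows, which is a quick sanity check that nothing was dropped.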

Thank you for the reply. Actually, I wrote quite a complex function that takes a tibble (or data frame) as input and returns a tibble. I want to apply this function to each group of a tibble, where the groups are defined by a variable. Let's say the function I want to apply is testfunction, the data frame is testdata, and the grouping variable is label. What I want to do is something like this:

testdata %>% group_by(label) %>% testfunction

But this doesn't work as expected, because testfunction calls many built-in R functions internally. What would be the best solution in this case?

You could try split() instead of group_by().
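A minimal sketch of the split() approach, assuming the testdata from the first post (grouped by x, since that tibble has no label column). The testfunction below is a made-up stand-in for the real function, and pieces, results, and combined are illustrative names.

```r
library(dplyr)

set.seed(5)
testdata <- tibble(x = sample(c("a", "b", "c"), size = 1000, replace = TRUE),
                   y = rnorm(1000),
                   z = rnorm(1000))

# Hypothetical stand-in: takes a data frame, returns a one-row tibble
testfunction <- function(df) {
  tibble(rows = nrow(df), cols = ncol(df), mean_y = mean(df$y))
}

# Named list with one tibble per value of x
pieces <- split(testdata, testdata$x)

# Ordinary function call on each piece, so base R functions inside
# testfunction only ever see one group's rows
results <- lapply(pieces, testfunction)

# Stack the results; the list names become the column x
combined <- bind_rows(results, .id = "x")
```

Each piece still contains the x column, so testfunction sees all three columns here.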


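You can use group_modify() from dplyr for this (make sure you have version >= 0.8.1). group_modify() passes two arguments to the function, the rows of the current group (without the grouping column) and a one-row tibble holding the group key, so a function that takes a single data frame needs to be wrapped in a formula:

testdata %>%
  group_by(label) %>%
  group_modify(~ testfunction(.x))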

See the dplyr documentation for more on group_modify() and related functions such as group_map()!
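Here is a runnable sketch of this, assuming the testdata from the first post (grouped by x) and a made-up testfunction standing in for the real one. The formula form ~ testfunction(.x) is used so that a function taking a single data frame works; .x is the current group's rows without the grouping column.

```r
library(dplyr)

set.seed(5)
testdata <- tibble(x = sample(c("a", "b", "c"), size = 1000, replace = TRUE),
                   y = rnorm(1000),
                   z = rnorm(1000))

# Hypothetical stand-in: takes a data frame, returns a one-row tibble
testfunction <- function(df) {
  tibble(rows = nrow(df), cols = ncol(df), mean_y = mean(df$y))
}

# group_modify() expects a data frame back from each call and
# returns a grouped tibble with the key column x added in front
result <- testdata %>%
  group_by(x) %>%
  group_modify(~ testfunction(.x))

# group_map() returns a plain list instead, so it can hold anything,
# for example the dimensions of each group
dims <- testdata %>%
  group_by(x) %>%
  group_map(~ dim(.x))
```

Note that cols comes out as 2 rather than 3 here, because the grouping column is not part of .x. If testfunction needs that column, the split() approach keeps it in each piece.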

