tabyl works with map, but group_by and summarize don't

Hi,

I'm having trouble understanding how the map function works. I would often like to apply a function to multiple columns in a dataframe, but I sometimes run into problems with some of the dplyr verbs. Below is an example of a simple frequency tabulation by some variable(s). I have had good luck with janitor::tabyl, but not so much with group_by and summarize (or filter). Often what I want to do would take greater advantage of the summarize function than the examples below, so knowing how map might work with it would be useful.

Regards

suppressPackageStartupMessages({
  library(tidyverse)
  library(janitor)
})

# function for tabulating n by group using dplyr verbs
sum_n <- function(df, x) {
  df %>% 
    group_by({{x}}) %>% 
    summarise(n = n())
}

# function for tabulating n by group using tabyl
tabyl_n <- function(df, x) {
  df %>% 
    tabyl({{x}}) %>% 
    select(-percent) %>% 
    as_tibble()
}

# both produce the same output
# using summarize
mtcars %>% 
  sum_n(cyl)
#> # A tibble: 3 × 2
#>     cyl     n
#>   <dbl> <int>
#> 1     4    11
#> 2     6     7
#> 3     8    14
# using tabyl
mtcars %>% 
  tabyl_n(cyl)
#> # A tibble: 3 × 2
#>     cyl     n
#>   <dbl> <dbl>
#> 1     4    11
#> 2     6     7
#> 3     8    14

# but only the tabyl version works with map
# vector for map
sum_vars <- c("cyl", "am", "gear")

# tabyl works with map
map(sum_vars, ~tabyl_n(mtcars, .x))
#> Note: Using an external vector in selections is ambiguous.
#> ℹ Use `all_of(.x)` instead of `.x` to silence this message.
#> ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
#> This message is displayed once per session.
#> [[1]]
#> # A tibble: 3 × 2
#>     cyl     n
#>   <dbl> <dbl>
#> 1     4    11
#> 2     6     7
#> 3     8    14
#> 
#> [[2]]
#> # A tibble: 2 × 2
#>      am     n
#>   <dbl> <dbl>
#> 1     0    19
#> 2     1    13
#> 
#> [[3]]
#> # A tibble: 3 × 2
#>    gear     n
#>   <dbl> <dbl>
#> 1     3    15
#> 2     4    12
#> 3     5     5

# summarize produces this error
map(sum_vars, ~sum_n(mtcars, .x))
#> Error: Must group by variables found in `.data`.
#> * Column `.x` is not found.

Created on 2021-11-30 by the reprex package (v2.0.1)

Hey there!

If you look closely, you can see that you pass cyl in different ways in the one-column example and in your mapped example: the first time as a bare column name, using dplyr-style non-standard evaluation, and the second time as a character string. That's the problem right there. If you run

mtcars %>% 
  sum_n("cyl")

you won't get the same output as before.
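
To make the difference concrete, here is a small sketch that calls the original sum_n both ways and compares the results. The object names bare and quoted are only illustrative.

```r
library(dplyr)

# same definition as in the question: {{ x }} expects a bare column name
sum_n <- function(df, x) {
  df %>%
    group_by({{ x }}) %>%
    summarise(n = n())
}

bare   <- sum_n(mtcars, cyl)    # groups by the cyl column
quoted <- sum_n(mtcars, "cyl")  # groups by the constant string "cyl"

nrow(bare)    # 3: one row per distinct value of cyl
nrow(quoted)  # 1: every row falls into the same single group
quoted$n      # 32: all rows of mtcars counted together
```

With a string, group_by() does not look up a column; it creates one constant grouping value, so the whole data frame ends up in a single group.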

We can change your original function to accept column names as characters by using across() in the group_by():

sum_n <- function(df, x) {
  df %>% 
    group_by(across(all_of(x))) %>% 
    summarise(n = n())
}

Now if we run your mapped example it works as expected:

sum_vars <- c("cyl", "am", "gear")
purrr::map(sum_vars, ~sum_n(mtcars, .x))
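
As an optional extra, here is a sketch of the same mapped call with the list elements named after the variables, which can make the results easier to pick apart afterwards. It assumes a string-accepting sum_n that wraps the column name in all_of(); the name tabs is just a placeholder.

```r
library(dplyr)
library(purrr)

# variant of sum_n that takes the column name as a character string
sum_n <- function(df, x) {
  df %>%
    group_by(across(all_of(x))) %>%
    summarise(n = n())
}

sum_vars <- c("cyl", "am", "gear")

# set_names() with no second argument uses the values themselves as names,
# so each element of the result is labelled with its variable
tabs <- sum_vars %>%
  set_names() %>%
  map(~ sum_n(mtcars, .x))

names(tabs)
tabs$gear
```

You can then pull out a single table with tabs$cyl, tabs$am, or tabs$gear instead of indexing by position.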

across() is very useful, especially in combination with the tidyselect helper functions, because it lets you apply an operation to multiple columns dynamically. You can also use it in summarise(), for example to get the median of each of your sum_vars, so it's worth checking out.

mtcars %>%
  summarise(
    across(
      any_of(sum_vars),
      median)
    )

One last note: also check out group_map() and group_modify(), which apply a function to each group of a grouped data frame.
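
A minimal sketch of both, assuming you group mtcars by cyl; the summary columns n and mean_mpg are just examples I picked for illustration.

```r
library(dplyr)

by_cyl <- mtcars %>% group_by(cyl)

# group_modify(): the function must return a data frame for each group;
# the pieces are bound together and the grouping column is kept
mod <- by_cyl %>%
  group_modify(~ data.frame(n = nrow(.x), mean_mpg = mean(.x$mpg)))

# group_map(): the function can return anything;
# the result is a plain list with one element per group
lst <- by_cyl %>%
  group_map(~ nrow(.x))

mod
lst
```

Since group_map() returns a list, it may suit cases where each group should produce something that is not a data frame, such as a plot.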

Hope this helps.
Best,
Valentin

Great! That very much does help. I'm still learning about "small" things to look out for, like unintentionally passing a character string to a dplyr verb. And even if I had noticed, it would not have occurred to me to use across as a solution. So thanks for that. I do often use it with summarize as you pointed out, but in this case map is useful since I also have a plot as part of the actual function I'm using. And also thanks for the tip about group_map.
