Incorrect value of n() when nested in map() and group_by()

Using dplyr::n() to count the number of elements in groups within nested data frames doesn't seem to give the correct value; it counts the number of rows in the top-level data frame rather than the number of rows in the group.

This is the simplest reprex I could come up with. And yes, I know I could do this with group_by(field, sub). In my real example the nested data frames seemed more convenient. ISTM this should work; am I missing something?

Kent

suppressMessages(library(dplyr))
library(purrr)
library(tidyr)

# Data frame with three primary groups of size 10.
# Each contains five sub-groups of size 2
d = data_frame(field=rep(1:3, each=10), sub=rep(1:5, 6), value=1:30)
d
#> # A tibble: 30 x 3
#>    field   sub value
#>    <int> <int> <int>
#>  1     1     1     1
#>  2     1     2     2
#>  3     1     3     3
#>  4     1     4     4
#>  5     1     5     5
#>  6     1     1     6
#>  7     1     2     7
#>  8     1     3     8
#>  9     1     4     9
#> 10     1     5    10
#> # ... with 20 more rows

# Nest to make a separate data frame per field
dd = d %>% nest(-field)
dd
#> # A tibble: 3 x 2
#>   field data             
#>   <int> <list>           
#> 1     1 <tibble [10 x 2]>
#> 2     2 <tibble [10 x 2]>
#> 3     3 <tibble [10 x 2]>

# I want a count of the number of items in each sub-group.
# Do this using group_by(), summarize() and n().
# It works if I just process one element of `data`.
# Here `count` has the correct value (2)

dd$data[[1]] %>% group_by(sub) %>% summarize(count=n(), mean=mean(value))
#> # A tibble: 5 x 3
#>     sub count  mean
#>   <int> <int> <dbl>
#> 1     1     2   3.5
#> 2     2     2   4.5
#> 3     3     2   5.5
#> 4     4     2   6.5
#> 5     5     2   7.5

# When the same summary operations are applied to the entire `data` column using `map`,
# the count is nrow(dd) rather than the size of the subgroup.
dd %>% 
  mutate(result = map(data, ~.x %>% group_by(sub) %>% 
                        summarize(count=n(), mean=mean(value)))) %>% 
  select(-data) %>% unnest()
#> # A tibble: 15 x 4
#>    field   sub count  mean
#>    <int> <int> <int> <dbl>
#>  1     1     1     3   3.5
#>  2     1     2     3   4.5
#>  3     1     3     3   5.5
#>  4     1     4     3   6.5
#>  5     1     5     3   7.5
#>  6     2     1     3  13.5
#>  7     2     2     3  14.5
#>  8     2     3     3  15.5
#>  9     2     4     3  16.5
#> 10     2     5     3  17.5
#> 11     3     1     3  23.5
#> 12     3     2     3  24.5
#> 13     3     3     3  25.5
#> 14     3     4     3  26.5
#> 15     3     5     3  27.5

Created on 2018-08-22 by the reprex package (v0.2.0).

Yep, there's an issue open in the dplyr repo, and it's being actively worked on.

OK thanks for letting me know! A workaround is to pass a named function to map. This gives the correct result:

summy = function(x) x %>% group_by(sub) %>% 
  summarize(count=n(), mean=mean(value))

dd %>% 
  mutate(result = map(data, summy)) %>% 
  select(-data) %>% unnest()
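
For comparison, here is a minimal sketch of the flat alternative mentioned in the original post: grouping by both field and sub at once, with no nesting and no map(). It is only an illustration, not a fix for the nested case. The name `flat` is made up for this example, and the data is rebuilt with tibble() so the block runs on its own.

```r
suppressMessages(library(dplyr))

# Same data as in the reprex: three fields of 10 rows,
# each containing five sub-groups of 2 rows.
d <- tibble(field = rep(1:3, each = 10), sub = rep(1:5, 6), value = 1:30)

# Group on both columns so n() is evaluated directly by summarize(),
# not inside a lambda nested in mutate().
flat <- d %>%
  group_by(field, sub) %>%
  summarize(count = n(), mean = mean(value)) %>%
  ungroup()

flat
```

This yields one row per (field, sub) pair, which is the same shape as the unnested result above. It does not help when the nested layout is needed for other reasons, as in the real use case described in the question.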