Incorrect value of n() when nested in map() and group_by()

kent37 · August 22, 2018, 8:24pm

Using dplyr::n() to count the number of elements in groups within nested data frames doesn't seem to give the correct value; it counts the number of rows in the top-level data frame rather than the number of rows in the group.

This is the simplest reprex I could come up with. And yes, I know I could do this with group_by(field, sub), but in my real example the nested data frames seemed more convenient. It seems to me this should work; am I missing something?

Kent

suppressMessages(library(dplyr))
library(purrr)
library(tidyr)

# Data frame with three primary groups of size 10.
# Each contains five sub-groups of size 2
d = data_frame(field=rep(1:3, each=10), sub=rep(1:5, 6), value=1:30)
d
#> # A tibble: 30 x 3
#>    field   sub value
#>    <int> <int> <int>
#>  1     1     1     1
#>  2     1     2     2
#>  3     1     3     3
#>  4     1     4     4
#>  5     1     5     5
#>  6     1     1     6
#>  7     1     2     7
#>  8     1     3     8
#>  9     1     4     9
#> 10     1     5    10
#> # ... with 20 more rows

# Nest to make a separate data frame per field
dd = d %>% nest(-field)
dd
#> # A tibble: 3 x 2
#>   field data             
#>   <int> <list>           
#> 1     1 <tibble [10 x 2]>
#> 2     2 <tibble [10 x 2]>
#> 3     3 <tibble [10 x 2]>

# I want a count of the number of items in each sub-group.
# Do this using group_by(), summarize() and n().
# It works if I just process one element of `data`.
# Here `count` has the correct value (2)

dd$data[[1]] %>% group_by(sub) %>% summarize(count=n(), mean=mean(value))
#> # A tibble: 5 x 3
#>     sub count  mean
#>   <int> <int> <dbl>
#> 1     1     2   3.5
#> 2     2     2   4.5
#> 3     3     2   5.5
#> 4     4     2   6.5
#> 5     5     2   7.5

# When the same summary operations are applied to the entire `data` column using `map`,
# the count is nrow(dd) rather than the size of the subgroup.
dd %>% 
  mutate(result = map(data, ~.x %>% group_by(sub) %>% 
                        summarize(count=n(), mean=mean(value)))) %>% 
  select(-data) %>% unnest 
#> # A tibble: 15 x 4
#>    field   sub count  mean
#>    <int> <int> <int> <dbl>
#>  1     1     1     3   3.5
#>  2     1     2     3   4.5
#>  3     1     3     3   5.5
#>  4     1     4     3   6.5
#>  5     1     5     3   7.5
#>  6     2     1     3  13.5
#>  7     2     2     3  14.5
#>  8     2     3     3  15.5
#>  9     2     4     3  16.5
#> 10     2     5     3  17.5
#> 11     3     1     3  23.5
#> 12     3     2     3  24.5
#> 13     3     3     3  25.5
#> 14     3     4     3  26.5
#> 15     3     5     3  27.5

Created on 2018-08-22 by the reprex package (v0.2.0).

mara · August 22, 2018, 8:34pm

Yep, there's an issue open in the dplyr repo, and it's being actively worked on:

github.com/tidyverse/dplyr

Using n() in nested mutate()/summarize() calls gives unexpected results

opened 05:19PM - 19 Aug 16 UTC by mwillumz · closed 07:42AM - 14 Sep 18 UTC · label: bug

When transitioning from by_row() to map() approach I've found that several dplyr/purrr/tidyr functions do not evaluate within the map() environment. For instance below I was expecting the value returned by n() in the map() example to match that of the by_row() version. Instead it returns the number of rows of the nested input `Temp`. This might be intended but I can't think of an obvious way to use dplyr::n() on nested tibbles via map().

```
library(dplyr); library(purrr); library(tidyr)
data(iris)

Temp <- iris %>% group_by(Species) %>% nest()

ByRow <- by_row(Temp, function(x){
  x$data[[1]] %>%
    filter(Sepal.Length >= 6) %>%
    summarise(petal_length_avg = mean(Petal.Length), obs = n())
}, .to = 'test')
ByRow$test

Map <- mutate(Temp, test = map(data, . %>%
  filter(Sepal.Length >= 6) %>%
  summarise(petal_length_avg = mean(Petal.Length), obs = n())))
Map$test
```

kent37 · August 22, 2018, 8:54pm

OK, thanks for letting me know! A workaround is to pass a named function to map() instead of an inline formula. This gives the correct result:

summy = function(x) x %>% group_by(sub) %>% 
  summarize(count=n(), mean=mean(value))

dd %>% 
  mutate(result = map(data, summy)) %>% 
  select(-data) %>% unnest
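
Another possible workaround, not confirmed anywhere in this thread, is to avoid n() entirely and count rows with length(value), which is evaluated per group like any ordinary expression. The sketch below rebuilds the same data and checks that every sub-group has 2 rows. The helper name `summarize_sub` is hypothetical, and the code assumes a recent tidyr (the `nest(data = -field)` and `unnest(result)` forms) rather than the 2018 versions used above.

```r
library(dplyr)
library(purrr)
library(tidyr)

# Same shape as the data in the original post:
# 3 fields, each with 5 sub-groups of 2 rows
d <- tibble(field = rep(1:3, each = 10), sub = rep(1:5, 6), value = 1:30)

# Hypothetical helper: counts rows with length(value) instead of n()
summarize_sub <- function(x) {
  x %>%
    group_by(sub) %>%
    summarize(count = length(value), mean = mean(value))
}

res <- d %>%
  nest(data = -field) %>%                      # one nested tibble per field
  mutate(result = map(data, summarize_sub)) %>%
  select(-data) %>%
  unnest(result)

# Sanity check: 15 sub-groups in total, each containing exactly 2 rows
stopifnot(nrow(res) == 15, all(res$count == 2L))
```

If the check passes silently, the counts match the single-element result shown earlier (2 per sub-group) rather than nrow(dd).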