Inconsistent grouping behavior with lists vs. non-lists?

So, I'm getting a little confused here as I increasingly work with nested lists inside tibble columns.

First, a simple, no-lists example. With this, I know mutate will operate row-by-row without having to specify group_by(), and it will produce values in the new RR column that are the same as in the data_var column:

library(tidyverse)

df <- tribble(
  ~name, ~data_var,
  "Jose", 5L,
  "Beth", 7L,
  "George", 10L
)

# RR column is correct
df %>%
  mutate(RR = data_var)
#> # A tibble: 3 x 3
#>   name   data_var    RR
#>   <chr>     <int> <int>
#> 1 Jose          5     5
#> 2 Beth          7     7
#> 3 George       10    10

Created on 2019-04-23 by the reprex package (v0.2.1)

Now for a really simple nested list example (the actual data I'm working on is more involved). Here, the first attempt returns 5 for all rows, which is not what I expected. The second attempt does what I expect it to do, but I have to call group_by() first.

My question is...why?

library(tidyverse)

df2 <- tribble(
  ~name, ~data_var,
  "Jose", list('RR' = 5L),
  "Beth", list('RR' = 7L),
  "George", list('RR' = 10L))

# RR column is incorrect, shouldn't be all 5. This is unexpected.
df2 %>%
  mutate(RR = pluck(data_var, 1, 1))
#> # A tibble: 3 x 3
#>   name   data_var      RR
#>   <chr>  <list>     <int>
#> 1 Jose   <list [1]>     5
#> 2 Beth   <list [1]>     5
#> 3 George <list [1]>     5

# RR column is now correct after using group_by(). But why?
df2 %>%
  group_by(name) %>%
  mutate(RR = pluck(data_var, 1, 1))
#> # A tibble: 3 x 3
#> # Groups:   name [3]
#>   name   data_var      RR
#>   <chr>  <list>     <int>
#> 1 Jose   <list [1]>     5
#> 2 Beth   <list [1]>     7
#> 3 George <list [1]>    10

Created on 2019-04-23 by the reprex package (v0.2.1)

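mutate doesn't operate row-wise. The expressions within it evaluate like normal R, with each column available as a whole vector taken from the dataset, or from the current group if the data is grouped. (I can guarantee what I just said is not the exact truth, but it works for now.) Your first example only looks row-wise because RR = data_var copies the entire column in one go.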
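A quick way to see this for yourself, as a rough sketch using the df2 from the question: ask mutate how long data_var is from inside the expression. The n_seen column name is made up purely for illustration.

```r
library(dplyr)

df2 <- tribble(
  ~name, ~data_var,
  "Jose", list(RR = 5L),
  "Beth", list(RR = 7L),
  "George", list(RR = 10L)
)

# Ungrouped: the expression sees the whole column, a list of length 3
ungrouped <- df2 %>%
  mutate(n_seen = length(data_var))
ungrouped$n_seen
#> [1] 3 3 3

# Grouped by name: the expression runs once per group
# and only sees that group's single row
grouped <- df2 %>%
  group_by(name) %>%
  mutate(n_seen = length(data_var))
grouped$n_seen
#> [1] 1 1 1
```

In both cases the expression returns a single number, and mutate recycles it to fill the rows it was computed for.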

So

df2 %>%
  mutate(RR = pluck(data_var, 1, 1))

Takes data_var, which is a list of length 3, and "plucks" the first element of its first element (i.e., df2[["data_var"]][[1]][[1]]), which is 5L. mutate then recycles that single value across all three rows, which is why every row shows 5. So that's working as the packages' authors intended.
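You can check that outside of mutate entirely. A small sketch, again assuming the df2 from the question:

```r
library(tibble)
library(purrr)

df2 <- tribble(
  ~name, ~data_var,
  "Jose", list(RR = 5L),
  "Beth", list(RR = 7L),
  "George", list(RR = 10L)
)

# pluck() walks into the column as a whole:
# first element of the column, then the first element of that
first_value <- pluck(df2$data_var, 1, 1)
first_value
#> [1] 5

# Same result as plain base R indexing
identical(first_value, df2[["data_var"]][[1]][[1]])
#> [1] TRUE

# Beth's value only comes out if you ask for the second element explicitly
second_value <- pluck(df2$data_var, 2, 1)
second_value
#> [1] 7
```

Nothing in pluck(data_var, 1, 1) refers to "the current row", so there is only ever one answer for the ungrouped data.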

The second example works because mutate follows a group_by. Each of the resulting subgroups has exactly one row. Only by coincidence does it work row-wise. Consider this example where it wouldn't:

library(dplyr)
library(purrr)

df3 <- tribble(
  ~name, ~data_var,
  "Jose", list('RR' = 5L),
  "Beth", list('RR' = 7L),
  "George", list('RR' = 10L),
  "George", list('RR' = 15L)
)

df3 %>%
  group_by(name) %>%
  mutate(RR = pluck(data_var, 1, 1))
# # A tibble: 4 x 3
# # Groups:   name [3]
#   name   data_var      RR
#   <chr>  <list>     <int>
# 1 Jose   <list [1]>     5
# 2 Beth   <list [1]>     7
# 3 George <list [1]>    10
# 4 George <list [1]>    10

The way to work with list columns is to use higher-order functions, which apply a function to each element of the list in turn. With purrr, you can do what you want with this:

df3 %>%
  mutate(RR = map_int(data_var, pluck, 1))
# # A tibble: 4 x 3
#   name   data_var      RR
#   <chr>  <list>     <int>
# 1 Jose   <list [1]>     5
# 2 Beth   <list [1]>     7
# 3 George <list [1]>    10
# 4 George <list [1]>    15
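
If the elements are named, as they are here, purrr's map functions will also take a name in place of a function, so you don't have to rely on position. This is only a sketch, and it assumes every element of data_var has an RR entry holding a single integer:

```r
library(dplyr)
library(purrr)

df3 <- tribble(
  ~name, ~data_var,
  "Jose", list(RR = 5L),
  "Beth", list(RR = 7L),
  "George", list(RR = 10L),
  "George", list(RR = 15L)
)

# A string in place of the function is shorthand for
# "extract the element with this name from each list element"
by_name <- df3 %>%
  mutate(RR = map_int(data_var, "RR"))
by_name$RR
#> [1]  5  7 10 15
```

map_int will raise an error rather than silently guess if some element has no RR or holds something other than a single integer, which is usually what you want with messy nested data.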

What a clear answer. Makes total sense, and thanks for suggesting using pluck inside a map_ function, that makes a lot of sense too!
