group_by() all columns except two, summarise() those two into one

bearloga · September 23, 2020, 10:27pm

Hi folks!

Short version

After checking out the colwise and grouping vignettes, I still have no idea how to perform a group_by all columns except two and then summarize those two columns into one.

Input:

tribble(
  ~id, ~q1, ~q2, ~q3, ~v3, ~q4,
  0, "a", "b", "x", 1, "c",
  0, "a", "b", "y", 2, "c",
  0, "a", "b", "z", 3, "c"
)

Desired output:

tribble(
  ~id, ~q1, ~q2, ~qv3, ~q4,
  0, "a", "b", list("x" = 1, "y" = 2, "z" = 3), "c"
)

All the documentation keeps pointing me to using across() inside summarise() and I looked into using {tidyselect} with those and at least that way I can group_by(id) and then use across(!contains("3"), head, n = 1L) to avoid grouping by all the other columns except q3 and v3, but it doesn't look like I can use a two-parameter function that would operate on q3 an v3. Maybe I'm missing something…

Also as a rule of thumb I'm trying to stay away from any superseded functions for future-proofing and let's say manually grouping by isn't an option because I'm dealing with 20+ columns with repeated values in the rows and just need to collapse rows across two columns into 1 cell.

Any help would be greatly appreciated!!!

Long version

Context/background: I'm working with survey responses from Google Forms and one of the questions was a check matrix/grid, and the way Google turns that into a spreadsheet is by making each row in the grid a column in the CSV and then the values the user selected become a concatenated list of values in the cell. For example:

package	use case 1	use case 2	use case 3
dplyr	x		x
tidyr	x	x

The CSV of survey responses would then have

package_dplyr	package_tidyr
use case 1; use case 3	use case 1; use case 2

So I used tidyr::pivot_longer() to collect those columns together:

... %>%
  pivot_longer(
    cols = starts_with("package_"),
    names_to = "package",
    values_to = "use_cases",
    names_prefix = "package_"
  )

But that makes

package	use_cases
dplyr	use case 1; use case 3
tidyr	use case 1; use case 2

And all the other questions & responses turn into repeated rows, and I want to have 1 row per survey responder. So what I'm asking about is summarise()-ing (with the intention to then dplyr::mutate() & purrr::map()) those into:

package_use_cases
`list(dplyr = c(1, 3), tidyr = c(1, 2))`

Thank you!!!

bearloga · September 23, 2020, 10:37pm

I feel like I'm close with this:

df <- tribble(
  ~id, ~q1, ~q2, ~q3, ~v3, ~q4,
  0, "a", "b", "x", 1, "c",
  0, "a", "b", "y", 2, "c",
  0, "a", "b", "z", 3, "c"
)
df %>%
  group_by(id) %>%
  summarize(
    across(!contains("3"), head, n = 1L),
    qv3 = map2(q3, v3, ~ set_names(as.list(.y), .x))
  )

`summarise()` regrouping output by 'id' (override with `.groups` argument)
# A tibble: 3 x 5
# Groups:   id [1]
     id q1    q2    q4    qv3             
  <dbl> <chr> <chr> <chr> <list>          
1     0 a     b     c     <named list [1]>
2     0 a     b     c     <named list [1]>
3     0 a     b     c     <named list [1]>

But not sure why those aren't being collapsed to 1 row???

grrrck · September 24, 2020, 12:18am

Okay, so I have to admit up front that I'm "cheating" a bit by solving your problem but not answering your question

The tl;dr is that in problems like these I like to nest the problematic columns, work on them as if they were a small little data frame using map() and then unnest them back into the parent row or table. Here's how I would solve this.

library(tidyverse)

Ignore the chunk in here, I’m just setting up a way to preview each step…

`%T>%` <- magrittr::`%T>%`
i <- 0
preview <- function(...) {
  i <<- i + 1
  cat("\nStep ", i, "----\n")
  print(...)
}

Here’s the example Google Forms results with an extra one (or more) non-package use case columns

survey <- tibble::tribble(
  ~ person,          ~package_dplyr,           ~package_tidyr,
         1, "use case 1; use case 3", "use case 1; use case 2",
         2, "use case 2; use case 3", "use case 1; use case 3"
)

If we had just a data frame with the package_ columns, and just a single respondent, I would do something like this…

survey[1, -1] %T>% preview() %>% 
  pivot_longer(
    cols = everything(),
    names_to = "package",
    values_to = "use_cases",
    names_prefix = "package_"
  ) %T>% preview() %>% 
  separate_rows(use_cases, sep = "; ?") %T>% preview() %>% 
  mutate(use_cases = as.integer(str_remove(use_cases, "use case "))) %T>% preview() %>% 
  pivot_wider(
    names_from = package, 
    values_from = use_cases, 
    values_fn = list
  ) %>% preview()
#> 
#> Step  1 ----
#> # A tibble: 1 x 2
#>   package_dplyr          package_tidyr         
#>   <chr>                  <chr>                 
#> 1 use case 1; use case 3 use case 1; use case 2
#> 
#> Step  2 ----
#> # A tibble: 2 x 2
#>   package use_cases             
#>   <chr>   <chr>                 
#> 1 dplyr   use case 1; use case 3
#> 2 tidyr   use case 1; use case 2
#> 
#> Step  3 ----
#> # A tibble: 4 x 2
#>   package use_cases 
#>   <chr>   <chr>     
#> 1 dplyr   use case 1
#> 2 dplyr   use case 3
#> 3 tidyr   use case 1
#> 4 tidyr   use case 2
#> 
#> Step  4 ----
#> # A tibble: 4 x 2
#>   package use_cases
#>   <chr>       <int>
#> 1 dplyr           1
#> 2 dplyr           3
#> 3 tidyr           1
#> 4 tidyr           2
#> 
#> Step  5 ----
#> # A tibble: 1 x 2
#>   dplyr     tidyr    
#>   <list>    <list>   
#> 1 <int [2]> <int [2]>

Using tidyr::nest() with purrr:map() we can do the above steps for each small package use case tibble from every respondent.

survey_packaged <- 
  survey %>% 
  nest(package_use_cases = contains("package_")) %>% 
  mutate(
    package_use_cases = map(package_use_cases, function(pkg_use) {
      pkg_use %>% 
        pivot_longer(
          cols = everything(),
          names_to = "package",
          values_to = "use_cases",
          names_prefix = "package_"
        ) %>% 
        separate_rows(use_cases, sep = "; ?") %>% 
        mutate(use_cases = as.integer(str_remove(use_cases, "use case "))) %>% 
        pivot_wider(
          names_from = package, 
          values_from = use_cases, 
          values_fn = list
        ) %>% 
        as.list()
    })
  )

In the end, the package use cases are tucked nicely into one cell each.

survey_packaged
#> # A tibble: 2 x 2
#>   person package_use_cases
#>    <dbl> <list>           
#> 1      1 <named list [2]> 
#> 2      2 <named list [2]>
str(survey_packaged$package_use_cases[[1]])
#> List of 2
#>  $ dplyr:List of 1
#>   ..$ : int [1:2] 1 3
#>  $ tidyr:List of 1
#>   ..$ : int [1:2] 1 2

joels · September 24, 2020, 12:44am

Here are a couple of options. Let me know if these are on the right track.

# Set up fake data
library(tidyverse)

d = tribble(
  ~id, ~q1, ~q2, ~q3, ~v3, ~q4,
  0, "a", "b", "x", 1, "c",
  0, "a", "b", "y", 2, "c",
  0, "a", "b", "z", 3, "c"
)

Option 1. In summarise, return a list within a named list. The output is as requested in your example:

d2 = d %>% 
  group_by(across(-ends_with("3"))) %>% 
  summarise(qv3 = list(as.list(set_names(v3, q3))))

d2

# A tibble: 1 x 5
# Groups:   id, q1, q2 [1]
     id q1    q2    q4    qv3             
  <dbl> <chr> <chr> <chr> <list>          
1     0 a     b     c     <named list [3]>

d2$qv3

[[1]]
[[1]]$x
[1] 1

[[1]]$y
[1] 2

[[1]]$z
[1] 3

# Recover the original data frame
d2 %>% unnest(qv3) %>% unnest(qv3)

Option 2. In this case, the nested objects are data frames with only one level of nesting:

d3 = d %>% 
  group_nest(across(-ends_with("3")), .key="qv3")

# group_nest() is experimental, but this is equivalent
d3 = d %>% 
  group_by(across(-ends_with("3"))) %>% 
  nest(qv3=c(q3, v3))

d3

     id q1    q2    q4                   qv3
  <dbl> <chr> <chr> <chr> <list<tbl_df[,2]>>
1     0 a     b     c                [3 × 2]

d3$qv3

<list_of<
  tbl_df<
    q3: character
    v3: double
  >
>[1]>
[[1]]
# A tibble: 3 x 2
  q3       v3
  <chr> <dbl>
1 x         1
2 y         2
3 z         3

# Recover the original data frame
d3 %>% unnest(qv3)

kentmkevin · September 24, 2020, 1:01am

Is this what you are looking for? I think the main issue with what you did is that it's iterating through each row with the map before combining them into one list. So splitting into two steps seems to do the trick

survey_data <- tribble(
  ~id, ~q1, ~q2, ~q3, ~v3, ~q4,
  0, "a", "b", "x", 1, "c",
  0, "a", "b", "y", 2, "c",
  0, "a", "b", "z", 3, "c"
)

survey_data %>%
  group_by(id) %>%
  summarise(across(c(q1:q2, q4), max), q3 = list(q3), v3 = list(v3)) %>%
  mutate(q3 = map2(q3, v3, .f = function(x, y) setNames(as.list(y), x))) %>%
  select(-v3)

bearloga · September 24, 2020, 1:23pm

Thank you @grrrck @joels @kentmkevin! Interesting and varied solutions; looks like I'm gonna have to learn more about nest/unnest.

@joels: Option 1 with group_by(across(-ends_with("3"))) is exactly it, thanks! I didn't realize you can use across() inside group_by() like that, so that blew my mind!

system · October 1, 2020, 1:23pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.