A use-case for `tidyr::chop()`: "check all that apply" survey questions

11rchitwood · December 20, 2019, 2:51pm

TL;DR

chop(), when used in conjunction with other tidyr and dplyr functions, is useful for combining several related logical columns (a common data structure for "check all that apply" survey questions) into a single list-column of character vectors. The resulting data structure can be used with purrr functions to create other useful/interesting variables.

Setup

Consider the following pair of survey questions:

What is your race? Please check as many as apply.
American Indian or Alaskan Native
Asian
Black or African American
Native Hawaiian or other Pacific Islander
White or European American

Are you Hispanic, Latino, or Spanish in origin?

Yes
No

Data

These two questions result in the following data.

library(tidyverse)

df <- tribble(
  ~id, ~race_amind, ~race_asian, ~race_black, ~race_hiopi, ~race_white, ~hispanic,
    1,       FALSE,        TRUE,        TRUE,       FALSE,       FALSE,     FALSE,
    2,       FALSE,       FALSE,       FALSE,       FALSE,        TRUE,      TRUE,
    3,       FALSE,       FALSE,       FALSE,       FALSE,        TRUE,     FALSE
)
df
#> # A tibble: 3 x 7
#>      id race_amind race_asian race_black race_hiopi race_white hispanic
#>   <dbl> <lgl>      <lgl>      <lgl>      <lgl>      <lgl>      <lgl>   
#> 1     1 FALSE      TRUE       TRUE       FALSE      FALSE      FALSE   
#> 2     2 FALSE      FALSE      FALSE      FALSE      TRUE       TRUE    
#> 3     3 FALSE      FALSE      FALSE      FALSE      TRUE       FALSE

Pivot longer

The race question data is split across several columns with a race_ prefix. Ideally, we'd want to act on this as a single variable, so let's pivot the data to a longer format.

df_long <- df %>%
  pivot_longer(
    cols = starts_with("race_"),
    names_to = "race",
    names_pattern = "race_?(.*)"
  )
df_long
#> # A tibble: 15 x 4
#>       id hispanic race  value
#>    <dbl> <lgl>    <chr> <lgl>
#>  1     1 FALSE    amind FALSE
#>  2     1 FALSE    asian TRUE 
#>  3     1 FALSE    black TRUE 
#>  4     1 FALSE    hiopi FALSE
#>  5     1 FALSE    white FALSE
#>  6     2 TRUE     amind FALSE
#>  7     2 TRUE     asian FALSE
#>  8     2 TRUE     black FALSE
#>  9     2 TRUE     hiopi FALSE
#> 10     2 TRUE     white TRUE 
#> 11     3 FALSE    amind FALSE
#> 12     3 FALSE    asian FALSE
#> 13     3 FALSE    black FALSE
#> 14     3 FALSE    hiopi FALSE
#> 15     3 FALSE    white TRUE
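
If the regular expression in names_pattern feels heavier than needed, pivot_longer() also has a names_prefix argument that strips a leading prefix from the column names. Here is a minimal sketch of that alternative; df_demo and df_long_alt are illustrative names and a cut-down stand-in for the survey data, not part of the reprex above.

```r
library(tidyr)
library(dplyr)

# Small stand-in for the survey data (illustrative only)
df_demo <- tribble(
  ~id, ~race_asian, ~race_white, ~hispanic,
    1,        TRUE,       FALSE,     FALSE,
    2,       FALSE,        TRUE,      TRUE
)

# names_prefix removes the leading "race_" from each pivoted column name
df_long_alt <- df_demo %>%
  pivot_longer(
    cols = starts_with("race_"),
    names_to = "race",
    names_prefix = "race_"
  )
df_long_alt
```

Both spellings should give the same race values here; names_pattern is the more general tool when the column names encode more than a simple prefix.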

Filter shorter

Okay, great, but now we have a bunch of FALSEs where respondents didn't check a certain race. Let's filter those out.

df_short <- df_long %>%
  filter(value) %>%
  select(-value)
df_short
#> # A tibble: 4 x 3
#>      id hispanic race 
#>   <dbl> <lgl>    <chr>
#> 1     1 FALSE    asian
#> 2     1 FALSE    black
#> 3     2 TRUE     white
#> 4     3 FALSE    white

Chop

The dataframe is a bit shorter and more legible, but ethnicity is still repeated for respondents who checked more than one race. Here's where tidyr::chop() comes in.

df_chopped <- df_short %>%
  chop(race)
df_chopped
#> # A tibble: 3 x 3
#>      id hispanic race     
#>   <dbl> <lgl>    <list>   
#> 1     1 FALSE    <chr [2]>
#> 2     2 TRUE     <chr [1]>
#> 3     3 FALSE    <chr [1]>
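
One thing to watch for: a respondent who checked no race at all loses every row at the filter(value) step, so they never reach chop(). The sketch below shows one possible way to keep them, assuming a hypothetical respondent 4 who left the race question blank (df_blank and df_kept are illustrative names). It builds the list-column with group_by() and summarise() rather than filter() and chop(), which leaves a zero-length character vector for that respondent.

```r
library(tidyr)
library(dplyr)

# Hypothetical data: respondent 4 checked no race boxes
df_blank <- tribble(
  ~id, ~race_asian, ~race_white, ~hispanic,
    1,        TRUE,        TRUE,     FALSE,
    4,       FALSE,       FALSE,      TRUE
)

df_kept <- df_blank %>%
  pivot_longer(
    cols = starts_with("race_"),
    names_to = "race",
    names_prefix = "race_"
  ) %>%
  group_by(id, hispanic) %>%
  # race[value] keeps only the checked boxes; it is character(0) if none were checked
  summarise(race = list(race[value])) %>%
  ungroup()
df_kept
```

Whether a blank answer should be kept as an empty vector or treated as missing is a judgment call that depends on the survey.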

Here's the crux of my argument. The documentation for chop() and unchop() reads:

Generally, unchopping is more useful than chopping because it simplifies a complex data structure, and nest()ing is usually more appropriate than chop()ing since it better preserves the connections between observations.

I think the sentiment is right here, but list columns of character vectors can be useful too, especially when combined with purrr's flavor of functional programming.
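
To make the comparison with nest() concrete, here is a small sketch. The df_demo tibble is an illustrative stand-in for df_short, and chopped and nested are names chosen for this example. chop() stores a bare character vector per respondent, nest() stores a one-column tibble per respondent, and unchop() restores the long form.

```r
library(tidyr)
library(dplyr)

# Illustrative stand-in for df_short
df_demo <- tribble(
  ~id, ~hispanic, ~race,
    1,     FALSE, "asian",
    1,     FALSE, "black",
    2,      TRUE, "white"
)

chopped <- df_demo %>% chop(race)         # list-column of character vectors
nested  <- df_demo %>% nest(data = race)  # list-column of one-column tibbles

chopped$race[[1]]  # a plain character vector
nested$data[[1]]   # a tibble with a single column, race

# unchop() reverses chop(), giving back one row per checked race
unchopped <- chopped %>% unchop(race)
unchopped
```

Because each element of the chopped column is already an atomic vector, it can be passed straight to functions such as length() or any() inside a purrr map, without first pulling a column out of a nested tibble.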

Examples

Let's look at two examples where our chopped dataframe is a useful data structure.

Multi-race

First, let's recode respondents who checked more than one race into a multi-race category.

df_chopped %>%
  mutate(n_race = map_int(race, length)) %>%
  # map_chr(race, 1) takes each respondent's first race, giving one value per row
  mutate(race = ifelse(n_race == 1, map_chr(race, 1), "multi")) %>%
  select(id, hispanic, race)
#> # A tibble: 3 x 3
#>      id hispanic race 
#>   <dbl> <lgl>    <chr>
#> 1     1 FALSE    multi
#> 2     2 TRUE     white
#> 3     3 FALSE    white
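
When recoding from a list-column like this, it helps to remember that unlist() flattens across rows, so its positions stop lining up with respondents as soon as anyone has checked more than one race. The sketch below uses a bare list as an illustrative stand-in for the race column (race_label is a name chosen for this example) and builds the label one respondent at a time with purrr::map_chr().

```r
library(purrr)

# Illustrative list-column: respondent 1 checked two races
race <- list(c("asian", "black"), "white", "white")

length(unlist(race))  # 4 values for 3 respondents, so positions no longer match rows
lengths(race)         # per-respondent counts

# Build one label per respondent, element by element
race_label <- map_chr(race, ~ if (length(.x) == 1) .x else "multi")
race_label
```

Base R's lengths() is a drop-in alternative to map_int(race, length) if you prefer fewer purrr calls.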

Underrepresented Minority

Second, let's determine which respondents are underrepresented minorities (URM) based on hispanic and race. We'll define URM individuals as those selecting Hispanic or checking any non-white race.

df_chopped %>%
  mutate(any_non_white = map_lgl(race, ~any(. != "white"))) %>%
  mutate(urm = hispanic | any_non_white)
#> # A tibble: 3 x 5
#>      id hispanic race      any_non_white urm  
#>   <dbl> <lgl>    <list>    <lgl>         <lgl>
#> 1     1 FALSE    <chr [2]> TRUE          TRUE 
#> 2     2 TRUE     <chr [1]> FALSE         TRUE 
#> 3     3 FALSE    <chr [1]> FALSE         FALSE

Wrap-up

I think I've outlined a decent use case for this lesser-known function in tidyr. The code above, especially in the examples, expresses well what I'm trying to do. Do you agree? If you would tackle this problem differently, let me know!

Created on 2019-12-20 by the reprex package (v0.3.0)

joels · December 20, 2019, 7:10pm

I wasn't aware of chop until I read your post. Without chop I think I would probably do something like the code below, which seems more straightforward to me. I'll have to play around with chop and get a better feel for how I might use it.

df %>% 
  pivot_longer(
    cols = starts_with("race_"),
    names_to = "race",
    names_pattern = "race_?(.*)"
  ) %>% 
  filter(value) %>% 
  group_by(id, hispanic) %>% 
  summarise(urm = any(race != "white" | hispanic),
            race_summary = ifelse(n() > 1, "multi", race),
            race_all = paste(race, collapse=", "),
            race = list(race))

     id hispanic urm   race_summary race_all     race     
  <dbl> <lgl>    <lgl> <chr>        <chr>        <list>   
1     1 FALSE    TRUE  multi        asian, black <chr [2]>
2     2 TRUE     TRUE  white        white        <chr [1]>
3     3 FALSE    FALSE white        white        <chr [1]>

BobMuenchen · December 26, 2019, 11:08am

See also the excellent vignette in the Multiple Response Categorical Variables, MRCV, package.
