TL;DR
chop(), when used in conjunction with other tidyr and dplyr functions, is useful for combining several related logical columns (a common data structure for "check all that apply" survey questions) into a single list-column of character vectors. The resulting data structure can be used with purrr functions to create other useful or interesting variables.
Setup
Consider the following pair of survey questions:
What is your race? Please check as many as apply.
- American Indian or Alaskan Native
- Asian
- Black or African American
- Native Hawaiian or other Pacific Islander
- White or European American
Are you Hispanic, Latino, or Spanish in origin?
- Yes
- No
Data
These two questions result in the following data.
library(tidyverse)
df <- tribble(
  ~id, ~race_amind, ~race_asian, ~race_black, ~race_hiopi, ~race_white, ~hispanic,
  1, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE,
  2, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE,
  3, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE
)
df
#> # A tibble: 3 x 7
#> id race_amind race_asian race_black race_hiopi race_white hispanic
#> <dbl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl>
#> 1 1 FALSE TRUE TRUE FALSE FALSE FALSE
#> 2 2 FALSE FALSE FALSE FALSE TRUE TRUE
#> 3 3 FALSE FALSE FALSE FALSE TRUE FALSE
Pivot longer
The race question data is split across several columns with a race_ prefix. Ideally, we'd want to act on this as a single variable, so let's pivot the data to a longer format.
df_long <- df %>%
  pivot_longer(
    cols = starts_with("race_"),
    names_to = "race",
    names_pattern = "race_?(.*)"
  )
df_long
#> # A tibble: 15 x 4
#> id hispanic race value
#> <dbl> <lgl> <chr> <lgl>
#> 1 1 FALSE amind FALSE
#> 2 1 FALSE asian TRUE
#> 3 1 FALSE black TRUE
#> 4 1 FALSE hiopi FALSE
#> 5 1 FALSE white FALSE
#> 6 2 TRUE amind FALSE
#> 7 2 TRUE asian FALSE
#> 8 2 TRUE black FALSE
#> 9 2 TRUE hiopi FALSE
#> 10 2 TRUE white TRUE
#> 11 3 FALSE amind FALSE
#> 12 3 FALSE asian FALSE
#> 13 3 FALSE black FALSE
#> 14 3 FALSE hiopi FALSE
#> 15 3 FALSE white TRUE
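The names_pattern regex here only strips the shared prefix. As a side note, pivot_longer() also has a names_prefix argument meant for that job. A minimal sketch, rebuilding df so the block runs on its own (df_long_prefix is a name introduced here for illustration):

```r
library(tidyverse)

# Same data as df above, rebuilt so this block is self-contained
df <- tribble(
  ~id, ~race_amind, ~race_asian, ~race_black, ~race_hiopi, ~race_white, ~hispanic,
  1, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE,
  2, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE,
  3, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE
)

# names_prefix removes the matching text from the start of each column name
df_long_prefix <- df %>%
  pivot_longer(
    cols = starts_with("race_"),
    names_to = "race",
    names_prefix = "race_"
  )

# The race labels should match those produced with names_pattern
unique(df_long_prefix$race)
```

Either argument works for this data; names_pattern becomes the better fit once a column name encodes more than one piece of information.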
Filter shorter
Okay, great, but now we have a bunch of FALSEs where respondents didn't check a certain race. Let's filter those out.
df_short <- df_long %>%
  filter(value) %>%
  select(-value)
df_short
#> # A tibble: 4 x 3
#> id hispanic race
#> <dbl> <lgl> <chr>
#> 1 1 FALSE asian
#> 2 1 FALSE black
#> 3 2 TRUE white
#> 4 3 FALSE white
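One edge case worth flagging: filtering on value drops any respondent who checked no boxes at all. That doesn't happen in the data above, so the sketch below uses a hypothetical respondent 4 (and a trimmed-down set of race options) to show one possible way to keep such rows, by chopping first and subsetting each race vector afterwards. The name df_kept is introduced here for illustration.

```r
library(tidyverse)

# Hypothetical long-format data: respondent 4 checked no race at all
df_long <- tribble(
  ~id, ~hispanic, ~race, ~value,
  1, FALSE, "asian", TRUE,
  1, FALSE, "white", FALSE,
  4, TRUE, "asian", FALSE,
  4, TRUE, "white", FALSE
)

# Filtering first: respondent 4 has no TRUE rows and disappears
ids_after_filter <- df_long %>%
  filter(value) %>%
  distinct(id)

# Chopping first keeps one row per respondent; each race vector is then
# subset by its matching logical vector, leaving character(0) for respondent 4
df_kept <- df_long %>%
  chop(c(race, value)) %>%
  mutate(race = map2(race, value, ~ .x[.y])) %>%
  select(-value)

df_kept
```

Whether an empty vector or an explicit NA is the better representation depends on how nonresponse should be treated downstream.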
Chop
The dataframe is a bit shorter and more legible, but each respondent's hispanic value is still repeated once for every race they checked. Here's where tidyr::chop() comes in.
df_chopped <- df_short %>%
  chop(race)
df_chopped
#> # A tibble: 3 x 3
#> id hispanic race
#> <dbl> <lgl> <list>
#> 1 1 FALSE <chr [2]>
#> 2 2 TRUE <chr [1]>
#> 3 3 FALSE <chr [1]>
Here's the crux of my argument. The documentation for chop() and unchop() reads:
Generally, unchopping is more useful than chopping because it simplifies a complex data structure, and nest()ing is usually more appropriate than chop()ing since it better preserves the connections between observations.
I think the sentiment is right here, but list columns of character vectors can be useful too, especially when combined with purrr's flavor of functional programming.
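To make the difference concrete, here is a small sketch comparing the two on the filtered data. It rebuilds df_short by hand so the block is self-contained; nested and chopped are names introduced here for illustration.

```r
library(tidyverse)

# Same rows as df_short above, rebuilt so this block runs on its own
df_short <- tribble(
  ~id, ~hispanic, ~race,
  1, FALSE, "asian",
  1, FALSE, "black",
  2, TRUE, "white",
  3, FALSE, "white"
)

nested <- df_short %>% nest(race = race)  # list-column of one-column tibbles
chopped <- df_short %>% chop(race)        # list-column of character vectors

# With chop(), each element is a plain vector, so vector functions apply directly
map_int(chopped$race, length)

# With nest(), each element is a tibble, so we count rows instead
map_int(nested$race, nrow)

# unchop() reverses chop(), returning one row per checked race
unchop(chopped, race)
```

For a single column of labels, the plain character vectors are a little less ceremony to work with; with several related columns per respondent, nest() keeps them together in a way chop() does not.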
Examples
Let's look at two examples where our chopped dataframe is a useful data structure.
Multi-race
First, let's recode respondents who checked more than one race into a multi-race category.
df_chopped %>%
  mutate(n_race = map_int(race, length)) %>%
  # map_chr(race, 1) pulls the first (here, only) race per respondent,
  # so there is exactly one candidate value per row
  mutate(race = if_else(n_race == 1, map_chr(race, 1), "multi")) %>%
  select(id, hispanic, race)
#> # A tibble: 3 x 3
#> id hispanic race
#> <dbl> <lgl> <chr>
#> 1 1 FALSE multi
#> 2 2 TRUE white
#> 3 3 FALSE white
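The list-column supports other recodings too. As a sketch, assuming we wanted a combined label instead of a catch-all category, each character vector can be pasted into a single string (race_label and df_labelled are names introduced here for illustration):

```r
library(tidyverse)

# Same structure as df_chopped above, rebuilt so this block runs on its own
df_chopped <- tibble(
  id = c(1, 2, 3),
  hispanic = c(FALSE, TRUE, FALSE),
  race = list(c("asian", "black"), "white", "white")
)

# paste(..., collapse = "/") joins each respondent's races into one string
df_labelled <- df_chopped %>%
  mutate(race_label = map_chr(race, paste, collapse = "/"))

df_labelled$race_label
#> [1] "asian/black" "white"       "white"
```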
Underrepresented Minority
Second, let's determine which respondents are underrepresented minorities (URM) based on hispanic and race. We'll define URM individuals as those selecting Hispanic or checking any non-white race.
df_chopped %>%
  mutate(any_non_white = map_lgl(race, ~any(. != "white"))) %>%
  mutate(urm = hispanic | any_non_white)
#> # A tibble: 3 x 5
#> id hispanic race any_non_white urm
#> <dbl> <lgl> <list> <lgl> <lgl>
#> 1 1 FALSE <chr [2]> TRUE TRUE
#> 2 2 TRUE <chr [1]> FALSE TRUE
#> 3 3 FALSE <chr [1]> FALSE FALSE
Wrap-up
I think I've outlined a decent use case for this lesser-known tidyr function. The code above, especially in the examples, expresses well what I'm trying to do. Do you agree? If you would tackle this problem differently, let me know!
Created on 2019-12-20 by the reprex package (v0.3.0)