pivot_longer with multiple sets of columns

joels · February 20, 2020, 6:12pm

I have wide data with multiple sets of columns that I want to convert to long format. It seems like pivot_longer should be able to handle this, but either I haven't found the right example or I haven't understood the examples I've found. (The help for pivot_longer says it's intended to be simpler to use than gather, but, except where pivot_longer does the same thing as gather (which I find easy to use), I find it difficult to fathom.)

Here's my use case: I have a data frame d that I receive in the following form (reproducible data is at the end of this question):

d

   ID EXT ORG 1           INSTITUTION 1 TCE STATUS 1 EXT ORG 2           INSTITUTION 2 TCE STATUS 2 EXT ORG 3
1 001 C30000191  American River College         <NA> C30000192 Sacramento City College         <NA>      <NA>
2 004 C30000192 Sacramento City College         <NA> C30006210         Folsom Lake Col         <NA>      <NA>
3 003 C30000156       Diablo Valley Col       Posted C30000192 Sacramento City College         <NA>      <NA>
4 002 C30000171       Grossmont College         <NA> C30000191  American River College       Posted C30000192
            INSTITUTION 3 TCE STATUS 3 EXT ORG 10    INSTITUTION 10 TCE STATUS 10
1                    <NA>         <NA>       <NA>              <NA>          <NA>
2                    <NA>         <NA>       <NA>              <NA>          <NA>
3                    <NA>         <NA>       <NA>              <NA>          <NA>
4 Sacramento City College       Posted  C30006268 Woodland Comm Col        Posted

Each row is a unique ID, followed by four sets of three columns whose prefixes are EXT ORG, INSTITUTION, and TCE STATUS. Each set of three columns is uniquely identified by the final number, which is either 1 or 2 digits. (My real data has 10 sets of three columns, but this simplified version covers my real use case.)

I want to covert this to long format, which I'm currently doing as follows:

library(tidyverse)

c(1:3,10) %>%
  map_df(~d %>% 
           select(ID, matches(paste0(.x,"$"))) %>%
           filter_at(vars(matches("EXT ORG")), all_vars(!is.na(.))) %>%
           rename_at(vars(matches("[0-9]{1,2}")), ~ gsub(" [0-9]{1,2}", "", .)) %>%
           mutate(institution.number=.x)) %>% 
  arrange(ID, institution.number)

   ID    `EXT ORG` INSTITUTION             `TCE STATUS` institution.number
   <chr> <chr>     <chr>                   <chr>                     <dbl>
 1 001   C30000191 American River College  <NA>                          1
 2 001   C30000192 Sacramento City College <NA>                          2
 3 002   C30000171 Grossmont College       <NA>                          1
 4 002   C30000191 American River College  Posted                        2
 5 002   C30000192 Sacramento City College Posted                        3
 6 002   C30006268 Woodland Comm Col       Posted                       10
 7 003   C30000156 Diablo Valley Col       Posted                        1
 8 003   C30000192 Sacramento City College <NA>                          2
 9 004   C30000192 Sacramento City College <NA>                          1
10 004   C30006210 Folsom Lake Col         <NA>                          2

I'd like to switch to pivot_longer, which I hope and expect will require less code. I've tried various approaches with pivot_longer, for example,

d %>%
  pivot_longer(cols=`EXT ORG 1`:`TCE STATUS 10`,
               names_to=c("EXT ORG", "INSTITUTION", "TCE STATUS", ".value"),
               names_pattern="(.* ?.*) ([0-9]{1,2}$)")
# Error: `regex` should define 4 groups; 2 found.

none of which worked.

So, my question is, how can I do this with pivot_longer?

Reproducible data:

d = structure(list(ID = c("001", "004", "003", "002"), `EXT ORG 1` = c("C30000191", 
"C30000192", "C30000156", "C30000171"), `INSTITUTION 1` = c("American River College", 
"Sacramento City College", "Diablo Valley Col", "Grossmont College"
), `TCE STATUS 1` = c(NA, NA, "Posted", NA), `EXT ORG 2` = c("C30000192", 
"C30006210", "C30000192", "C30000191"), `INSTITUTION 2` = c("Sacramento City College", 
"Folsom Lake Col", "Sacramento City College", "American River College"
), `TCE STATUS 2` = c(NA, NA, NA, "Posted"), `EXT ORG 3` = c(NA, 
NA, NA, "C30000192"), `INSTITUTION 3` = c(NA, NA, NA, "Sacramento City College"
), `TCE STATUS 3` = c(NA, NA, NA, "Posted"), `EXT ORG 10` = c(NA, 
NA, NA, "C30006268"), `INSTITUTION 10` = c(NA, NA, NA, "Woodland Comm Col"
), `TCE STATUS 10` = c(NA, NA, NA, "Posted")), row.names = c(NA, 
-4L), class = c("tbl_df", "tbl", "data.frame"))

abiyug · February 20, 2020, 6:40pm

Try...

library(tidyverse)

d %>%
  janitor::clean_names() %>% 
  tidyr::pivot_longer(cols = ext_org_1:tce_status_10, 
                      names_to = "var_name", 
                      values_to = "values",    
            values_drop_na = TRUE)

dromano · February 20, 2020, 8:34pm

Since each institution is uniquely determined by the value of ID and the number 'suffix', you can use them for the id_cols argument of pivot_wider() (after pivot_longer() is first applied:

d = structure(
  list(
    ID = c("001", "004", "003", "002"), 
    `EXT ORG 1` = c("C30000191", "C30000192", "C30000156", "C30000171"), 
    `INSTITUTION 1` = c("American River College", "Sacramento City College", "Diablo Valley Col", "Grossmont College"), 
    `TCE STATUS 1` = c(NA, NA, "Posted", NA), `EXT ORG 2` = c("C30000192", "C30006210", "C30000192", "C30000191"), 
    `INSTITUTION 2` = c("Sacramento City College", "Folsom Lake Col", "Sacramento City College", "American River College"), 
    `TCE STATUS 2` = c(NA, NA, NA, "Posted"), 
    `EXT ORG 3` = c(NA, NA, NA, "C30000192"), 
    `INSTITUTION 3` = c(NA, NA, NA, "Sacramento City College"), 
    `TCE STATUS 3` = c(NA, NA, NA, "Posted"), `EXT ORG 10` = c(NA,NA, NA, "C30006268"), 
    `INSTITUTION 10` = c(NA, NA, NA, "Woodland Comm Col"), 
    `TCE STATUS 10` = c(NA, NA, NA, "Posted")), 
  row.names = c(NA, -4L), 
  class = c("tbl_df", "tbl", "data.frame")
)

library(tidyverse)
d %>% 
  pivot_longer(-1) %>% 
  mutate(number = str_extract(name, '[0-9]+')) %>% 
  separate(name, into = 'name', sep = ' [0-9]+') %>% 
  pivot_wider(id_cols = c(ID, number), names_from = name, values_from = value)
#> Warning: Expected 1 pieces. Additional pieces discarded in 48 rows [1, 2,
#> 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
#> # A tibble: 16 x 5
#>    ID    number `EXT ORG` INSTITUTION             `TCE STATUS`
#>    <chr> <chr>  <chr>     <chr>                   <chr>       
#>  1 001   1      C30000191 American River College  <NA>        
#>  2 001   2      C30000192 Sacramento City College <NA>        
#>  3 001   3      <NA>      <NA>                    <NA>        
#>  4 001   10     <NA>      <NA>                    <NA>        
#>  5 004   1      C30000192 Sacramento City College <NA>        
#>  6 004   2      C30006210 Folsom Lake Col         <NA>        
#>  7 004   3      <NA>      <NA>                    <NA>        
#>  8 004   10     <NA>      <NA>                    <NA>        
#>  9 003   1      C30000156 Diablo Valley Col       Posted      
#> 10 003   2      C30000192 Sacramento City College <NA>        
#> 11 003   3      <NA>      <NA>                    <NA>        
#> 12 003   10     <NA>      <NA>                    <NA>        
#> 13 002   1      C30000171 Grossmont College       <NA>        
#> 14 002   2      C30000191 American River College  Posted      
#> 15 002   3      C30000192 Sacramento City College Posted      
#> 16 002   10     C30006268 Woodland Comm Col       Posted

^{Created on 2020-02-20 by the reprex package (v0.3.0)}

Does this help?

joels · February 20, 2020, 8:55pm

Thanks to @abiyug and @dromano for the key insight: pivot to "longest" first, then pivot_wider to the desired final shape. Here's the most succinct code I was able to come up with for that:

d %>% 
  pivot_longer(-ID, values_drop_na=TRUE) %>% 
  separate(name, into=c("name", "num"), sep=" (?=[0-9])") %>% 
  pivot_wider(names_from=name, values_from=value)

I'd still like to know if there's a way to do this directly in a single pivot_longer step, like this:

d %>% pivot_longer(-ID, ...dark magic...)

nirgrahamuk · February 20, 2020, 9:15pm

I doubt theres a way of getting an out of the box 1-liner considering the specifics of your desired final result, which is necessitating your splitting on name and number. unless, you produce your own function, tailored for this pattern...

piv_long_wdark_magic <- function(df,
                                idvar){
  id <- substitute(idvar)
  df %>% 
    pivot_longer(-id, values_drop_na=TRUE) %>% 
    separate(name, into=c("name", "num"), sep=" (?=[0-9])") %>% 
    pivot_wider(names_from=name, values_from=value)
}

piv_long_wdark_magic(d, ID)

abiyug · February 20, 2020, 9:24pm

Yes, if you necessitate splitting name/number, I can not think of another way. I suggest you take a look at the base "reshape" function.

dromano · February 20, 2020, 9:31pm

I just realized you can compress the first two operations into one pivot_longer() like this:

d %>% 
  pivot_longer(
    cols = -ID, 
    names_to = c('name', 'number'), 
    names_pattern = '([^0-9]+) ([0-9]+)'
    )

but that still requires pivot_wider(). It seemed a little terse at first, but is growing on me...

woodward · February 20, 2020, 11:42pm

I'm another one who finds gather and spread a lot more intuitive than the new pivot functions.

joels · March 12, 2020, 11:42pm

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.