Multicolumn melt challenge

joescho · January 18, 2020, 6:18pm

Hi there.
(First post, first time with reprex & dput - hope I've done right. If not, code is at https://gist.github.com/joeschofield0/99669d0850539e3dd2333c844813bb63)

I'm trying to melt a df that contains data (drug name, dose, date etc) on multiple treatment episodes per patient. I would like to end up with one row per treatment episode per patient. I am able to melt down based on one id.var, but struggling to get the measure.vars of interest. The R script below generates a sample df and shows where I've got to with melt.

Here is an example of the structure I am aiming for:

structure(list(patient = c(1L, 1L, 2L, 2L, 3L, 3L), gender = structure(c(2L, 
2L, 1L, 1L, 1L, 1L), .Label = c("F", "M"), class = "factor"), 
    age = c(39L, 39L, 29L, 29L, 59L, 59L), treatment = c(1L, 
    2L, 1L, 2L, 1L, 2L), drug = structure(c(1L, 3L, 1L, 2L, 1L, 
    3L), .Label = c("A", "B", "D"), class = "factor"), dose = c(83L, 
    19L, 84L, 111L, 38L, 55L), dosecat = structure(c(3L, 3L, 
    2L, 3L, 1L, 2L), .Label = c("high", "low", "med"), class = "factor"), 
    date = structure(c(3L, 1L, 4L, 2L, 6L, 5L), .Label = c("01/12/2019", 
    "08/01/2020", "09/01/2020", "10/11/2019", "23/11/2019", "29/11/2019"
    ), class = "factor")), class = "data.frame", row.names = c(NA, 
-6L))

I've read the melt man page, checked several multi-column melt posts here and on StackExchange, but struggling to get the desired format output. Grateful for any pointers.

library(reshape2)

set.seed(3)
patient <- c(1:10)
gender <- sample(c("M", "F"), size = 10, replace = TRUE)
age <- sample(20:60, size = 10, replace = TRUE)
drug1 <- sample(c("A", "B", "C"), 10, replace = TRUE)
dose1 <- sample(10:200, size = 10, replace = TRUE)
dosecat1 <- sample(c("low", "med", "high"), size = 10, replace = TRUE)
date1 <- Sys.Date() + sample(-1:-90, 10)
drug2 <- sample(c("B", "D", "E"), 10, replace = TRUE)
dose2 <- sample(10:200, size = 10, replace = TRUE)
dosecat2 <- sample(c("low", "med", "high"), size = 10, replace = TRUE)
date2 <- Sys.Date() + sample(-1:-90, 10)

df <- data.frame(patient, gender, age, drug1, dose1, dosecat1, date1, drug2, dose2, dosecat2, date2)

# -- Melt attempt 1 --
# Warning: Attributes are not identical across variables (drug = chr, dose = int)
# df_long <- melt(df, id.vars = "patient", measure.vars = c("drug1", "dose1", "drug2", "dose2"))

# -- Melt attempt 2 --
# Getting closer, but losing gender, age, dose, date columns
df_long <- melt(df, id.vars = "patient", measure.vars = c("drug1", "drug2"))
#> Warning: attributes are not identical across measure variables; they will be
#> dropped
df_long[order(df_long$patient), ]
#>    patient variable value
#> 1        1    drug1     A
#> 11       1    drug2     D
#> 2        2    drug1     A
#> 12       2    drug2     B
#> 3        3    drug1     A
#> 13       3    drug2     D
#> 4        4    drug1     B
#> 14       4    drug2     B
#> 5        5    drug1     B
#> 15       5    drug2     B
#> 6        6    drug1     A
#> 16       6    drug2     D
#> 7        7    drug1     B
#> 17       7    drug2     B
#> 8        8    drug1     C
#> 18       8    drug2     D
#> 9        9    drug1     B
#> 19       9    drug2     E
#> 10      10    drug1     C
#> 20      10    drug2     E

^{Created on 2020-01-18 by the reprex package (v0.3.0)}

joels · January 18, 2020, 7:15pm

The reshape2 package is no longer under active development. The melt function from reshape2 was replaced a while back with gather from the tidyr package. However, gather can't handle complex reshaping such as your example. For that we need an even newer function, pivot_longer (also from tidyr).

pivot_longer can handle a wide range of data reshaping operations, but that flexibility comes at the cost of a somewhat more complex syntax. See the pivoting vignette for detailed examples.

In the code below, we first use rename_at to add an underscore to the event variable names so that pivot_longer knows where to separate the events (each instance of a drug being administered) from the variable names we want in the pivoted data frame. For example, drug1 gets converted to drug_1, dosecat1 to dosecat_1, etc. Then, pivot_longer takes care of the "melting" to longer format.

For a pivoting example similar to yours, see the Multiple Observations per Row section of the pivoting vignette.

library(tidyverse) # Loads tidyr and several other "tidyverse" packages

df.long = df %>% 
  rename_at(vars(matches("[0-9]$")), ~gsub("([0-9])", "_\\1", .)) %>% 
  pivot_longer(cols=-(patient:age),
               names_sep="_",
               names_to=c(".value", "event"))

The rename_at step can be avoided by using a regular expression called a positive lookahead. In the code below "(?=[0-9])" identifies the zero-length location between the letters and the final number in the event columns (dose1, dose2, date1, date2, etc.) thereby telling pivot_longer how to create the column names in the longer data frame.

df.long = df %>% 
  pivot_longer(cols=-(patient:age),
               names_sep="(?=[0-9])",
               names_to=c(".value", "event"))

Either way, the result is:

df.long

   patient gender   age event drug   dose dosecat date      
     <int> <fct>  <int> <chr> <fct> <int> <fct>   <date>    
 1       1 M         40 1     A        82 low     2019-12-28
 2       1 M         40 2     E       155 high    2020-01-16
 3       2 F         40 1     A        81 high    2019-12-30
 4       2 F         40 2     B       155 med     2019-12-25
 5       3 M         41 1     A        42 low     2019-11-01
 6       3 M         41 2     E       182 med     2019-12-10
 7       4 M         42 1     A        96 med     2019-10-23
 8       4 M         42 2     B       194 low     2019-11-06
 9       5 F         55 1     A        59 med     2019-11-06
10       5 F         55 2     E       108 high    2019-11-04
11       6 F         54 1     C        74 low     2019-10-22
12       6 F         54 2     B       114 high    2019-12-27
13       7 M         24 1     B       179 low     2019-12-09
14       7 M         24 2     E        41 low     2019-12-21
15       8 M         48 1     C        48 low     2019-10-21
16       8 M         48 2     D        41 med     2019-12-23
17       9 F         56 1     B       120 low     2020-01-07
18       9 F         56 2     D       160 med     2020-01-02
19      10 F         31 1     C        49 high    2019-12-26
20      10 F         31 2     B       153 high    2019-11-23

FJCC · January 18, 2020, 7:41pm

Edit: Oops, I didn't see an answer had come in while is was cursing at reprex.

Here is a solution using the tidyr package. I apologize for not using a Reprex but my reprex package seems to have gone insane at the moment.

library(tidyr)
library(stringr)
set.seed(3)
patient <- c(1:10)
gender <- sample(c("M", "F"), size = 10, replace = TRUE)
age <- sample(20:60, size = 10, replace = TRUE)
drug1 <- sample(c("A", "B", "C"), 10, replace = TRUE)
dose1 <- sample(10:200, size = 10, replace = TRUE)
dosecat1 <- sample(c("low", "med", "high"), size = 10, replace = TRUE)
date1 <- Sys.Date() + sample(-1:-90, 10)
drug2 <- sample(c("B", "D", "E"), 10, replace = TRUE)
dose2 <- sample(10:200, size = 10, replace = TRUE)
dosecat2 <- sample(c("low", "med", "high"), size = 10, replace = TRUE)
date2 <- Sys.Date() + sample(-1:-90, 10)

df <- data.frame(patient, gender, age, drug1, dose1, dosecat1, date1, drug2, dose2, dosecat2, date2)

colnames(df) <- str_replace(colnames(df), pattern = "\\d", replacement = "_\\0")
df2 <- pivot_longer(df, cols = matches("\\d$"), 
                      names_to = c(".value", "event"), names_sep = "_")
head(df2)

The output of the last line is

# A tibble: 6 x 8
  patient gender   age event drug   dose dosecat date      
    <int> <fct>  <int> <chr> <fct> <int> <fct>   <date>    
1       1 M         40 1     A        82 low     2019-12-28
2       1 M         40 2     E       155 high    2020-01-16
3       2 F         40 1     A        81 high    2019-12-30
4       2 F         40 2     B       155 med     2019-12-25
5       3 M         41 1     A        42 low     2019-11-01
6       3 M         41 2     E       182 med     2019-12-10

joescho · January 18, 2020, 8:49pm

Thanks v. much joels & FJCC I've checked out the pivot_longer man page and vignettes and this looks ideal, I'm especially taken with the manual spec functionality - a v. powerful Tidy tool.

The Cran project vignette page euphemistically says "the creativity that people apply to their data structures is seemingly endless". Boy, does that apply to some of the Excel-based datasets I get from colleagues.

Thanks again for the info, pointers and examples guys.

system · January 25, 2020, 8:49pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.