Help with Collapsing and Reshaping Data Frame

Hi All,

Apologies if this has already been covered (and I'm sure it has) I just cannot find the right answer despite extensive searching. I have a dataset with about 15,000 patients. Each patient has a unique ID , the meds they are on are recorded at a particular time post-transplant (e.g Initial, 3 Mth, 1 Yr, 3Yr) on individual rows. Each patient is usually on three different immunosuppressant medications, each medication is represented by a letter (see below). I am trying to group the initial immunosuppression regimens into one variable so that differences in a different outcome (not shown) can be examined between those on particular immunosuppression combinations.

transplantid drugcode monthcode
1 76 P Initial
2 76 V Initial
3 76 X Initial
4 76 P 3 Mth
5 76 V 3 Mth
6 76 X 3 Mth
7 77. P Initial

I am trying to collapse and reshape the data to have a single row for each patient according to their unique ID that includes their immunosuppression in one variable, this should be the first recorded observations post-transplant i.e. initial but not all patients had their immunosuppression recorded initially. For some the first recording was at 3 Mth or 1 Yr, monthcode is a factor variable.

Ideal Output would look something like

transplantid drugcodes monthcode
1 1 PVX Initial
2 2 VXU Initial
3 3 XPV Initial
4 4 PVX Initial
5 5 VCP Initial
6 6 XCP Initial

Many thanks in advance for any help or helpful posts people may know of

Ryan

You should provide a reproducible example so that it is easier to provide you with an answer.

But anyway, I assume that it looks like this with the first column being a row number?

library(tidyverse)
df <- tibble::tribble(
  ~row, ~transplantid, ~drugcode, ~monthcode,
    1L,           76L,       "P",  "Initial",
    2L,           76L,       "V",  "Initial",
    3L,           76L,       "X",  "Initial",
    4L,           76L,       "P",    "3 Mth",
    5L,           76L,       "V",    "3 Mth",
    6L,           76L,       "X",    "3 Mth",
    7L,           77L,       "P",  "Initial"
  )

But then how do you get from here to the second bit? Where does VXU come from?

Sorry I gave a bad example, letters were innacurate as I had freetexted the desired output.

Actual tibble of first 30 rows below.

the first twelve rows belong to same patient - desired single row for that patient would be "PVX". Second patient would be "EPV". Drug regimens can change throughout the post-transplant course so wanted to get the earliest observations.

> print(as_tibble(DrugDose), n = 30), 

# A tibble: 274,691 x 3
   transplantid drugcode monthcode
          <int> <fct>    <fct>    
 1       768865 P        Initial  
 2       768865 V        Initial  
 3       768865 X        Initial  
 4       768865 P        3 Mth    
 5       768865 V        3 Mth    
 6       768865 X        3 Mth    
 7       768865 P        1 Yr     
 8       768865 V        1 Yr     
 9       768865 X        1 Yr     
10       768865 P        2 Yr     
11       768865 V        2 Yr     
12       768865 X        2 Yr     
13      1033869 E        Initial  
14      1033869 P        Initial  
15      1033869 V        Initial  
16      1033869 E        3 Mth    
17      1033869 P        3 Mth    
18      1033869 V        3 Mth    
19      1033869 E        1 Yr     
20      1033869 P        1 Yr     
21      1033869 V        1 Yr     
22      1033869 E        2 Yr     
23      1033869 P        2 Yr     
24      1033869 V        2 Yr     
25      1033869 E        3 Yr     
26      1033869 P        3 Yr     
27      1033869 V        3 Yr     
28      1033869 E        7 Yr     
29      1033869 H        7 Yr     
30      1033869 P        7 Yr     
# … with 274,661 more rows```

Is this what you mean?

library(dplyr)

DrugDose <- data.frame(
  stringsAsFactors = FALSE,
      transplantid = c(768865,768865,768865,768865,
                       768865,768865,768865,768865,768865,768865,768865,
                       768865,1033869,1033869,1033869,1033869,1033869,
                       1033869,1033869,1033869,1033869,1033869,1033869,1033869,
                       1033869,1033869,1033869,1033869,1033869,1033869),
          drugcode = c("P","V","X","P","V","X",
                       "P","V","X","P","V","X","E","P","V","E","P","V",
                       "E","P","V","E","P","V","E","P","V","E","H",
                       "P"),
         monthcode = c("Initial","Initial","Initial",
                       "3 Mth","3 Mth","3 Mth","1 Yr","1 Yr","1 Yr",
                       "2 Yr","2 Yr","2 Yr","Initial","Initial","Initial",
                       "3 Mth","3 Mth","3 Mth","1 Yr","1 Yr","1 Yr","2 Yr",
                       "2 Yr","2 Yr","3 Yr","3 Yr","3 Yr","7 Yr","7 Yr",
                       "7 Yr")
)

DrugDose %>% 
    group_by(transplantid) %>%
    filter(monthcode == first(monthcode)) %>% 
    summarise(drugcode = paste0(drugcode, collapse = ""),
              monthcode = first(monthcode))
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 2 x 3
#>   transplantid drugcode monthcode
#>          <dbl> <chr>    <chr>    
#> 1       768865 PVX      Initial  
#> 2      1033869 EPV      Initial

Created on 2020-11-24 by the reprex package (v0.3.0.9001)

Note: Next time please provide a proper REPRoducible EXample (reprex) illustrating your issue.

Thanks Andre that code worked perfectly!

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.