summarizing a table by treatment types

Hey guys,

I'm new to R and have the following problem:
I have a data frame with columns that look like this:
Patient IDs: a list of patients (20,000) patients
Facility ID: xx2,yy2,zcc2...etc (hundreds of facilities)
Treatment: 1,2,3 (only 3 kinds of treatment)

Some facilities perform multiple kinds of treatments on different patients, while others performed only one kind of treatment, I want to to generate a vector or list that contains facilities that have delivered treatment 1 only, facilities that delivered treatment 2 only, facilities that delivered both treatment 1,2..etc

basically I'm trying to find out if the type of treatment a facility offers influences the type of treatment a person would get at a certain facility

how would i go about doing that ? Thank you

Hi, welcome!

Could you ask this with a minimal REPRoducible EXample (reprex)? A reprex makes it much easier for others to understand your issue and figure out how to help.

If you've never heard of a reprex before, you might want to start by reading this FAQ:

2 Likes

Sorry about that. Hopefully this will suffice:

df<-data.frame(PatientID=c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20),facilityID=c("z","x","y","w","w","x","y","z","z","x","y","w", "y","x","x","w", "y","y","y","y"),Treatment=c(1,2,2,1,1,2,3,2,3,1,2,3,2,1,2,3,2,2,2,2))

Basically in this example facility y can deliver treatment 2 only, facility z can deliver treatments 1,2,3, facility x can deliver treatments 1,2 only, facility w can deliver treatments 1,3 only. I basically want to create a new column that states whether the patient was treated a facility that can deliver treatment 1 only, treatment 2 only, treatment 3 only, or can deliver certain combinations of treatments 1,2,3

Thank you all!

I'm not sure if I understand you correctly, is this what you mean?

# Sample data
df <-data.frame(PatientID=c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20),
               facilityID=c("z","x","y","w","w","x","y","z","z","x","y","w",
                            "y","x","x","w", "y","y","y","y"),
               Treatment=c(1,2,2,1,1,2,3,2,3,1,2,3,2,1,2,3,2,2,2,2))

# Library calls
library(dplyr)

# Relevant code
df %>% 
    group_by(facilityID) %>% 
    distinct(Treatment) %>% 
    summarise(Available_Treatment = paste(Treatment, collapse = ",")) %>% 
    right_join(df, by = "facilityID")
#> # A tibble: 20 x 4
#>    facilityID Available_Treatment PatientID Treatment
#>    <fct>      <chr>                   <dbl>     <dbl>
#>  1 z          1,2,3                       1         1
#>  2 x          2,1                         2         2
#>  3 y          2,3                         3         2
#>  4 w          1,3                         4         1
#>  5 w          1,3                         5         1
#>  6 x          2,1                         6         2
#>  7 y          2,3                         7         3
#>  8 z          1,2,3                       8         2
#>  9 z          1,2,3                       9         3
#> 10 x          2,1                        10         1
#> 11 y          2,3                        11         2
#> 12 w          1,3                        12         3
#> 13 y          2,3                        13         2
#> 14 x          2,1                        14         1
#> 15 x          2,1                        15         2
#> 16 w          1,3                        16         3
#> 17 y          2,3                        17         2
#> 18 y          2,3                        18         2
#> 19 y          2,3                        19         2
#> 20 y          2,3                        20         2

Created on 2019-06-03 by the reprex package (v0.3.0)

If this is not what you are looking for, then please provide a sample of your desired output.

2 Likes

This looks great. The only additional thing I want to do is to create a new column and attach it to the dataframe that reflects whether the patient was treated at a facility that offers treatment 1 only, treatment 2 only, treatment 3 only, or can deliver certain combinations of treatments 1,2,3.

In other words, the new column can have 1 of 6 values (e.g. 001,010,100, 011,101,110,111- which correspond to treatment 1 alone, treatment 2 alone, treatment 3 alone, treatments 2 and 3, treatments 1 and 3, treatments 1 and 2 and finally treatments 1,2,3)

Thank you!

This is one way to do it, although, there are many other ways.

df <-data.frame(PatientID=c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20),
                facilityID=c("z","x","y","w","w","x","y","z","z","x","y","w",
                             "y","x","x","w", "y","y","y","y"),
                Treatment=c(1,2,2,1,1,2,3,2,3,1,2,3,2,1,2,3,2,2,2,2))

library(tidyverse)

df %>%
    group_by(facilityID) %>% 
    distinct(Treatment) %>% 
    spread(Treatment, Treatment, sep = "_") %>% 
    mutate_all(~ifelse(is.na(.), 0, 1)) %>% 
    transmute(new_col = paste0(Treatment_1, Treatment_2, Treatment_3)) %>% 
    right_join(df, by = "facilityID")
#> `mutate_all()` ignored the following grouping variables:
#> Column `facilityID`
#> Use `mutate_at(df, vars(-group_cols()), myoperation)` to silence the message.
#> # A tibble: 20 x 4
#> # Groups:   facilityID [4]
#>    facilityID new_col PatientID Treatment
#>    <fct>      <chr>       <dbl>     <dbl>
#>  1 z          111             1         1
#>  2 x          110             2         2
#>  3 y          011             3         2
#>  4 w          101             4         1
#>  5 w          101             5         1
#>  6 x          110             6         2
#>  7 y          011             7         3
#>  8 z          111             8         2
#>  9 z          111             9         3
#> 10 x          110            10         1
#> 11 y          011            11         2
#> 12 w          101            12         3
#> 13 y          011            13         2
#> 14 x          110            14         1
#> 15 x          110            15         2
#> 16 w          101            16         3
#> 17 y          011            17         2
#> 18 y          011            18         2
#> 19 y          011            19         2
#> 20 y          011            20         2
1 Like

Thank you so much, this looks great! However, when I tried to apply it to my dataset here's what happened

For your reference, these are the variable names I'm actually using:
df --->mutate1
treatment --->Z_TXT
facility ID ---> PUF_FACILITY_ID

I used this code:

library(tidyverse)
mutate1 %>%
    group_by(PUF_FACILITY_ID) %>%
    distinct(Z_TXT)
spread(Z_TXT, Z_TXT, sep="_") %>%
    mutate_all(~ifelse(is.na(.),0,1)) %>%
    transmute(Avail_TXT_F=paste0(Z_TXT_1, Z_TXT_2,Z_TXT_3)) %>%
    right_join(mutate1, by="PUF_FACILITY_ID")

Here is the output I'm getting:

> mutate1 %>%
+     group_by(PUF_FACILITY_ID) %>%
+     distinct(Z_TXT)
# A tibble: 470 x 2
# Groups:   PUF_FACILITY_ID [339]
   Z_TXT PUF_FACILITY_ID
   <int> <fct>          
 1     2 ULWONRBEXR     
 2     3 LVMXOQOZBB     
 3     1 OVRRTAMUER     
 4     1 RGHKVWHTVZ     
 5     3 CYMEVTQFDL     
 6     1 HAIFFICRAQ     
 7     1 TNKUIMHWRA     
 8     3 VTLOZOSOCH     
 9     1 SXAXVRHPLQ     
10     1 DYDZBPNRFA     
# ... with 460 more rows
> spread(Z_TXT, Z_TXT, sep="_") %>%
+     mutate_all(~ifelse(is.na(.),0,1)) %>%
+     transmute(Avail_TXT_F=paste0(Z_TXT_1, Z_TXT_2,Z_TXT_3)) %>%
+     right_join(mutate1, by="PUF_FACILITY_ID")
Error in spread(Z_TXT, Z_TXT, sep = "_") : object 'Z_TXT' not found

I tried using $ as well to assign the column to mutate1 and didn't work.

Any ideas?

You are missing a pipe operator %>% after distinct(Z_TXT)

1 Like

Oh thanks! It worked now, however, the variable wasn't added to the dataframe for some reason


> mutate1 %>%
+     group_by(PUF_FACILITY_ID) %>%
+     distinct(Z_TXT) %>%
+     spread(Z_TXT, Z_TXT, sep="_") %>%
+     mutate_all(~ifelse(is.na(.),0,1)) %>%
+     transmute(Avail_TXT_F=paste0(Z_TXT_1, Z_TXT_2,Z_TXT_3)) %>%
+     right_join(mutate1, by="PUF_FACILITY_ID")
`mutate_all()` ignored the following grouping variables:
Column `PUF_FACILITY_ID`
Use `mutate_at(df, vars(-group_cols()), myoperation)` to silence the message.
# A tibble: 13,086 x 221
# Groups:   PUF_FACILITY_ID [339]
   PUF_FACILITY_ID Avail_TXT_F   AGE ANALYTIC_STAGE_~ BEHAVIOR CDCC_TOTAL CLASS_OF_CASE CROWFLY CS_EXTENSION CS_METS_AT_DX
   <fct>           <chr>       <int>            <int>    <int>      <int>         <int> <fct>          <int>         <int>
 1 ULWONRBEXR      111            52                1        3          0            20 194.0            220             0
 2 LVMXOQOZBB      101            45                2        3          0            10 21.0             440             0
 3 OVRRTAMUER      101            62                9        3          0            20 55.2             999             0
 4 RGHKVWHTVZ      101            55                1        3          0            10 20.6             300             0
 5 RGHKVWHTVZ      101            49                2        3          0            20 80.6             440             0
 6 ULWONRBEXR      111            55                2        3          0            20 515.0            420             0
 7 LVMXOQOZBB      101            69                3        3          0            14 22.6             405             0
 8 CYMEVTQFDL      101            68                2        3          0            10 94.4             440             0
 9 HAIFFICRAQ      101            65                2        3          0            12 30.7             255             0
10 TNKUIMHWRA      101            72                1        3          0            22 43.2             205             0
# ... with 13,076 more rows, and 211 more variables: CS_METS_DX_BONE <int>, CS_METS_DX_BRAIN <int>, CS_METS_DX_LIVER <int>,
#   CS_METS_DX_LUNG <int>, CS_METS_EVAL <int>, CS_SITESPECIFIC_FACTOR_1 <int>, CS_SITESPECIFIC_FACTOR_10 <fct>,
#   CS_SITESPECIFIC_FACTOR_11 <fct>, CS_SITESPECIFIC_FACTOR_12 <fct>, CS_SITESPECIFIC_FACTOR_13 <fct>,
#   CS_SITESPECIFIC_FACTOR_14 <fct>, CS_SITESPECIFIC_FACTOR_15 <fct>, CS_SITESPECIFIC_FACTOR_16 <fct>,
#   CS_SITESPECIFIC_FACTOR_17 <fct>, CS_SITESPECIFIC_FACTOR_18 <fct>, CS_SITESPECIFIC_FACTOR_19 <fct>,
#   CS_SITESPECIFIC_FACTOR_2 <int>, CS_SITESPECIFIC_FACTOR_20 <fct>, CS_SITESPECIFIC_FACTOR_21 <fct>,
#   CS_SITESPECIFIC_FACTOR_22 <fct>, CS_SITESPECIFIC_FACTOR_23 <fct>, CS_SITESPECIFIC_FACTOR_24 <fct>,
#   CS_SITESPECIFIC_FACTOR_25 <fct>, CS_SITESPECIFIC_FACTOR_3 <int>, CS_SITESPECIFIC_FACTOR_4 <int>,
#   CS_SITESPECIFIC_FACTOR_5 <int>, CS_SITESPECIFIC_FACTOR_6 <int>, CS_SITESPECIFIC_FACTOR_7 <fct>,
#   CS_SITESPECIFIC_FACTOR_8 <fct>, CS_SITESPECIFIC_FACTOR_9 <fct>, CS_TUMOR_SIZEEXT_EVAL <int>, CS_VERSION_LATEST <fct>,
#   DIAGNOSTIC_CONFIRMATION <int>, DX_CHEMO_STARTED_DAYS <fct>, DX_DEFSURG_STARTED_DAYS <fct>, DX_HORMONE_STARTED_DAYS <fct>,
#   DX_IMMUNO_STARTED_DAYS <fct>, DX_LASTCONTACT_DEATH_MONTHS <fct>, DX_OTHER_STARTED_DAYS <fct>, DX_RAD_STARTED_DAYS <fct>,
#   DX_RX_STARTED_DAYS <fct>, DX_STAGING_PROC_DAYS <fct>, DX_SURG_STARTED_DAYS <fct>, DX_SYSTEMIC_STARTED_DAYS <fct>,
#   Dist_Mets <int>, FACILITY_LOCATION_CD <fct>, FACILITY_TYPE_CD <fct>, GRADE <int>, HISTOLOGY <int>, INSURANCE_STATUS <int>,
#   LATERALITY <int>, LYMPH_VASCULAR_INVASION <int>, MED_INC_QUAR_00 <fct>, MED_INC_QUAR_12 <fct>, M_ACADMC <int>,
#   M_DIST_60 <int>, M_EOE_C_T_6 <int>, M_EOE_C_T_7 <int>, M_EOE_Max <int>, M_HS <int>, M_INSURANCE <int>, M_MED_INC <int>,
#   M_RACE_W <int>, M_RURAL_250 <int>, M_SZ_C_T_6 <int>, NO_HSD_QUAR_00 <fct>, NO_HSD_QUAR_12 <fct>, O_SZ_COMS_OLD <int>,
#   O_SZ_COMS_new <int>, O_SZ_C_T_6 <int>, O_SZ_C_T_7 <int>, O_SZ_MAX_new_stage <int>, O_SZ_MAX_old_stage <int>,
#   O_SZ_P_T_6 <int>, O_SZ_P_T_7 <int>, PALLIATIVE_CARE <int>, PALLIATIVE_CARE_HOSP <int>, PRIMARY_SITE <fct>,
#   PUF_30_DAY_MORT_CD <fct>, PUF_90_DAY_MORT_CD <fct>, PUF_CASE_ID <fct>, PUF_MULT_SOURCE <int>, PUF_VITAL_STATUS <fct>,
#   RACE <int>, RAD_BOOST_DOSE_CGY <int>, RAD_BOOST_RX_MODALITY <int>, RAD_ELAPSED_RX_DAYS <int>, RAD_LOCATION_OF_RX <int>,
#   RAD_NUM_TREAT_VOL <fct>, RAD_REGIONAL_DOSE_CGY <int>, RAD_REGIONAL_RX_MODALITY <int>, RAD_TREAT_VOL <int>,
#   READM_HOSP_30_DAYS <int>, REASON_FOR_NO_RADIATION <int>, REASON_FOR_NO_SURGERY <int>, REFERENCE_DATE_FLAG <int>,
#   REGIONAL_NODES_EXAMINED <int>, REGIONAL_NODES_POSITIVE <int>, RX_HOSP_CHEMO <int>, RX_HOSP_DXSTG_PROC <int>, ...

I tried to run the following operation, and it gave me an error message

with(mutate1, table(Z_TXT, Z_SZ_class, Avail_TXT_F))
Error in table(Z_TXT, Z_SZ_class, Avail_TXT_F) : 
  object 'Avail_TXT_F' not found

tidyverse tools don't perform in-place modifications, you have to explicitly assign them to the same or another variable name with the assign operator <-

mutate1 <- mutate1 %>%
# the rest of the code
1 Like

Thank you sir! You're a life saver

If your question's been answered (even by you!), would you mind choosing a solution? It helps other people see which questions still need help, or find solutions if they have similar problems. Here’s how to do it:

1 Like

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.