How to modify categorical data in data set

For my project, I am working with the National Cancer Data Base.

I am trying to clean up a variable for the T stage of a cancer (based on the T,N,M staging guidelines) by combining different stages (ex: "1A
, 1A1, 1A2, ) into simple stages ("1",2,3,4).

How can I perform these changes in my uploaded data set ?

Hi, could you provide any example data?

Hi, yes, here is an example with non-real data based on the data base, which has ~500k cases and 313 columns for each variable:

TNM_CLIN_T (column name: T stage)
blank
cX (unknown)
blank
c4
c3
c2
1A2
c1

Hi Chris, I see your example, but I still haven't figured out what exactly the "modification" you want to do with the data. I guess there're two steps in your ideal process:

  1. extract the rows based on stages named 1A#
  2. simplify the stage names

The following code might be considered (suppose your data stored in df):

library(dplyr)
library(stringr)
df %>% filter(str_detect(TNM_CLIN_T, "^1A[0-9]+")) %>% 
  mutate(TNM_CLIN_T = str_remove(TNM_CLIN_T,"^1A"))

Hi Yifan, thanks for your help.

Yes, I believe I'm working with a data frame.

To clarify, I would like to replace every instance of "1MI", "c1MI", "1", "1A", "1A1", "1A2", "1B1", etc with just "1".

This includes probably 20 total different values, not just those with "1A[0-9]". Please let me know if this makes sense.

Hi Chris, since the replacements for 1 are complex, I suggest you build a lookup dictionary first, which is basically a data.frame, and then use join to match these simplified stages. The look-up dictionary is like:

stage_look_up <- list(
  `1` = c("1MI", "c1MI", "1", "1A", "1A1", "1A2", "1B1"),
  `2` = c("2MI", "c2MI", "2", "2A", "2A1", "2A2", "2B1")
) %>% enframe %>% unnest(value) %>% 
   rename(TNM_CLIN_T = value, stage.simplify = name)
> stage_look_up
# A tibble: 14 x 2
   stage.simplify TNM_CLIN_T
   <chr>          <chr>     
 1 1              1MI       
 2 1              c1MI      
 3 1              1         
 4 1              1A        
 5 1              1A1       
 6 1              1A2       
 7 1              1B1       
 8 2              2MI       
 9 2              c2MI      
10 2              2         
11 2              2A        
12 2              2A1       
13 2              2A2       
14 2              2B1  

then join the dictionary do the lookup:

df %>% left_join(stage_look_up,by = 'TNM_CLIN_T')
# A tibble: 8 x 2
  TNM_CLIN_T stage.simplify
  <chr>      <chr>         
1 blank      NA            
2 cX         NA            
3 blank      NA            
4 c4         NA            
5 c3         NA            
6 c2         NA            
7 1A2        1             
8 c1         NA   

I didn't set those simplified stages for "c1", "c2" ..., so it remains many NA in the result. You may have to finish setting them.