How to modify categorical data in data set

For my project, I am working with the National Cancer Data Base.

I am trying to clean up a variable for the T stage of a cancer (based on the T,N,M staging guidelines) by combining different stages (ex: "1A
, 1A1, 1A2, ) into simple stages ("1",2,3,4).

How can I perform these changes in my uploaded data set ?

Hi, could you provide any example data?

Hi, yes, here is an example with non-real data based on the data base, which has ~500k cases and 313 columns for each variable:

TNM_CLIN_T (column name: T stage)
blank
cX (unknown)
blank
c4
c3
c2
1A2
c1

Hi Chris, I see your example, but I still haven't figured out what exactly the "modification" you want to do with the data. I guess there're two steps in your ideal process:

  1. extract the rows based on stages named 1A#
  2. simplify the stage names

The following code might be considered (suppose your data stored in df):

library(dplyr)
library(stringr)
df %>% filter(str_detect(TNM_CLIN_T, "^1A[0-9]+")) %>% 
  mutate(TNM_CLIN_T = str_remove(TNM_CLIN_T,"^1A"))

Hi Yifan, thanks for your help.

Yes, I believe I'm working with a data frame.

To clarify, I would like to replace every instance of "1MI", "c1MI", "1", "1A", "1A1", "1A2", "1B1", etc with just "1".

This includes probably 20 total different values, not just those with "1A[0-9]". Please let me know if this makes sense.

Hi Chris, since the replacements for 1 are complex, I suggest you build a lookup dictionary first, which is basically a data.frame, and then use join to match these simplified stages. The look-up dictionary is like:

stage_look_up <- list(
  `1` = c("1MI", "c1MI", "1", "1A", "1A1", "1A2", "1B1"),
  `2` = c("2MI", "c2MI", "2", "2A", "2A1", "2A2", "2B1")
) %>% enframe %>% unnest(value) %>% 
   rename(TNM_CLIN_T = value, stage.simplify = name)
> stage_look_up
# A tibble: 14 x 2
   stage.simplify TNM_CLIN_T
   <chr>          <chr>     
 1 1              1MI       
 2 1              c1MI      
 3 1              1         
 4 1              1A        
 5 1              1A1       
 6 1              1A2       
 7 1              1B1       
 8 2              2MI       
 9 2              c2MI      
10 2              2         
11 2              2A        
12 2              2A1       
13 2              2A2       
14 2              2B1  

then join the dictionary do the lookup:

df %>% left_join(stage_look_up,by = 'TNM_CLIN_T')
# A tibble: 8 x 2
  TNM_CLIN_T stage.simplify
  <chr>      <chr>         
1 blank      NA            
2 cX         NA            
3 blank      NA            
4 c4         NA            
5 c3         NA            
6 c2         NA            
7 1A2        1             
8 c1         NA   

I didn't set those simplified stages for "c1", "c2" ..., so it remains many NA in the result. You may have to finish setting them.

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.