Mutate and replace strings to new column

dplyr
stringr

#1

I'm trying to mutate a column with values of Gleason grades for prostate cancer (e.g. 3+3, 3+4) into a system called Gleason Grade Group whose format is only one number (1,2,3 etc.).

The code below runs, and in the output I can see the "new_col" variable, but when I glimpse() or try to view the df, it's not there. Ideally I'd like to do this for all values in the column using a vector for all combinations (e.g. 3+3, 3+4, 4+3 into 1, 2, 3).

What am I doing wrong? I'm übernew to R, 2-3 weeks experience.

test %>% 
  mutate(new_col = str_replace(old_col, "3+3", "1"))

#2

Hi! Welcome!

If that’s the exact code you’re running, then I think the problem is that you haven’t actually assigned a new name to the mutated data frame. This is a really common point of confusion for people new to R! You get a display of the result in the console because objects print to the console by default (e.g., if you type test in the console, then you’ll see the test data frame printed in the console output).

Try:

test2 <- test %>% 
  mutate(new_col = str_replace(old_col, "3+3", "1"))

This will not print anything to the console (because you did something with the mutated data frame — assigned it to a new name — so the default print action did not kick in), but you should then be able to ask for test2 and see the mutated results.
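As a minimal sketch of the difference (the toy data frame and the use of str_to_upper() are stand-ins for illustration, not part of the original question):

```r
library(dplyr)
library(stringr)

# Hypothetical toy data standing in for the real `test` data frame
test <- data.frame(old_col = c("a", "b"), stringsAsFactors = FALSE)

# Without assignment: the result is printed to the console and then discarded,
# so `test` itself still has only one column
test %>% mutate(new_col = str_to_upper(old_col))
names(test)

# With assignment: nothing is printed, but the result is stored as `test2`
test2 <- test %>% mutate(new_col = str_to_upper(old_col))
names(test2)
```

Assigning back to the same name (test <- test %>% ...) also works if you don't need to keep the original.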

If you need some pointers for good resources for getting up to speed with R, we’ve got a great thread for that:

And keep asking questions here! :grin:


#3

Hi
Thank you for the swift reply! This DOES (I guess obviously? :slight_smile:) create a new column; however, it doesn't actually replace the values. I guess I'm using str_replace wrong, or perhaps I should be doing something entirely different?

Exact code I'm running:

test <- mrgb_trus %>% 
  mutate(MRGG = str_replace(MRGB_gleason, "4+3", "1"))

Output from the "old" column:

> test$MRGB_gleason
 [1] "3+4" "3+3" NA    "3+4" NA    "4+3" "4+4" "4+3" "4+4" "5+4" "4+3" "4+3" "3+4" "4+3"
[15] "4"   NA    "4+3" NA    NA    "3+4" "4+5" NA    "3+4" NA    NA    "3+4" NA    "3+4"
[29] "3+4" "3+4" "3+3" "3"   NA    "3+3" "3+3" NA    "4+5" NA    "3+3" "3+4" "4+4" "3+4"
[43] "4+4" "3+3" "3+4" "3+4" NA    "4+3" "4+3" "3+3" "3+3" "3+4"

Output from the new column:

[1] "3+4" "3+3" NA    "3+4" NA    "4+3" "4+4" "4+3" "4+4" "5+4" "4+3" "4+3" "3+4" "4+3"
[15] "4"   NA    "4+3" NA    NA    "3+4" "4+5" NA    "3+4" NA    NA    "3+4" NA    "3+4"
[29] "3+4" "3+4" "3+3" "3"   NA    "3+3" "3+3" NA    "4+5" NA    "3+3" "3+4" "4+4" "3+4"
[43] "4+4" "3+3" "3+4" "3+4" NA    "4+3" "4+3" "3+3" "3+3" "3+4"

#4

AAAND I figured it out... I had to escape the "+". Thanks for the resource link btw, already halfway through Garrett's book!

test <- mrgb_trus %>% 
  mutate(MRGG = str_replace(MRGB_gleason, "3\\+3", "1"))
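
For anyone who finds this later, here is a small sketch of why the escape matters. stringr patterns are regular expressions by default, where "+" means "one or more of the previous character". The input vector x is made up for illustration; fixed() is a stringr helper that treats the pattern as literal text instead.

```r
library(stringr)

# Made-up inputs: a real Gleason score and a string of repeated 3s
x <- c("3+3", "333")

# Unescaped: "3+3" is read as "one or more 3s followed by a 3",
# so it matches "333" but NOT the literal text "3+3"
unescaped <- str_replace(x, "3+3", "1")
unescaped
#> [1] "3+3" "1"

# Escaped: "\\+" matches a literal plus sign
escaped <- str_replace(x, "3\\+3", "1")
escaped
#> [1] "1"   "333"

# fixed() compares the pattern as plain text, so no escaping is needed
literal <- str_replace(x, fixed("3+3"), "1")
literal
#> [1] "1"   "333"
```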


#5

If your question's been answered (even if by you), would you mind choosing a solution? (See FAQ below for how).

Having questions checked as resolved makes it a bit easier to navigate the site visually and see which threads still need help.

Thanks


#6

Edit: See jcblum's post for a better solution

Final solution:

mrgb_trus <- mrgb_trus %>% 
  mutate(MRGGG = str_replace_all(MRGB_gleason, c("3\\+3" = "1", "3\\+4" = "2", 
                                                 "4\\+3" = "3", "4\\+4" = "4", 
                                                 "4\\+5" = "5", "5\\+4" = "5", 
                                                 "5\\+5" = "5")))

I also had some values that didn't match the pattern. I managed to handle those with:

 mutate(MRGGG = replace(MRGGG, is.na(MRGGG), 0)) %>% 
 mutate(MRGGG = replace(MRGGG, MRGB_gleason == "3" | MRGB_gleason == "4", "1")) 

#7

Glad you worked out a solution! Here are a few alternative ideas that might be a bit more streamlined. When recoding variables like this, I personally strongly favor maximizing readability and future maintainability — I don't want it to be a mystery to future-me (or anybody else) where and how the data coding decisions are made.

Set up test data frame

library(tidyverse)

mrgb_trus <- data.frame(
  MRGB_gleason = c("3+4", "4", "3+4", "4+4", "3+3",NA, "3+4", "3+3", NA, "4+3", 
    "3+3", "3+4", "3+4", NA, "3", "3+4", NA, NA, NA, NA, "4+3", "3+4", "3+3", 
    "4+3", "4+4", "4+5", "3+3", "4+3", "4+3", NA, NA, "3+3", "4+4", "3+4", "4+5", 
    "3+3", "5+4", NA, NA, "3+4", "4+3", NA, "3+3", "4+3", "3+4", "3+4", "3+4", NA, 
    "4+4", "4+3", "3+4", "3+4"), 
  stringsAsFactors = FALSE)

Option 1: case_when()

mrgb_trus_case_when <- mrgb_trus %>% 
  mutate(
    MRGGG = case_when(
      is.na(MRGB_gleason) ~ "0",
      MRGB_gleason == "3" ~ "1",
      MRGB_gleason == "4" ~ "1",
      MRGB_gleason == "3+3" ~ "1",
      MRGB_gleason == "3+4" ~ "2",
      MRGB_gleason == "4+3" ~ "3",
      MRGB_gleason == "4+4" ~ "4",
      MRGB_gleason == "4+5" ~ "5",
      MRGB_gleason == "5+4" ~ "5",
      MRGB_gleason == "5+5" ~ "5"
    )
  )

Option 2: Join with a lookup table

To maximize maintainability, you could store your lookup table as a CSV (reading it in as needed). That way nobody has to go digging around inside the code to add translations, and the CSV itself can be stored along with other project metadata.

mrgb_lookup <- tribble(
  ~ gleas_score, ~ gleas_grd_grp,
    NA,            "0",
    "3",           "1",
    "4",           "1",
    "3+3",         "1",
    "3+4",         "2",
    "4+3",         "3",
    "4+4",         "4",
    "4+5",         "5",
    "5+4",         "5",
    "5+5",         "5"
)

mrgb_trus_inner_join <- mrgb_trus %>% 
  inner_join(mrgb_lookup, by = c("MRGB_gleason" = "gleas_score")) %>% 
  rename("MRGGG" = "gleas_grd_grp")   # new col will bring along name from lookup table
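
A rough sketch of the CSV variant mentioned above. The file contents and the names lookup_path, lookup_from_csv, scores and recoded are hypothetical; the file is written to a temporary path only so the example runs on its own. Note that this sketch swaps in left_join() for inner_join(), so scores missing from the lookup are kept with an NA grade group rather than being dropped.

```r
library(dplyr)
library(readr)

# Hypothetical lookup file; in a real project this would be a CSV kept
# alongside the other project metadata rather than a temp file
lookup_path <- tempfile(fileext = ".csv")
writeLines(
  c("gleas_score,gleas_grd_grp",
    "3+3,1",
    "3+4,2",
    "4+3,3"),
  lookup_path
)

# Read every column as character so "3+3" and "1" stay strings
lookup_from_csv <- read_csv(
  lookup_path,
  col_types = cols(.default = col_character())
)

# Made-up scores, including one ("5+5") that is absent from the lookup
scores <- data.frame(
  MRGB_gleason = c("3+4", "3+3", "4+3", "5+5"),
  stringsAsFactors = FALSE
)

# left_join() keeps every row of `scores`; unmatched scores get NA
recoded <- scores %>%
  left_join(lookup_from_csv, by = c("MRGB_gleason" = "gleas_score")) %>%
  rename(MRGGG = gleas_grd_grp)

recoded$MRGGG
#> [1] "2" "1" "3" NA
```

Rows that come back as NA are a quick way to spot scores that still need an entry in the lookup file.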

Both of these methods produce the same results as your solution:

mrgb_trus_3step <- mrgb_trus %>% 
  mutate(
    MRGGG = str_replace_all(
      MRGB_gleason, 
      c("3\\+3" = "1", "3\\+4" = "2", 
        "4\\+3" = "3", "4\\+4" = "4", 
        "4\\+5" = "5", "5\\+4" = "5", 
        "5\\+5" = "5")
    ),
    MRGGG = replace(MRGGG, is.na(MRGGG), 0),
    MRGGG = replace(MRGGG, MRGB_gleason == "3" | MRGB_gleason == "4", "1")
  )

identical(
  mrgb_trus_3step$MRGGG, 
  mrgb_trus_case_when$MRGGG
)
#> [1] TRUE

identical(
  mrgb_trus_3step$MRGGG, 
  mrgb_trus_inner_join$MRGGG
)
#> [1] TRUE

Notes:

  • As seen above, you can put all your mutate steps in a single call to mutate() — it applies the changes sequentially, so later steps in a single call get the updated values from earlier steps.
  • Eventually, you probably want to convert your Gleason Grade Group values into an ordered factor.
  • You might be interested in the questionr package. It has some really neat interactive RStudio add-ins that help you build variable recoding code — see the vignette here: https://juba.github.io/questionr/articles/recoding_addins.html
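
On the ordered factor note, a minimal sketch of the conversion. The data frame df is made up, and treating "0" (the placeholder used above for missing scores) as the lowest level is an assumption; you may prefer to leave those as NA.

```r
library(dplyr)

# Made-up grade group values, stored as character strings as in the code above
df <- data.frame(MRGGG = c("2", "1", "5", "0", "3"), stringsAsFactors = FALSE)

# Declare all levels explicitly so the ordering does not depend on which
# values happen to be present in the data
df <- df %>%
  mutate(
    MRGGG = factor(MRGGG, levels = c("0", "1", "2", "3", "4", "5"), ordered = TRUE)
  )

# Ordered factors support comparisons against a level
df$MRGGG >= "3"
#> [1] FALSE FALSE  TRUE FALSE  TRUE
```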

#8

Thank you, again!
I came across case_when() while trying to figure out another problem and changed the code for this problem to use it just last night. I think I'll be able to solve most of my recoding with case_when(). However, the lookup table does seem more elegant... especially in combination with your solution for this problem.