Using mutate() in a function with the new colname as an argument

dplyr

#1

Hi
I've written a pipe to clean up some of my data. I'll need to rerun it several times, so I thought I'd write a function to speed things up (speed being relative, as I'm spending 99.9% of my time trying to figure out R).

The pipe works as planned, though it feels a bit messy:

mrgb_trus <- mrgb_trus %>% 
  mutate(MRGGG = str_replace_all(MRGB_gleason, c("3\\+3" = "1", "3\\+4" = "2", 
                                                 "4\\+3" = "3", "4\\+4" = "4", 
                                                 "4\\+5" = "5", "5\\+4" = "5", 
                                                 "5\\+5" = "5"))) %>% 
  mutate(MRGGG = replace(MRGGG, is.na(MRGGG), 0)) %>% 
  mutate(MRGGG = replace(MRGGG, MRGB_gleason == "3" | MRGB_gleason == "4", "1")) %>% 
  mutate(MRGGG = as.numeric(as.character(MRGGG))) %>% 
  mutate(MRGGG = parse_factor(MRGGG, levels = GGG_levels))

However, putting this into a function does not work. There are two problems as far as I can tell:

testfunc <- function(df, old_col, new_col) {
  GGG_levels <- c(0, 1, 2, 3, 4, 5)
  df <- df %>% 
    mutate(new_col = str_replace_all(old_col, c("3\\+3" = "1", "3\\+4" = "2", 
                                                   "4\\+3" = "3", "4\\+4" = "4", 
                                                   "4\\+5" = "5", "5\\+4" = "5", 
                                                   "5\\+5" = "5"))) %>% 
    mutate(new_col = replace(new_col, is.na(new_col), 0)) %>% 
    mutate(new_col = replace(new_col, old_col == "3" | old_col == "4", "1")) %>% 
    mutate(new_col = as.numeric(as.character(new_col))) %>% 
    mutate(new_col = parse_factor(new_col, levels = GGG_levels))
}
> testfunc(mrgb_trus, TRUS_G, TRUSGGG)
Error in mutate_impl(.data, dots) : 
  Evaluation error: object 'TRUS_G' not found.

I "fixed" this by passing the column itself (mrgb_trus$TRUS_G) as the argument.

But then:

testfunc(mrgb_trus, mrgb_trus$TRUS_G, TRUSGGG)
 Error in mutate_impl(.data, dots) : 
  Evaluation error: object 'TRUSGGG' not found. 

How do I get the function to 1) use the argument old_col to select a column in df without writing the data frame name every time, and 2) name the new column before the mutate has actually happened?

Thank you!


#2

This may make things more confusing if you are still learning R, but if you plan to use dplyr inside custom functions, it is probably worth investing some time in the basics of tidyeval. This framework lets your own functions accept bare column names in the same way dplyr functions do.

Without your data, it is hard to troubleshoot my updated version of your function, but this may work for you:

library(dplyr)
library(stringr)  # for str_replace_all()
library(readr)    # for parse_factor()

testfunc <- function(df, old_col, new_col) {
  old_col <- enquo(old_col)
  new_col <- enquo(new_col)
  new_col_name <- quo_name(new_col)
  
  GGG_levels <- c(0, 1, 2, 3, 4, 5)
  df <- df %>% 
    mutate(!!new_col_name := str_replace_all(!!old_col, c("3\\+3" = "1", "3\\+4" = "2", 
                                                "4\\+3" = "3", "4\\+4" = "4", 
                                                "4\\+5" = "5", "5\\+4" = "5", 
                                                "5\\+5" = "5"))) %>% 
    mutate(!!new_col_name := replace(!!new_col, is.na(!!new_col), 0)) %>% 
    mutate(!!new_col_name := replace(!!new_col, !!old_col == "3" | !!old_col == "4", "1")) %>% 
    mutate(!!new_col_name := as.numeric(as.character(!!new_col))) %>% 
    mutate(!!new_col_name := parse_factor(!!new_col, levels = GGG_levels))
  
  df  # return the modified data frame visibly
}

A very basic explanation (from someone who is far from an expert in tidyeval): enquo() captures the bare column name you passed to the function instead of evaluating it, and !! then unquotes it inside dplyr and other tidyverse functions so they treat it as a column of df. quo_name() converts the captured name to a string, which together with := lets you name the new column on the left-hand side of mutate(). I would highly recommend reading through the vignette referenced above for a more complete understanding of the basics of tidyeval.
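To see the pattern in isolation, here is a minimal sketch on a toy data frame. The helper double_col() and the columns x and y are made up for illustration and are not part of the original question; the only point is how enquo(), quo_name(), !! and := fit together.

```r
library(dplyr)

# Hypothetical helper: doubles `old_col` and stores the result under `new_col`.
double_col <- function(df, old_col, new_col) {
  old_col <- enquo(old_col)                 # capture the bare column name
  new_col_name <- quo_name(enquo(new_col))  # new column name as a string

  # `!!` unquotes the captured column; `:=` allows an unquoted name on the left
  df %>% mutate(!!new_col_name := !!old_col * 2)
}

toy <- data.frame(x = c(1, 2, 3))
result <- double_col(toy, x, y)
result
```

Both x and y are passed bare, just as they would be to mutate() itself, and y does not need to exist beforehand.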


#3

I'd suggest also giving wrapr::let() a try for tasks like this. Here is an example and some formal documentation.


#4

Thank you! They both seem to do what I'm looking for. I haven't had the time to read up on the documentation but I'll be back with a solution for my problem when I do.


#5

I also have a new article up on the seplyr package which is another good solution for parameterizing or abstracting over column names.


#6

It occurs to me that another benefit of the lookup table alternative I suggested over in the other thread is that it also happens to allow you to sidestep the non-standard evaluation business:

library(tidyverse)

# Setup example data
mrgb_trus <- data.frame(
  MRGB_gleason = c("3+4", "4", "3+4", "4+4", "3+3",NA, "3+4", "3+3", NA, "4+3", 
                   "3+3", "3+4", "3+4", NA, "3", "3+4", NA, NA, NA, NA, "4+3", "3+4", "3+3", 
                   "4+3", "4+4", "4+5", "3+3", "4+3", "4+3", NA, NA, "3+3", "4+4", "3+4", "4+5", 
                   "3+3", "5+4", NA, NA, "3+4", "4+3", NA, "3+3", "4+3", "3+4", "3+4", "3+4", NA, 
                   "4+4", "4+3", "3+4", "3+4"), 
  stringsAsFactors = FALSE)

mrgb_lookup <- data.frame(
  gleas_score   = c("5+4", "5+5", "4+5", "4+4", "4+3", "3+4", "3+3", "3", "4", NA ),
  gleas_grd_grp = c(  "5",   "5",   "5",   "4",   "3",   "2",   "1", "1", "1", "0"),
  stringsAsFactors = FALSE
)

gs_to_ggg <- function(df, lookup, colname_gs, colname_ggg) {
  # Build the `by` parameter: 
  #   - Gleason scores should be in the first col of lookup table
  #   - `colname_gs` should contain the name of the Gleason score
  #      variable in `df`
  join_by <- names(lookup)[1]
  names(join_by) <- colname_gs
  
  df <- df %>% inner_join(lookup, by = join_by)
  
  # The last column added to df will be the Gleason Grade Groups
  # from the lookup table; rename it to value of `colname_ggg`
  names(df)[length(df)] <- colname_ggg
  df
}

mrgb_trus %>% 
  gs_to_ggg(
    mrgb_lookup, 
    colname_gs = "MRGB_gleason",
    colname_ggg = "MRGGG"
  ) %>% 
  head(20)
#>    MRGB_gleason MRGGG
#> 1           3+4     2
#> 2             4     1
#> 3           3+4     2
#> 4           4+4     4
#> 5           3+3     1
#> 6          <NA>     0
#> 7           3+4     2
#> 8           3+3     1
#> 9          <NA>     0
#> 10          4+3     3
#> 11          3+3     1
#> 12          3+4     2
#> 13          3+4     2
#> 14         <NA>     0
#> 15            3     1
#> 16          3+4     2
#> 17         <NA>     0
#> 18         <NA>     0
#> 19         <NA>     0
#> 20         <NA>     0

Created on 2018-07-03 by the reprex package (v0.2.0).
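One thing to be aware of with the function above: inner_join() drops any row whose score has no entry in the lookup table. Below is a minimal sketch of a variant, assuming you would rather keep such rows and get NA in the new column. The name gs_to_ggg_keep, the shortened lookup table and the score "9+9" are made up for illustration.

```r
library(dplyr)

# Shortened, illustrative lookup table and data (not the real ones)
lookup_small <- data.frame(
  gleas_score   = c("3+3", "3+4"),
  gleas_grd_grp = c("1", "2"),
  stringsAsFactors = FALSE
)
scores <- data.frame(
  MRGB_gleason = c("3+4", "9+9", "3+3"),  # "9+9" has no match on purpose
  stringsAsFactors = FALSE
)

# Hypothetical variant of gs_to_ggg() that keeps unmatched rows
gs_to_ggg_keep <- function(df, lookup, colname_gs, colname_ggg) {
  # Named character vector: name = column in df, value = column in lookup
  by_cols <- names(lookup)[1]
  names(by_cols) <- colname_gs

  # left_join() keeps every row of df; unmatched scores get NA
  df <- df %>% left_join(lookup, by = by_cols)

  # The joined column is the last one; rename it to `colname_ggg`
  names(df)[length(df)] <- colname_ggg
  df
}

kept <- gs_to_ggg_keep(scores, lookup_small, "MRGB_gleason", "MRGGG")
kept
```

Whether dropping or keeping unmatched scores is right depends on the data. With left_join(), an unexpected score shows up as NA instead of silently disappearing.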

