How to use tidyeval to run functions stored as strings on corresponding columns

Dear helpful R community,

I am in a situation where I have

  • many sets of data, whose columns keep changing positions/names (albeit with slight adjustments and typos)
  • 2 types of functions that I need to run by group on all datasets/columns depending upon their classification

I asked this question on SO too, but later realised that, given the number of datasets and columns, it would be impractical to code the functions manually.

A sample data set could look like this:

### sample, simplified dataframe
df1 <- tibble(A=c(NA, 1, 2, 3), B = c(1,2,1,NA), C = c(NA,NA,NA,2), D = c(2,3,NA,1), E = c(NA,NA,NA,1))

### sample function dataframe
funcDf <- tibble(colNames = names(df1), type = c(rep("Compulsory", 4), "Conditional"))
funcDf <- funcDf %>%
    mutate(func = as.character(glue("is.na({funcDf$colNames})")))
funcDf[funcDf$colNames == "E",]$func <- "ifelse(is.na(E) & !is.na(A), 1, 0)"

I would like to apply each function to its corresponding column: funcDf identifies which function belongs to which column, and the functions need to be applied to df1.
I thought this would be a standard use case for tidyeval, but I will be thankful for other advice/suggestions as well.

Hi @scac_1041,

Welcome to the community! I am not sure how flexible you are about how the functions are passed or defined. In your example the functions look odd: they take x as an argument, but x is not used anywhere in the function body. I have attempted a simplified version of your problem in which, for now, I apply a single function to all columns of the data frame. Is that going in the right direction?

I have defined functions as a list here so that we could have more than one function, even though for this simple example that wouldn't be necessary.

library(tidyverse)
### sample, simplified dataframe
df1 <- tibble(A=c(NA, 1, 2, 3), B = c(1,2,1,NA), C = c(NA,NA,NA,2), D = c(2,3,NA,1), E = c(NA,NA,NA,1))
functions <- list(f1 = str2lang("is.na"))
g <- as.call(list(functions$f1, quote(z)))
sapply(X = df1, FUN = function(z){eval(g)})
#>          A     B     C     D     E
#> [1,]  TRUE FALSE  TRUE FALSE  TRUE
#> [2,] FALSE FALSE  TRUE FALSE  TRUE
#> [3,] FALSE FALSE  TRUE  TRUE  TRUE
#> [4,] FALSE  TRUE FALSE FALSE FALSE

Created on 2019-10-17 by the reprex package (v0.3.0)
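
To show why a list is convenient here, the following is a minimal sketch of the same idea with two single-column functions instead of one. It uses base R only, with a plain data.frame standing in for the tibble; the second function (is.finite) and the list names `missing` and `finite` are assumptions added purely for illustration.

```r
# Base R only; data.frame stands in for the tibble used above.
df1 <- data.frame(A = c(NA, 1, 2, 3), B = c(1, 2, 1, NA))

# Function names stored as strings and parsed into symbols.
# is.finite is an illustrative second function, not part of the original question.
functions <- list(missing = str2lang("is.na"), finite = str2lang("is.finite"))

# Build one unevaluated call per function: is.na(z), is.finite(z).
calls <- lapply(functions, function(f) as.call(list(f, quote(z))))

# Evaluate each call on every column; z is bound to the current column.
results <- lapply(calls, function(g) sapply(df1, function(z) eval(g)))

results$missing
results$finite
```

Each element of `results` is a logical matrix with one column per column of df1. This still works one column at a time, so it only covers functions that look at a single column.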

Thank you so much for the contribution, @valeri

I see your point about the functions. I will edit them to make them proper functions.
I think you are going in the right direction. There is just one thing to consider: each column can have one of two different types of function, based on the Compulsory/Conditional label in my funcDf:

  • is.na()
  • ifelse(column1 and column2 == "x", 1, 0)

And with the funcDf, I was basically trying to store functions applicable to corresponding column names.

Hope this is clearer?

Could you try to write this as a proper R function? I am not sure what column1 and column2 refer to, given that you are applying a given function to a single column of a data frame. Thanks

library(tidyverse);library(glue)
#> 
#> Attaching package: 'glue'
#> The following object is masked from 'package:dplyr':
#> 
#>     collapse

df1 <- tibble(A=c(NA, 1, 2, 3), B = c(1,2,1,NA), C = c(NA,NA,NA,2), D = c(2,3,NA,1), E = c(NA,NA,NA,1))
df1
#> # A tibble: 4 x 5
#>       A     B     C     D     E
#>   <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1    NA     1    NA     2    NA
#> 2     1     2    NA     3    NA
#> 3     2     1    NA    NA    NA
#> 4     3    NA     2     1     1

funcDf <- tibble(colNames = names(df1), type = c(rep("Compulsory", 4), "Conditional"))
funcDf <- funcDf %>%
mutate(func = as.character(glue("is.na({funcDf$colNames})")))
funcDf[funcDf$colNames == "E",]$func <- "ifelse(is.na(E) & !is.na(A), 1, 0)"

funcDf
#> # A tibble: 5 x 3
#>   colNames type        func                              
#>   <chr>    <chr>       <chr>                             
#> 1 A        Compulsory  is.na(A)                          
#> 2 B        Compulsory  is.na(B)                          
#> 3 C        Compulsory  is.na(C)                          
#> 4 D        Compulsory  is.na(D)                          
#> 5 E        Conditional ifelse(is.na(E) & !is.na(D), 1, 0)

Does this help to clarify?
The ifelse function that I alluded to takes 2 column names from the same dataframe to which we are applying the function.
In the example above, it checks whether column E is NA while column D is not, in which case it flags the row as a missing value with 1.

Created on 2019-10-17 by the reprex package (v0.3.0)

It seems that if any of the functions operate on more than one column, then the sapply (and, by extension, the purrr) implementation will not work as easily, since they operate on one column at a time. In your case, it might be that a loop is not (easily) avoidable.

I was hoping that there is an explicit link between the type of column (e.g., Compulsory or Conditional) and the function that needs to be applied. That, however, doesn't seem to be the case. For example, it is not clear how column E being of type Conditional is linked to a dependency on another column (D in this case). How would the functions look if column A or, let's say, C were of type Conditional?

What I am trying to understand is: does every column have its own specific function, or does the function that needs to be applied depend on the column type?

@valeri thank you for staying to help :slightly_smiling_face:

Yes, you are right that there is no link when it comes to columns of type Conditional: their functions can refer to any 1, 2 or 3 columns.

Columns of type Compulsory, however, always have is.na applied to that specific column only.

This is one of the reasons why I was using character strings to store the functions: I felt it would keep the functions for Conditional columns malleable and easy to change.

How about this?

library(tidyverse)
library(glue)
#> 
#> Attaching package: 'glue'
#> The following object is masked from 'package:dplyr':
#> 
#>     collapse
library(rlang)
#> 
#> Attaching package: 'rlang'
#> The following objects are masked from 'package:purrr':
#> 
#>     %@%, as_function, flatten, flatten_chr, flatten_dbl,
#>     flatten_int, flatten_lgl, flatten_raw, invoke, list_along,
#>     modify, prepend, splice
library(purrr)

### sample, simplified dataframe
df1 <- tibble(A=c(NA, 1, 2, 3), B = c(1,2,1,NA), C = c(NA,NA,NA,2), D = c(2,3,NA,1), E = c(NA,NA,NA,1))

### sample function dataframe
funcDf <- tibble(colNames = names(df1), type = c(rep("Compulsory", 4), "Conditional"))
funcDf <- funcDf %>%
    mutate(func = as.character(glue("is.na({funcDf$colNames})")))
funcDf[funcDf$colNames == "E",]$func <- "ifelse(is.na(E) & !is.na(A), 1, 0)"

purrr::map_df(set_names(funcDf$func, names(df1)), function(x) {eval_tidy(str2lang(x), data = df1)})
#> # A tibble: 4 x 5
#>   A     B     C     D         E
#>   <lgl> <lgl> <lgl> <lgl> <dbl>
#> 1 TRUE  FALSE TRUE  FALSE     0
#> 2 FALSE FALSE TRUE  FALSE     1
#> 3 FALSE FALSE TRUE  TRUE      1
#> 4 FALSE TRUE  FALSE FALSE     0

Created on 2019-10-17 by the reprex package (v0.3.0)
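
Since the column names in the real datasets reportedly drift and contain typos, it may be worth checking the stored strings before evaluating them. The following is a rough sketch using base R's all.vars(), which lists the variable names an expression refers to. The reduced data frame, the `exprs` vector and the misspelled column `Aa` are assumptions made up for illustration.

```r
# Base R only; data.frame stands in for the tibble used above.
df1 <- data.frame(A = c(NA, 1, 2, 3), B = c(1, 2, 1, NA), E = c(NA, NA, NA, 1))

# Hypothetical expression strings; the last one misspells column A as Aa.
exprs <- c(A = "is.na(A)",
           B = "is.na(B)",
           E = "ifelse(is.na(E) & !is.na(Aa), 1, 0)")

# all.vars() returns the variable names used in an expression
# (function names such as is.na and ifelse are not included by default).
used <- lapply(exprs, function(x) all.vars(str2lang(x)))

# Keep only the names that are not columns of df1.
unknown <- lapply(used, setdiff, names(df1))
unknown <- unknown[lengths(unknown) > 0]
unknown
```

Here `unknown` should contain a single entry, named E, holding "Aa". An empty list would mean every expression refers only to columns that exist in df1.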

Splendid, sensei!
This was the one line of code I was looking for but did not know how to write.

In regards to the solution:

  • I only need to be mindful that I use set_names correctly, otherwise a function may be applied to another column?
  • Would you recommend any resources to understand such usage of tidyeval? How did you learn?
  • For my learning, would you agree that tidyeval is the better way to go with this use-case, or would you have done something different?
    Only words suffice, no code :slight_smile:

Thank you, once again!

Thanks a lot @scac_1041,

I will try to give you some comments on your questions :slight_smile:

  • I only need to be mindful that I use set_names correctly, otherwise a function may be applied to another column?

I used set_names here only so that the column names of the final result match the column names of df1; the code would run properly without it. The functions are applied in exactly the same way whether or not we use set_names, because eval_tidy evaluates the code in funcDf$func, where the affected column names are already defined (basically hard-coded).
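
A small sketch of that point, assuming a reduced two-column version of the data and using base R's setNames and eval as stand-ins for set_names and eval_tidy. The helper `evaluate_all` is a hypothetical name introduced only for this illustration.

```r
# Base R only; data.frame stands in for the tibble used above.
df1 <- data.frame(A = c(NA, 1, 2, 3), B = c(1, 2, 1, NA))
exprs <- c("is.na(A)", "is.na(B)")

# Hypothetical helper: label the expressions, then evaluate each one against df1.
evaluate_all <- function(labels) {
  lapply(setNames(exprs, labels), function(x) eval(str2lang(x), df1))
}

matched <- evaluate_all(c("A", "B"))
swapped <- evaluate_all(c("B", "A"))  # labels deliberately reversed

# The computed values are the same; only the labels on the output differ.
identical(unname(matched), unname(swapped))
names(swapped)
```

So wrong names would mislabel the output columns, but each expression would still be computed on the column that is written inside its string.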

  • Would you recommend any resources to understand such usage of tidyeval ? How did you learn?

I like these 2 resources: https://adv-r.hadley.nz/ and https://tidyeval.tidyverse.org/ - but I have mostly learned by trying to solve my own challenges

  • For my learning, would you agree that tidyeval is the better way to go with this use-case, or would you have done something different?

So, str2lang is base R; it simply parses the string into an unevaluated call. This is where eval_tidy comes in. We need to evaluate every expression in the context of the data frame df1, so that in an expression like is.na(A) the A refers to a column of df1, and eval_tidy does that through its data argument. Base R's eval can do the same, since its envir argument also accepts a data frame or list; what eval_tidy adds is support for quosures and the .data and .env pronouns. For plain strings like these, either should work.
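
For comparison, here is a minimal sketch of the base R route: eval() accepts a data frame as its envir argument, so the column names in the expression are looked up in df1 first. The reduced data frame is an assumption for brevity, and the rlang comparison only runs if rlang happens to be installed.

```r
# Base R; data.frame stands in for the tibble used above.
df1 <- data.frame(A = c(NA, 1, 2, 3), D = c(2, 3, NA, 1), E = c(NA, NA, NA, 1))
x <- "ifelse(is.na(E) & !is.na(A), 1, 0)"

# Columns of df1 are visible to the expression because df1 is passed as envir.
base_result <- eval(str2lang(x), envir = df1)
base_result

# Optional check against rlang::eval_tidy, skipped when rlang is not available.
if (requireNamespace("rlang", quietly = TRUE)) {
  tidy_result <- rlang::eval_tidy(str2lang(x), data = df1)
  print(identical(base_result, tidy_result))
}
```

With these inputs `base_result` is 0, 1, 1, 0, which matches column E in the map_df output earlier in the thread.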
