Dynamically create dataframe names with glue-like operators, as in SAS and Stata

Hi,

I've been trying to find a solution to this for over a year and decided to write a post about it. Any help would be appreciated. Here is pseudocode for something I can do easily in Stata and SAS but don't know how to do in R. {} is the glue-like operator that was introduced into dplyr this year.

var <- "mpg"
df_name <- "mtcars"

{df_name} %>% count({var})

{df_name}_1 <- {df_name} %>% mutate(., {var}_1={var}/2)

length({df_name}_1)

In Stata, I can easily do this with local or global macros:
local df_name "mtcars"
then reference it as `df_name'

In SAS I can do it with global macros like this:
%LET df_name=mtcars;
then reference it like &df_name.

Please note how visually easy these approaches are to parse when the macro is referenced.

Both approaches let me use macros in dataset names, functions, variables, etc. This simplifies my code tremendously and saves me tons of time. How can I do this with the same visual simplicity in R? My code should be readable for people familiar with Stata/SAS (dplyr is awesome in this regard!), and too many evals, wrapping everything in functions, or assigns with parentheses will just make them give up on the project or force me to switch back to SAS/Stata.

I would really appreciate any help with this! I ran into this problem back in 2009 and gave up on R, then had to come back to it in 2019, and I still can't find an easy way to approach it.


Short of writing a domain-specific language to translate SAS/Stata statements into well-formed R statements, there is no satisfactory way to make R obvious to thinkers in those languages.

Instead, consider f(x) = y, the foundational paradigm of R: x is what is at hand, y is what is desired, and f is the first-class object that transforms one into the other. While a deeply composed f, such as i(h(g(f(x)))), can cause eyes to glaze over, R does not insist on wrapping everything in a single statement; %>% and intermediate object names are there to make the compositional steps easier to follow.

Better than either is to use a basket of functions to abstract away the syntax-related barriers.

# the input, an arbitrary data frame

names(mtcars)
#>  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
#> [11] "carb"

# the output desired, count

nrow(mtcars['mpg'])
#> [1] 32

# another output desired, result of division

mtcars['mpg']/2
#>                       mpg
#> Mazda RX4           10.50
#> Mazda RX4 Wag       10.50
#> Datsun 710          11.40
#> Hornet 4 Drive      10.70
#> Hornet Sportabout    9.35
#> Valiant              9.05
#> Duster 360           7.15
#> Merc 240D           12.20
#> Merc 230            11.40
#> Merc 280             9.60
#> Merc 280C            8.90
#> Merc 450SE           8.20
#> Merc 450SL           8.65
#> Merc 450SLC          7.60
#> Cadillac Fleetwood   5.20
#> Lincoln Continental  5.20
#> Chrysler Imperial    7.35
#> Fiat 128            16.20
#> Honda Civic         15.20
#> Toyota Corolla      16.95
#> Toyota Corona       10.75
#> Dodge Challenger     7.75
#> AMC Javelin          7.60
#> Camaro Z28           6.65
#> Pontiac Firebird     9.60
#> Fiat X1-9           13.65
#> Porsche 914-2       13.00
#> Lotus Europa        15.20
#> Ford Pantera L       7.90
#> Ferrari Dino         9.85
#> Maserati Bora        7.50
#> Volvo 142E          10.70

# function to transform input to output

find_count <- function(x,y) nrow(x[y])

find_count(mtcars,"mpg")
#> [1] 32

# function to divide input

divide_it <- function(x,y,z) x[y]/z

divide_it(mtcars,"mpg",2)
#>                       mpg
#> Mazda RX4           10.50
#> Mazda RX4 Wag       10.50
#> Datsun 710          11.40
#> Hornet 4 Drive      10.70
#> Hornet Sportabout    9.35
#> Valiant              9.05
#> Duster 360           7.15
#> Merc 240D           12.20
#> Merc 230            11.40
#> Merc 280             9.60
#> Merc 280C            8.90
#> Merc 450SE           8.20
#> Merc 450SL           8.65
#> Merc 450SLC          7.60
#> Cadillac Fleetwood   5.20
#> Lincoln Continental  5.20
#> Chrysler Imperial    7.35
#> Fiat 128            16.20
#> Honda Civic         15.20
#> Toyota Corolla      16.95
#> Toyota Corona       10.75
#> Dodge Challenger     7.75
#> AMC Javelin          7.60
#> Camaro Z28           6.65
#> Pontiac Firebird     9.60
#> Fiat X1-9           13.65
#> Porsche 914-2       13.00
#> Lotus Europa        15.20
#> Ford Pantera L       7.90
#> Ferrari Dino         9.85
#> Maserati Bora        7.50
#> Volvo 142E          10.70

Created on 2020-10-09 by the reprex package (v0.3.0.9001)

Thank you for the reply. I don't need to translate SAS/Stata statements to R, though. I am looking for a visually simple way to execute the pseudocode above in R. Quosures, !!, curly-curly, NSE -- that's what I've been digging into for the last year, and I still have no idea how to use them to execute the pseudocode above in its entirety. I can do the {var} part somewhat in a dplyr pipeline, but I can't do anything with the LHS of the <- and/or base R.
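For example, this much already works for me in a straight pipeline (a minimal sketch, with the column passed as a string):

library(dplyr)

var <- "mpg"

# the column part can be supplied as a string via !!sym() or the .data pronoun
mtcars %>% count(!!sym(var))
mtcars %>% count(.data[[var]])

It's the left-hand side of <- that I can't parameterize the same way.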

Sounds like you want to treat names as strings so you can modify them and add suffixes to them. In R, I'd think to use glue or paste to construct the names, and then use assign/get to create objects with those names and to retrieve them from the environment, respectively.

This sounds very interesting! Can you show me an example using the pseudocode above? I read on SO that "get" is not recommended anymore, and too much verbiage (assign, etc.) makes the code hard to read. I'm in academic research, so there is a lot of new data, extensive data checks, and dead ends in code that is usually several thousand lines long, and any visual simplification would be very welcome.

I'm surprised this post hasn't produced 10s of replies.

So nobody knows if it can be done? This is very simple pseudocode.

I use both SAS and R regularly. Sometimes it's best not to mimic the behavior of another language. To do the same thing multiple times on different inputs, use functions in R. The SAS-macro idea doesn't carry over, because all SAS macros do is replace text before the code is compiled, which isn't how R operates. I don't have a great answer for your question other than using get and assign, which do get verbose because what you're suggesting isn't natural in R. I'm sorry this isn't a great answer and not what you're looking for.
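For what it's worth, a rough base-R sketch of that get/assign route, so you can judge the verbosity yourself (names built with paste0 here; glue("{df_name}_1") would read much the same):

df_name <- "mtcars"
var <- "mpg"

df <- get(df_name)                        # look the data frame up by its name
df[[paste0(var, "_1")]] <- df[[var]] / 2  # add the derived column
assign(paste0(df_name, "_1"), df)         # store the result under the constructed name

length(get(paste0(df_name, "_1")))        # 12 columns, matching the length() check in the pseudocode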


I wonder if storing your set of data frames as a list rather than as separate objects in your environment might make this a little easier? Not totally clear if the following is what you're after, but I think it works for your pseudocode example. It does not read as idiomatic dplyr code to my eye, but maybe it's closer to what your colleagues are used to in SAS or Stata?

library(dplyr, warn.conflicts = FALSE)
library(glue, warn.conflicts = FALSE)

var <- "mpg"
df_name <- "mtcars"

var2 <- "Sepal.Length"
df_name2 <- "iris"

data <- list(mtcars = as_tibble(mtcars), iris = as_tibble(iris))

data[[df_name]] %>% 
  count(.data[[var]])
#> # A tibble: 25 x 2
#>      mpg     n
#>    <dbl> <int>
#>  1  10.4     2
#>  2  13.3     1
#>  3  14.3     1
#>  4  14.7     1
#>  5  15       1
#>  6  15.2     2
#>  7  15.5     1
#>  8  15.8     1
#>  9  16.4     1
#> 10  17.3     1
#> # ... with 15 more rows

data[[glue("{df_name}_1")]] <- data[[df_name]] %>% 
  mutate("{var}_1" := .data[[var]] / 2)

length(data[[glue("{df_name}_1")]])
#> [1] 12

data[[df_name2]] %>% 
  count(.data[[var2]])
#> # A tibble: 35 x 2
#>    Sepal.Length     n
#>           <dbl> <int>
#>  1          4.3     1
#>  2          4.4     3
#>  3          4.5     1
#>  4          4.6     4
#>  5          4.7     2
#>  6          4.8     5
#>  7          4.9     6
#>  8          5      10
#>  9          5.1     9
#> 10          5.2     4
#> # ... with 25 more rows

data[[glue("{df_name2}_1")]] <- data[[df_name2]] %>% 
  mutate("{var2}_1" := .data[[var2]] / 2)

length(data[[glue("{df_name2}_1")]])
#> [1] 6

Created on 2020-10-13 by the reprex package (v0.3.0)

Here are some references that describe how to use the .data pronoun with dplyr and glue strings to create variable names.

How would you solve this via functions in a visually easy-to-understand way? It seems that with {{ and !!, the tidyverse is finally moving in the right direction, even without functions, just in straight dplyr pipelines. Not sure if this was an intended consequence, though.

Thank you for the reply!! I thought of packing everything into lists, but unfortunately that would make everything more convoluted. I'm not that good at R, and lists just add a layer of complexity in time-constrained situations and make the code less readable to colleagues. If I could do this:

glue("{df_name2}_1") <- df_name2 %>%   mutate("{var2}_1" := .data[[var2]] / 2)

that would be the best solution!

Some interesting tips in the post, though! Thank you!

library(glue)
library(tidyverse)

dn <- "mtcars"
v <- "mpg"

myfunc <- function(dfname, var, suffix){
  df <- get(dfname)                     # fetch the data frame by its name
  print(df %>% count(!!sym(var)))       # count on the column named in `var`

  vn <- glue("{var}_{suffix}")          # e.g. "mpg_1"
  new <- df %>% mutate(!!sym(vn) := !!sym(var) / 2) %>% as_tibble()
  print(length(new))                    # number of columns in the new data frame

  nn <- glue("{dfname}_{suffix}")       # e.g. "mtcars_1"
  assign(nn, new, envir = .GlobalEnv)   # create the new object under that name
}


myfunc(dn,v,1)
myfunc(dn,"hp",3)

ls()
mtcars_1
mtcars_3

Valuable snippet for the name assignment.

The whole macro approach, though, still strikes me as profoundly a-functional, except in the case where the same hardwired object name has to be passed to several different functions. Better, I assert, is to compose a piped function designed to operate in a purrr::map context, taking an arbitrary object signifier.
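A rough sketch of what I mean (the specs list pairing each data frame with a variable name is just an illustration; any structure purrr::map can iterate over would do):

library(dplyr)
library(purrr)

# pair each data frame with the variable to transform (hypothetical spec)
specs <- list(
  mtcars = list(df = mtcars, var = "mpg"),
  iris   = list(df = iris,   var = "Sepal.Length")
)

halve_var <- function(df, var) {
  df %>% mutate("{var}_1" := .data[[var]] / 2)   # glue-style name on the LHS of :=
}

results <- map(specs, ~ halve_var(.x$df, .x$var))
map_int(results, length)   # columns per result: 12 for mtcars, 6 for iris

The results stay together in one list, so nothing needs to be assign()ed into the global environment.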

Sean Connery in The Untouchables: "That's the Chicago way."


"same hardwired object name to be passed to several different functions" -- I agree this is exactly the process here. As I mentioned earlier it's not for production it is for academic research with a lot of pipes, dead ends, intermediate checks, intermediate summary tables where the initial dataset is checked in an "as needed" fashion against posterior versions of the dataset with different transformations. I tried to wrap things in functions but the overhead is too much to handle and code gets too cluttered. If the goal was production-ready code, I would of course aim for the function-oriented approach. This is a valuable discussion.


EDA vs deployment is a great distinction
