Variables in package environment versus Global environment

jon_leslie · December 19, 2019, 7:12am

Hello,

I'm building a package for the first time and struggling with a problem that I believe is related to environments. In my package, I have a function that uses two data frames as in this example (this isn't the actual function, but should make the point):

my_function <- function() {
  temp <- df1 %>%
    dplyr::left_join(df2, by = "id")
  return(temp)
}

I have included the data frames as "internal data" in the package, stored in R/sysdata.rda. The function works fine. I can install the package and attach it in a new R session with library() and it does what it should.

However, in my workflow, I will be changing df1 and df2. The versions of these I put into the package are for testing the functions, but in reality I want them to work on versions of these objects that are going to grow over time. If, in my R session, I create another variable called "df1" (that has more rows than the version of df1 that is in my package) in the Global environment and run the function, it seems to use the package version of df1 and not the one in the Global environment.
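This behaviour is consistent with R's lexical scoping: a function defined in a package looks up free variables such as df1 in the package namespace before the lookup ever reaches the Global environment. Here is a minimal base R sketch of that lookup order. Note that pkg_env is a hypothetical stand-in for the namespace; a real namespace is created by R when the package loads and has extra parent environments for imports and base.

```r
# Hypothetical stand-in for the package namespace holding the internal data.
pkg_env <- new.env(parent = globalenv())
pkg_env$df1 <- data.frame(id = 1:2)

# A function whose enclosing environment is the "package", as it would be
# for a function defined in R/ and loaded with library().
my_function <- function() nrow(df1)
environment(my_function) <- pkg_env

# A larger df1 created by the user in the session.
df1 <- data.frame(id = 1:5)

my_function()  # returns 2: the copy in pkg_env is found first
```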

Please can anyone suggest what the best practice would be to overcome this? Is it foolish to have this data in my package at all? My thought was to somehow override which version the function uses...to somehow include a step where it will search for the version in the Global environment and use that if it exists, and defer to the version in the package if it can't find it in the Global environment.

Thanks very much!

technocrat · December 19, 2019, 7:31am

Hi, and welcome!

I am not a package guy, yet. But I think I spot an easy fix to your function:

my_function <- function(x, y) {
  temp <- x %>%
    dplyr::left_join(y, by = "id")
  return(temp)
}

You might even want to add a third argument for the join field:

my_function <- function(x, y, j) {
  temp <- x %>%
    dplyr::left_join(y, by = j)
  return(temp)
}

That allows you to call my_function with any two data frames in your session, whatever they are named, without worrying about which environment they live in.

jon_leslie · December 20, 2019, 4:31pm

Hi,

Thanks very much for your suggestion. I'll give it some thought.

I believe I found a solution that is getting at what I was trying to do:

my_function <- function() {
  if("df1" %in% names(.GlobalEnv)){
    df1 <- .GlobalEnv$df1
  }
  if("df2" %in% names(.GlobalEnv)){
    df2 <- .GlobalEnv$df2
  }
  
  temp <- df1 %>% 
    dplyr::left_join(df2, by = c("id"))
  return(temp)
}

Would love to hear if anyone has thoughts as to whether this is a good approach or not.
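One alternative to looking things up in .GlobalEnv by name, offered as a sketch rather than a recommendation: make the data frames arguments whose defaults are the package's internal copies. Default arguments are evaluated lazily inside the function, so they resolve to the package versions unless the caller passes something else. In the example below, pkg_env is a hypothetical stand-in for the package namespace, and base R's merge() with all.x = TRUE is swapped in for dplyr::left_join() only to keep the example dependency-free.

```r
# Hypothetical stand-in for the package namespace holding the internal data.
pkg_env <- new.env(parent = globalenv())
pkg_env$df1 <- data.frame(id = 1:2, a = c("x", "y"))
pkg_env$df2 <- data.frame(id = 1:2, b = c(10, 20))

# Defaults point at the internal data; callers can override either one.
my_function <- function(x = df1, y = df2, by = "id") {
  merge(x, y, by = by, all.x = TRUE)  # base R equivalent of a left join
}
environment(my_function) <- pkg_env

# No arguments: falls back to the internal copies.
res_default <- my_function()

# The caller supplies a larger data frame for x; y still uses the default.
bigger <- data.frame(id = 1:3, a = c("x", "y", "z"))
res_custom <- my_function(x = bigger)
```

With this design the function never needs to know what the caller's objects are called, and the package data still works as a fallback for testing.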

jdblischak · January 3, 2020, 3:27pm

Since the data frames are examples meant to test the function, I think it would be more natural to save them as external data sets instead of internal data sets.

To do this, you can save them as binary R data files in the package subdirectory data/. For convenience, you can use usethis::use_data() to automate this. If df1 and df2 are defined in the current R session, you could run:

> usethis::use_data(df1, df2)
✔ Creating 'data/'
✔ Saving 'df1', 'df2' to 'data/df1.rda', 'data/df2.rda'

Then when you want to use the example data sets, you would run the following:

library(myPkg)
data(df1)
data(df2)
my_function(df1, df2)

And then if you subsequently modify df1 and df2 in the current R session, you can pass the updated data frames to the function:

# after modifying df1 and df2
my_function(df1, df2)

And this also gives you (and any other users of the package) the freedom to use other names:

my_function(df3, df4)

Here's a reproducible example using a modified version of the suggested function from @technocrat:

my_function <- function(x, y, j = NULL) {
  dplyr::left_join(x, y, by = j)
}

data("diamonds", package = "ggplot2")
df1 <- diamonds[, 1:7]
df2 <- diamonds[, c(1:3, 8:10)]

my_function(df1, df2, j = c("carat", "cut", "color"))

See the "Data" chapter of the book R Packages for more details on including data sets in R packages.

jlacko · January 3, 2020, 5:22pm

I concur with the idea of keeping the data external and passing it as arguments to the function. That makes the function more flexible to apply and does not force package users to name their data frames in a particular way.

One additional idea you may wish to consider is using an environment variable to separate flow between (unit) testing and production modes.

Something along the lines of:

my_function <- function() {

  drill <- as.logical(Sys.getenv("THIS_IS_NOT_A_DRILL", unset = "FALSE")) # set appropriately by your test_that setup

  if(!drill){
    df1 <- .GlobalEnv$df1
    df2 <- .GlobalEnv$df2
  }
  
  temp <- df1 %>% 
    dplyr::left_join(df2, by = c("id"))
  return(temp)
}

In your test_that environment you would need to set the THIS_IS_NOT_A_DRILL variable to TRUE (and reset it afterwards!), so that the if clause would not trigger, and the internal data frames would retain priority over those in the global environment.

I have found this pattern helpful in testing for expected behavior in hard-to-reproduce scenarios, such as a network failure.
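As a rough sketch of the set-and-reset step using only base R (the helper name use_internal_data is hypothetical, introduced here for illustration):

```r
# Hypothetical helper: TRUE only when the variable is set to a value that
# as.logical() reads as TRUE; unset or unparseable values count as FALSE.
use_internal_data <- function() {
  isTRUE(as.logical(Sys.getenv("THIS_IS_NOT_A_DRILL", unset = "FALSE")))
}

# Remember the previous state (NA means the variable was not set at all).
old <- Sys.getenv("THIS_IS_NOT_A_DRILL", unset = NA)

# Switch to "testing" mode for the duration of the test.
Sys.setenv(THIS_IS_NOT_A_DRILL = "TRUE")
flag_during_test <- use_internal_data()

# Restore the previous state so later code is unaffected.
if (is.na(old)) {
  Sys.unsetenv("THIS_IS_NOT_A_DRILL")
} else {
  Sys.setenv(THIS_IS_NOT_A_DRILL = old)
}
```

Inside a function or a test block, wrapping the restore step in on.exit() would make sure it runs even if the test fails part-way through.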

jon_leslie · January 17, 2020, 9:32am

Thanks, all, for your input. That's very helpful!