When programming with dplyr, what is the correct way to avoid undefined global variables?

jjohn · March 9, 2020, 9:05pm

Background

This is from the development version of the "Programming with dplyr" vignette, beginning on line 189.

### Eliminating `R CMD check` `NOTE`s

If you're writing a package and you have a function that uses data-variables:

```{r}
my_summary_function <- function(data) {
  data %>%
    filter(x > 0) %>%
    group_by(grp) %>%
    summarise(y = mean(y), n = n())
}
```

You'll get an R CMD check NOTE:

N  checking R code for possible problems
   my_summary_function: no visible binding for global variable ‘x’, ‘grp’, ‘y’
   Undefined global functions or variables:
     x grp y

You can eliminate this by using .data$var and importing .data from its source in the rlang package (the underlying package that implements tidy evaluation):

#' @importFrom rlang .data
my_summary_function <- function(data) {
  data %>% 
    filter(.data$x > 0) %>% 
    group_by(.data$grp) %>% 
    summarise(y = mean(.data$y), n = n())
}
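To try the `.data` version outside a package, a minimal sketch follows. It assumes dplyr is attached (dplyr re-exports `.data`, so no roxygen tag is needed in a script); the `toy` data frame is made up for illustration and is not from the vignette.

```r
library(dplyr)

# Same function as in the vignette excerpt, runnable interactively
my_summary_function <- function(data) {
  data %>%
    filter(.data$x > 0) %>%
    group_by(.data$grp) %>%
    summarise(y = mean(.data$y), n = n())
}

# Hypothetical toy data: one row has x <= 0 and is dropped by filter()
toy <- data.frame(
  x   = c(-1, 1, 2, 3),
  grp = c("a", "a", "b", "b"),
  y   = c(10, 20, 30, 50)
)

out <- my_summary_function(toy)
print(out)
```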

My Question

What if you are using user-supplied variable names in your function? For example,

my_summary_function <- function(data, group_var, weight_var) {
  data %>% 
    group_by({{group_var}}) %>% 
    summarise(weighted_count = sum({{weight_var}}))
}

When running devtools::check() this will yield an "Undefined global functions or variables" note. What is the best way to avoid this?

Is the best practice just to embrace every user-supplied variable like this?

my_summary_function <- function(data, group_var, weight_var) {
  data %>% 
    mutate(weight_var := {{weight_var}}) %>%
    group_by({{group_var}}) %>% 
    summarise(weighted_count = sum(.data$weight_var))
}

mattwarkentin · March 9, 2020, 9:21pm

Someone on Twitter asked this question recently and someone responded saying they use utils::globalVariables(), but I can't find the tweet...I'm looking.

Maybe this is helpful too: how to solve "no visible binding for global variable" note?

jjohn · March 9, 2020, 9:27pm

Yeah, that post is what occasioned this question. It doesn't deal with the situation I've outlined, where the variable name is user-supplied.

I also saw the utils::globalVariables() description. I don't know a lot about global variables, but I'm pretty sure I really don't want them in the package I'm writing. The variables supplied by the user will only ever be used inside that specific function.

mattwarkentin · March 9, 2020, 9:30pm

Unless I'm misinterpreting your question, you would need to include something like this somewhere in your source code for each relevant argument that is missing a global binding. Note that the names go in a single character vector, since the second and third arguments of globalVariables() are package and add.

utils::globalVariables(c("data", "group_var", "weight_var"))

The values the user supplies are not relevant, since those can't be known ahead of time.

jjohn · March 9, 2020, 9:34pm

Oh yeah, that makes sense. Thanks.

This solves the problem I was having running devtools::check(). I'm still a bit confused about what global variables are in this context. If you or anyone has any good resources to point me to, I'd appreciate it!

malcolm · March 9, 2020, 10:20pm

As a secondary response, using .data does also solve that problem. Personally, I'm moving away from globalVariables() towards .data because I don't need to keep updating it.

re: what they are in this context, I think the help page of globalVariables() puts it pretty well:

For globalVariables, the names supplied are of functions or other objects that should be regarded as defined globally when the check tool is applied to this package.
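To make "regarded as defined globally" more concrete, here is a rough sketch using the codetools package, which, as far as I understand, is what the check tool relies on for this NOTE. The helper `keep_if()`, the column name `score_col`, and `collect_notes()` are all hypothetical names made up for this illustration; `keep_if()` only stands in for an NSE verb like dplyr::filter(). Passing names via `suppressUndefined` is meant as an approximation of what registering them with globalVariables() achieves during a check, not a description of the exact mechanism.

```r
library(codetools)  # recommended package that ships with R

# Hypothetical NSE helper standing in for a verb like dplyr::filter()
keep_if <- function(data, cond) {
  rows <- eval(substitute(cond), data, parent.frame())
  data[rows, , drop = FALSE]
}

# `score_col` is a data-variable: static analysis sees no binding for it
uses_data_var <- function(data) keep_if(data, score_col > 0)

# Here the same name is a function argument, so it has a local binding
uses_argument <- function(data, score_col) keep_if(data, score_col > 0)

# Collect checkUsage() messages in a vector instead of printing them
collect_notes <- function(fun, ...) {
  notes <- character()
  checkUsage(fun, report = function(msg) notes <<- c(notes, msg), ...)
  notes
}

notes_plain      <- collect_notes(uses_data_var)
notes_suppressed <- collect_notes(uses_data_var, suppressUndefined = "score_col")
notes_argument   <- collect_notes(uses_argument)

print(notes_plain)       # should mention score_col
print(notes_suppressed)  # should be empty
print(notes_argument)    # should be empty
```

The point of the third case is that a name that is an argument of the function is never reported, which is why functions that only touch user-supplied variables do not normally need globalVariables().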

jjohn · March 10, 2020, 2:43pm

Sorry if I'm missing something obvious here. How do you use .data$var when var is a user-defined expression?

malcolm · March 10, 2020, 3:49pm

I think maybe I misunderstood your two examples, but let's take a step back. The first one should not generate warnings because those names already have visible bindings inside the function: they are its arguments. Do you get warnings when you do that? I haven't been able to generate any.

Here's what I just did to double check:

library(usethis)
create_package("testpkg2") # I already have a testpkg ;p
use_mit_license()
use_package("dplyr")
use_r("summary")

in R/summary.R, I put:

#' Here's a function
#'
#' @param data a data
#' @param group_var some var
#' @param weight_var this one too
#'
#' @export
#'
#' @importFrom dplyr %>%
my_summary_function <- function(data, group_var, weight_var) {
  data %>%
    dplyr::group_by({{group_var}}) %>%
    dplyr::summarise(weighted_count = sum({{weight_var}}))
}

Then, after building the package, I run check. Here's the output, but no warnings generated! Running the function also works as expected.

devtools::check()
#> Updating testpkg2 documentation
#> Writing NAMESPACE
#> Loading testpkg2
#> Writing NAMESPACE
#> ── Building ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── testpkg2 ──
#> Setting env vars:
#> ● CFLAGS    : -Wall -pedantic
#> ● CXXFLAGS  : -Wall -pedantic
#> ● CXX11FLAGS: -Wall -pedantic
#> ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
#>      checking for file ‘/Users/malcolmbarrett/Google Drive/Active/reference/testpkg2/DESCRIPTION’ ...  ✓  checking for file ‘/Users/malcolmbarrett/Google Drive/Active/reference/testpkg2/DESCRIPTION’
#>   ─  preparing ‘testpkg2’:
#>      checking DESCRIPTION meta-information ...  ✓  checking DESCRIPTION meta-information
#>   ─  checking for LF line-endings in source and make files and shell scripts
#>   ─  checking for empty or unneeded directories
#>   ─  building ‘testpkg2_0.0.0.9000.tar.gz’
#>      
#> ── Checking ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── testpkg2 ──
#> Setting env vars:
#> ● _R_CHECK_CRAN_INCOMING_REMOTE_: FALSE
#> ● _R_CHECK_CRAN_INCOMING_       : FALSE
#> ● _R_CHECK_FORCE_SUGGESTS_      : FALSE
#> ● NOT_CRAN                      : true
#> ── R CMD check ─────────────────────────────────────────────────────────────────
#> * using log directory ‘/private/var/folders/03/9x7925g54mncswxx06wpkxl00000gn/T/RtmpU2g3U5/testpkg2.Rcheck’
#> * using R version 3.6.1 (2019-07-05)
#> * using platform: x86_64-apple-darwin15.6.0 (64-bit)
#> * using session charset: UTF-8
#> * using options ‘--no-manual --as-cran’
#> * checking for file ‘testpkg2/DESCRIPTION’ ... OK
#> * this is package ‘testpkg2’ version ‘0.0.0.9000’
#> * package encoding: UTF-8
#> * checking package namespace information ... OK
#> * checking package dependencies ... OK
#> * checking if this is a source package ... OK
#> * checking if there is a namespace ... OK
#> * checking for executable files ... OK
#> * checking for hidden files and directories ... OK
#> * checking for portable file names ... OK
#> * checking for sufficient/correct file permissions ... OK
#> * checking serialization versions ... OK
#> * checking whether package ‘testpkg2’ can be installed ... OK
#> * checking installed package size ... OK
#> * checking package directory ... OK
#> * checking for future file timestamps ... OK
#> * checking DESCRIPTION meta-information ... OK
#> * checking top-level files ... OK
#> * checking for left-over files ... OK
#> * checking index information ... OK
#> * checking package subdirectories ... OK
#> * checking R files for non-ASCII characters ... OK
#> * checking R files for syntax errors ... OK
#> * checking whether the package can be loaded ... OK
#> * checking whether the package can be loaded with stated dependencies ... OK
#> * checking whether the package can be unloaded cleanly ... OK
#> * checking whether the namespace can be loaded with stated dependencies ... OK
#> * checking whether the namespace can be unloaded cleanly ... OK
#> * checking dependencies in R code ... OK
#> * checking S3 generic/method consistency ... OK
#> * checking replacement functions ... OK
#> * checking foreign function calls ... OK
#> * checking R code for possible problems ... OK
#> * checking Rd files ... OK
#> * checking Rd metadata ... OK
#> * checking Rd line widths ... OK
#> * checking Rd cross-references ... OK
#> * checking for missing documentation entries ... OK
#> * checking for code/documentation mismatches ... OK
#> * checking Rd \usage sections ... OK
#> * checking Rd contents ... OK
#> * checking for unstated dependencies in examples ... OK
#> * checking examples ... NONE
#> * DONE
#> Status: OK
#> ── R CMD check results ──────────────────────────────── testpkg2 0.0.0.9000 ────
#> Duration: 14.4s
#> 
#> 0 errors ✓ | 0 warnings ✓ | 0 notes ✓

library(testpkg2)
my_summary_function(iris, Species, Sepal.Length)
#> # A tibble: 3 x 2
#>   Species    weighted_count
#>   <fct>               <dbl>
#> 1 setosa               250.
#> 2 versicolor           297.
#> 3 virginica            329.

That's what I'd expect because those names already exist in the function as argument names.

The second example uses .data but it actually doesn't need to, as I think you've surmised. Basically, .data is for when you already know the name (although users can also supply column names as strings and have the function look them up through .data, e.g., function(x = "some_var") .data[[x]]).
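As a rough illustration of the string case, here is a sketch where the caller passes column names as character strings and the function looks them up with `.data[[ ]]`. The function name `summarise_by_name()` and its argument names are made up for this example; inside a package you would also add `@importFrom rlang .data` as shown earlier.

```r
library(dplyr)

# Hypothetical variant: column names arrive as strings, not bare names
summarise_by_name <- function(data, group_name, weight_name) {
  data %>%
    group_by(.data[[group_name]]) %>%
    summarise(weighted_count = sum(.data[[weight_name]]))
}

out <- summarise_by_name(iris, "Species", "Sepal.Length")
print(out)
```

The totals should match the `{{ }}` version above; the difference is only in how the caller names the columns (quoted strings here, bare names there).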

When I say I'm moving more towards .data, I mean it more in the sense of the programming with dplyr example, where I want to work with known variables that I would refer to by their bare names were I to be doing normal data analysis and not package dev. When users are supplying variables to work with, the approach you have should already work.

These approaches are also not in conflict. Here's an example where I group by Species using .data but then sum counts using a user-given variable.

# in the console:
use_package("rlang")

#' Here's another function
#'
#' @param data a data, probably iris
#' @param weight_var this one too
#'
#' @export
#'
#' @importFrom dplyr %>%
#' @importFrom rlang .data
summarize_by_species <- function(data, weight_var) {
  data %>%
    dplyr::group_by(.data$Species) %>%
    dplyr::summarise(weighted_count = sum({{weight_var}}))
}

This generates no warnings in check(), and it also works:

library(testpkg2)
summarize_by_species(iris, Sepal.Length)
#> # A tibble: 3 x 2
#>   Species    weighted_count
#>   <fct>               <dbl>
#> 1 setosa               250.
#> 2 versicolor           297.
#> 3 virginica            329.
