Using rlang::.data$ with dplyr verbs to generate queries

boshek · March 10, 2018, 12:39am

Hi there,

I am trying to migrate over to a more tidy eval of doing things. As a warm up I was trying to be more explicit and use the rlang .data pronoun as outlined here:

We can fix that ambiguity by being more explicit and using the .data pronoun. This will throw an informative error if the variable doesn’t exist:
mutate_y <- function(df) { mutate(df, y = .data$a + .data$x) }

This was an attempt to remove all those ugly R CMD check notes like this:

checking R code for possible problems ... NOTE
function: no visible binding for global variable
'VARIABLE'

However I've run into a problem when using a dplyr verb on a database connection. I was having trouble creating a reproducible example because this manifests itself only AFAIK when running R CMD on a package. So I decided to actually make a package that included an internal sqlite database to illustrate the problem. That can be found here:

https://github.com/boshek/testpackage

When using .data$ on a column directly from a database connection before using collect() like this :
https://github.com/boshek/testpackage/blob/5189fa5db4559789fbee5a0ef6054ba8dc28609c/R/dummy.R#L8-L17

With the package installed you get an error message like this:

dot_data_not_found()
Error: Column `STATION_NUMBER` not found in `.data`

If however you use .data$ on a column directly from a database connection after using collect() like this:
https://github.com/boshek/testpackage/blob/5189fa5db4559789fbee5a0ef6054ba8dc28609c/R/dummy.R#L23-L33

with no error message and an expected output:

using_dot_data()
# A tibble: 1 x 15
  STATION_NUMBER STATION_NAME    PROV_TERR_STATE~ REGIONAL_OFFICE~ HYD_STATUS SED_STATUS LATITUDE LONGITUDE
  <chr>          <chr>           <chr>            <chr>            <chr>      <chr>         <dbl>     <dbl>
1 05AA008        CROWSNEST RIVE~ AB               3                A          D              49.6     -114.
# ... with 7 more variables: DRAINAGE_AREA_GROSS <dbl>, DRAINAGE_AREA_EFFECT <dbl>, RHBN <int>,
#   REAL_TIME <int>, CONTRIBUTOR_ID <int>, OPERATOR_ID <int>, DATUM_ID <int>

using_dot_data() however, has the disadvantage of dealing with an entire table from a database since the data is filtered after collect(). This is a disadvantage/deal-breaker on really big databases. The advantage here is that this does take care of the "no visible binding..." note in R CMD check.

Last example is an instance that does generate the "no visible binding..." note in R CMD check:
https://github.com/boshek/testpackage/blob/5189fa5db4559789fbee5a0ef6054ba8dc28609c/R/dummy.R#L39-L50

with the R CMD check vieweable here:

Question

How do I deal with bare variable names in dplyr verbs that are generating a database query (other than like this in my zzz.R files: if(getRversion() >= "2.15.1") utils::globalVariables(c("STATION_NUMBER"))) given that I want to leverage that efficiency of filtering before collecting the data?

ijlyttle · March 10, 2018, 1:38am

How would something like this work?

temp <- function(cyls) {
  
  sym_cyl <- sym("cyl")
  
  dplyr::filter(mtcars, !!sym_cyl %in% cyls)
  
}

temp(c(4, 6))

Doing it this way might let you use the filter() before the collect().

Of course, you would need to export stuff from rlang, as provided by usethis::use_tidy_eval() (which I think you are already doing).

FWIW, I find this to be an informative conversation: https://github.com/r-lib/rlang/issues/116

Hope this helps

boshek · March 12, 2018, 4:17pm

Thank you @ijlyttle. This definitely fixed my issue though I am not clear why one is able to get away with the .data pronoun when you filter() after collect()

https://github.com/boshek/testpackage/blob/7d13f1a7304a56e7c899ba3e0ad7d85442e00fb1/R/dummy.R#L26-L36

but need to use sym() when you filter before collect:

https://github.com/boshek/testpackage/blob/7d13f1a7304a56e7c899ba3e0ad7d85442e00fb1/R/dummy.R#L8-L20

The latter solution will be my approach moving forward but it isn't immediately obvious to me why both of these approaches can work.

That conversation was definitely helpful. Slowly light bulbs are going on for me in this space.

Sam

ijlyttle · March 12, 2018, 4:56pm

I'm glad it's working for you!

The difference, as I understand it, is where the filter() computation actually takes place. If the filter is done before the collect, dplyr (using dbplyr) translates the filtering statements into a SQL query, then the database uses the SQL query to return only the rows that you want. If the filter is done after the collect, all the rows are sent from the database, then the filtering is done by dplyr in your R session.

The second part of the explanation is that I think that the .data pronoun works only for data in your R session - the SQL translator does not know what to so with it. Using the !!sym... formulation gives the dbplyr SQL translator enough information to make the translation.

Does this help? I'll try to find some links that explain it in more detail.

boshek · March 12, 2018, 5:37pm

Yes this does indeed make sense. I guess it is just this slightly awkward intersection between using dplyr verbs to interact with a database in which the syntax is basically identical to using dplyr verbs to interact with a flat data file.

It makes me wonder if it might be more appropriate, when programming with dbplyr and dplyr, to revert back to creating raw sql strings rather than using dplyr verbs. Obviously this works but it doesn't feel intuitive.

Thank you for your help.

sverchkov · July 29, 2018, 3:35pm

I think it's just a bug with dbplyr not translating the .data$ pronoun properly, and it looks like the only dplyr verb that doesn't have this bug is select().
I opened an issue for it on github https://github.com/tidyverse/dbplyr/issues/132