Write "dplyr::select" or just "select" in packages?

I am helping a client improve a suite of R packages that they use internally. We have come across a question and are not sure if the answer is substantive or just one of style. I'm hoping that the community can help.

Background: The packages succeed at doing something useful to the organization. But they were written without any regard to passing R CMD check. Most of my current effort is around helping make the packages pass R CMD check.

As a toy example, the packages previously listed dplyr in Depends:, and had a lot of functions like this:

#' Happy Select
#'
#' Just like dplyr's select function. But also prints an inspiring message.
#'
#' @param df a data.frame
#' @param ... other parameters to pass to dplyr's select function
#' @export
happy_select = function(df, ...) {
  print("Today is a wonderful day, isn't it?")
  select(df, ...)
}

Note that there was no @importFrom dplyr select. The code works because dplyr is listed in Depends:. But it triggers two NOTEs:

checking dependencies in R code ... NOTE
Package in Depends field not imported from: ‘dplyr’
These packages need to be imported from (in the NAMESPACE file)
for when this namespace is loaded but not attached.

checking R code for possible problems ... NOTE
happy_select: no visible global function definition for ‘select’
Undefined global functions or variables:
select

I fixed NOTEs like these by moving dplyr from Depends: to Imports: and adding #' @importFrom dplyr select. Note that if you have a large number of packages in Depends: (as we did), you also get this NOTE:

Depends: includes the non-default packages:
...
Adding so many packages to the search path is excessive and importing
selectively is preferable.

The functions now look like this, and generate no complaints from R CMD check.

#' Happy Select
#'
#' Just like dplyr's select function. But also prints an inspiring message.
#'
#' @param df a data.frame
#' @param ... other parameters to pass to dplyr's select function
#' @importFrom dplyr select
#' @export
happy_select = function(df, ...) {
  print("Today is a wonderful day, isn't it?")
  select(df, ...)
}

My client has now asked me an interesting question that I don't know the answer to. They have seen a lot of code that always specifies the package a function comes from. Following that convention, the last line of happy_select would be dplyr::select(df, ...) instead of just select(df, ...).
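
For reference, here is a sketch of what the fully qualified variant would look like. It assumes dplyr stays in Imports:. With dplyr:: at the call site the @importFrom tag should no longer be required, so I have left it out here; as far as I know, keeping both is harmless.

```r
#' Happy Select
#'
#' Just like dplyr's select function. But also prints an inspiring message.
#'
#' @param df a data.frame
#' @param ... other parameters to pass to dplyr's select function
#' @export
happy_select = function(df, ...) {
  print("Today is a wonderful day, isn't it?")
  # The namespace is named at the call site, so the lookup does not depend
  # on what happens to be imported or attached.
  dplyr::select(df, ...)
}
```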

My personal opinion is that:

  1. R CMD check seems to not care either way, so the code is likely "safe" as-is and
  2. dplyr::select seems more cautious, and might be useful to future readers who don't know where select is coming from

That is, I don't really have a strong opinion on this one way or the other. Is there an accepted convention for this in the community? And if so, is there anything substantive to back it up rather than just aesthetics?

Thanks.

It looks like there are trade-offs with either method. This is covered in depth in the R Packages book, in Chapter 2 (The whole game) and Chapter 7 (R code).

As an aside, the sinew package is helpful if you want to find function calls that lack a package prefix: its pretty_namespace() function appends the namespace to functions in a script.

We try to fully qualify function names. Too often a name gets masked by another package. Qualifying also helps ensure you don't miss imports and, as you indicate, helps others know which select or filter function you are using.

For what it's worth, I tend to err on the side of caution and clarity and use the fully qualified call, i.e. dplyr::select(). This is on top of the #' @importFrom dplyr select.

I think the important bits are the following bullet points from section 1.1.3 of Writing R Extensions:

Packages whose namespace only is needed to load the package using library(pkgname) should be listed in the ‘Imports’ field and not in the ‘Depends’ field. Packages listed in import or importFrom directives in the NAMESPACE file should almost always be in ‘Imports’ and not ‘Depends’.
Packages that need to be attached to successfully load the package using library(pkgname) must be listed in the ‘Depends’ field.

If you are calling functions from a package, it should be listed in Imports. I think Depends attaches every package listed there, so all of its exported functions end up on the search path. I have never used Depends myself. I'd also note that usethis::use_package("pkg_name") lists the package under Imports by default. I'd probably opt for @importFrom dplyr select in my own code.
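
To make the loaded-versus-attached distinction concrete, here is a small sketch using only base R. The tools package is just a stand-in I picked because it ships with R and is not normally attached at startup. Loading its namespace is roughly what an Imports: entry does for the importing package.

```r
# Load the namespace only; nothing is added to the search path.
loadNamespace("tools")

isNamespaceLoaded("tools")      # TRUE: the namespace is available
"package:tools" %in% search()   # FALSE in a fresh session: not attached

# A loaded namespace can always be reached with `::`
ext <- tools::file_ext("report.csv")
ext  # "csv"
```

A package listed in Depends: would additionally show up in search(), which is what lets unqualified calls find its functions.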

Someone once added MASS to an automated process, and by doing so, unknowingly masked dplyr::select(). Because it was a long process that ran in the middle of the night, it was hard to diagnose, and I looked in the most obvious places first. TL;DR it took me 4 hours to figure out what was going wrong.

Based on this experience, I do two things:

  1. Always add new packages before existing ones (e.g. library(MASS) should have gone before library(dplyr))
  2. When in doubt, it does no harm at all to prepend the package name
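
The kind of masking I ran into can be reproduced without MASS or dplyr. This is only a sketch: later_pkg is a made-up name standing in for a package attached later, and median() stands in for select().

```r
# Stand-in for a package attached later that also exports median().
# `later_pkg` is hypothetical, not a real package.
later_pkg <- list(median = function(x, ...) "masked!")
attach(later_pkg, name = "later_pkg", warn.conflicts = FALSE)

unqualified <- median(c(1, 2, 10))         # resolved via the search path
qualified   <- stats::median(c(1, 2, 10))  # resolved in the stats namespace

detach("later_pkg", character.only = TRUE)

unqualified  # "masked!"
qualified    # 2
```

The unqualified call silently picks up whatever was attached last, while the qualified call is unaffected by attach order.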

I even went so far as to try to build a tool that could tell me if I was creating a function that already existed on CRAN. It's quite imperfect tbh. But you can try it with library(collidr); collidr::CRAN_collisions("select") (replacing 'select' with any function name you want to check for).

Same experience here! Thanks for sharing the cause. The error message when dplyr's select() doesn't behave as expected is not an obvious one. To make it worse, the MASS package doesn't detach or unload as it should when asked, so I have to start a new session if I want to skip the dplyr:: part. Thus, I use dplyr::select in any script that is part of a project requiring the R Matching package (which loads MASS as a dependency) to avoid headaches!

My assumption was that the dplyr::select option would be a bit slower because R has to find dplyr. I tried microbenchmark, but failed to find the anticipated difference.

library(dplyr)           # provides select() and re-exports %>%
library(microbenchmark)  # provides microbenchmark()

# Small test data frame (note: the name shadows base::source in this session)
source <- data.frame(
  stringsAsFactors = FALSE,
  URN = c("aaa", "bbb", "ccc", "ddd", "eee", "fff", "ggg"),
  VIN = c("xxx", "xxx", "yyy", "yyy", "yyy", "zzz", "abc"),
  EventDate = c("2019-04-29","2019-11-04",
                "2019-06-18","2019-11-21","2020-11-18","2020-01-27",
                "2020-08-22"),
  Q1 = c(10, 5, 8, 10, 2, 4, 3),
  Q2 = c(1, 1, 1, 1, 2, 1, 2),
  Q3 = c(1, 4, 3, 2, 1, 2, 4),
  Q4 = c(2019, 2020, 2020, 2019, 2020, 2021, 2021),
  Sequence = c(1, 2, 1, 2, 3, 0, 0)
)
# Compare the unqualified and the fully qualified call
microbenchmark(
  bob <- source %>% select(Q1, Q4),
  bob2 <- source %>% dplyr::select(Q1, Q4),
  times=1000000
)
#############   OUTPUT   ###################
#Unit: milliseconds
#                                     expr    min     lq     mean median     uq      max neval
#         bob <- source %>% select(Q1, Q4) 1.7483 1.8171 1.926814 1.8482 1.8820 116.5913 1e+06
# bob2 <- source %>% dplyr::select(Q1, Q4) 1.7561 1.8258 1.935550 1.8573 1.8914 106.2843 1e+06

#With so many replicates this program takes a significant time to run. I let it run overnight.

I tried adding a few more functions (mutate, filter, arrange, and ggplot), but the difference in execution times was too small to be significant.
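
If anyone wants to look at the lookup cost on its own, one option is to time a function that does almost no work, so the cost of :: is not drowned out by the roughly 1.8 ms each select() call takes above. This is only a base R sketch; the repetition count is arbitrary and I have not included timings because they vary by machine.

```r
x <- c(3, 1, 2)
n <- 1e5  # arbitrary repetition count, kept small so this runs quickly

# length() does almost no work, so any difference is mostly lookup cost
t_plain     <- system.time(for (i in seq_len(n)) length(x))
t_qualified <- system.time(for (i in seq_len(n)) base::length(x))

# Elapsed seconds for n calls of each form
timings <- c(plain     = t_plain[["elapsed"]],
             qualified = t_qualified[["elapsed"]])
timings
```

Dividing each entry by n gives a rough per-call figure to compare against the millisecond-scale timings of the dplyr verbs.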