Given a select expression, I would like to be able to identify and return the column names in the form of a character vector without having to first select the columns.
True, that does work, but it doesn't quite solve what I'm going after. The goal is to provide any arbitrary select expression and have it return the corresponding column names of a data frame without having to "hard code" them as you did in your first example. I'm guessing there is a way involving rlang.
You can use the ... argument to allow an arbitrary number of expressions and then capture and splice them inside the function. I'm not sure if this is the "right" way to do this with tidyeval, but this seems to work:
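A minimal sketch of the capture-and-splice idea described above, not necessarily the code the poster had in mind. The helper name select_names is hypothetical; it assumes the rlang and tidyselect packages are installed and uses rlang::enquos() to capture the expressions and !!! to splice them into a c() call for tidyselect::eval_select():

```r
library(tidyselect)

# Hypothetical helper: return the names of the columns matched by
# arbitrary selection expressions, without subsetting the data frame.
select_names <- function(.data, ...) {
  dots <- rlang::enquos(...)                           # capture the expressions
  loc <- eval_select(rlang::expr(c(!!!dots)), .data)   # splice them into c()
  names(loc)                                           # loc is a named vector of positions
}

select_names(mtcars, mpg, starts_with("d"))
```

Only the column names and positions are resolved here; the data frame itself is never copied or subset.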
This is really just a thought experiment. I'm curious if there is a way using tidyselect to provide an expression, a data frame, and have it return the names of the columns without actually performing any manipulation on the data first (e.g., selecting).
One practical example may be where you're working with a massive data frame and you want to avoid any manipulation of that data in the form of selecting - to avoid computation time - but still be able to carry out this operation of retrieving column names. I honestly have never benchmarked how long it takes to do this on a large data frame, so this could be a moot point.
select uses eval_select, so I'm not sure you're really avoiding select with that approach.
Here's the code for the select function:
select.data.frame = function (.data, ...) {
  loc <- tidyselect::eval_select(expr(c(...)), .data)
  loc <- ensure_group_vars(loc, .data, notify = TRUE)
  dplyr_col_select(.data, loc, names(loc))
}
We can use the first line of this function in a new function that only calls tidyselect::eval_select. In the code below, I check the time to run various approaches on a large data frame. As you can see, eval_select and select take about the same amount of time (which is to be expected, since eval_select is doing most of the work of select). Both take much longer than indexing with names (roughly 1.3-1.4 ms versus 0.02 ms at the median), but they are still very fast in an absolute sense. Furthermore, select takes about as long on the large data frame (50 columns, one million rows) as it does on the small mtcars data frame.
library(microbenchmark)
library(tidyverse)

# Return the selected column names without subsetting the data
fnc2 = function (.data, ...) {
  loc <- tidyselect::eval_select(expr(c(...)), .data)
  names(.data)[loc]
}

# Large test data: 50 columns (V1 to V50) of one million rows each
set.seed(3)
x = replicate(50, rnorm(1e6)) %>% as.data.frame

microbenchmark(
  x_sel = select(x, matches("1|3"), last_col()),
  x_eval_sel = fnc2(x, matches("1|3"), last_col()),
  x_names = names(x)[c(grep("1|3", names(x)), length(names(x)))],
  mtcars_sel = select(mtcars, matches("^d|t$"), last_col()),
  unit = "ms"
)
#> Unit: milliseconds
#>        expr      min       lq       mean   median        uq       max neval cld
#>       x_sel 1.199423 1.286958 1.52163405 1.435523 1.5678035  7.866493   100   b
#>  x_eval_sel 1.117132 1.181079 1.39159259 1.307535 1.4240925  4.542373   100   b
#>     x_names 0.011266 0.015231 0.01721326 0.016686 0.0191415  0.035172   100  a
#>  mtcars_sel 1.150405 1.242889 1.64926607 1.420123 1.4656155 24.811602   100   b
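To make concrete what eval_select hands back, here is a small check on mtcars using the same selection as the mtcars benchmark above. It is only an illustrative sketch and assumes the tidyselect and rlang packages are installed; the variable names loc and col_names are made up for the example:

```r
library(tidyselect)

# eval_select() resolves the selection to column positions, named by column
loc <- eval_select(rlang::expr(c(matches("^d|t$"), last_col())), mtcars)

# The positions can be turned into a plain character vector of names
col_names <- names(mtcars)[loc]

print(loc)
print(col_names)
```

The result is a named vector of positions (disp, drat, wt, and carb here), so the column names are available without building a subset of the data frame.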