Identify columns with select expressions without actually selecting

ttrodrigz · March 3, 2021, 6:40pm

Given a select expression, I would like to be able to identify and return the column names in the form of a character vector without having to first select the columns.

For example, I might do something like this:

library(dplyr)

mtcars %>%
    select(
        matches("^d"),
        matches("t$"),
        last_col()
    ) %>%
    names()

[1] "disp" "drat" "wt"   "carb"

What would be the best way of doing this without first having to manipulate the data?

Thanks!

joels · March 3, 2021, 7:04pm

You could do the following, which operates on the vector of column names:

names(mtcars)[c(grep("^d|t$", names(mtcars)), length(names(mtcars)))]

But it actually requires less typing with dplyr functions:

names(select(mtcars, matches("^d|t$"), last_col()))

In your code, note that you can avoid a second matches call by using the pattern "^d|t$".
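As a quick check of that equivalence, here is a minimal base R sketch (the variable names are just for illustration). It compares two separate patterns against the single alternation pattern on the column names of mtcars:

```r
cols <- names(mtcars)

# Two separate patterns, merged with union() (drops the duplicate "drat")
two_calls <- union(
  grep("^d", cols, value = TRUE),
  grep("t$", cols, value = TRUE)
)

# One pattern using regex alternation
one_call <- grep("^d|t$", cols, value = TRUE)

two_calls
#> [1] "disp" "drat" "wt"
one_call
#> [1] "disp" "drat" "wt"
```

For mtcars the two results are identical. On other data the order could differ, since the single pattern returns matches in column order while the two-pattern version lists all matches of the first pattern before the second.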

ttrodrigz · March 3, 2021, 7:14pm

True, that does work, but it doesn't quite solve what I'm going after. The goal is to provide any arbitrary select expression and have it return the corresponding column names of a data frame without having to "hard code" it the way you did in your first example. I'm guessing there is a way involving rlang.

joels · March 3, 2021, 7:20pm

You can use the ... argument to allow an arbitrary number of expressions and then capture and splice them inside the function. I'm not sure if this is the "right" way to do this with tidyeval, but this seems to work:

library(tidyverse)

fnc = function(data, ...) {
  select.expr = rlang::exprs(...)
  names(select(data, !!!select.expr))
}

fnc(mtcars, matches("^d|t$"), last_col())
#> [1] "disp" "drat" "wt"   "carb"

fnc(iris, starts_with("Sep"), matches("ies"))
#> [1] "Sepal.Length" "Sepal.Width"  "Species"

fnc(mtcars, 1, 3, 5)
#> [1] "mpg"  "disp" "drat"

Created on 2021-03-03 by the reprex package (v1.0.0)

ttrodrigz · March 3, 2021, 7:28pm

See how you're still using select() though? That's what I'm trying to avoid.

joels · March 3, 2021, 7:29pm

Your previous response mentioned the ability to provide an arbitrary select expression. Can you say more about what you're trying to accomplish?

ttrodrigz · March 3, 2021, 7:35pm

This is really just a thought experiment. I'm curious whether there is a way, using tidyselect, to provide an expression and a data frame and have it return the names of the columns without actually performing any manipulation on the data first (e.g., selecting).

One practical example might be where you're working with a massive data frame and you want to avoid any manipulation of that data in the form of selecting, to save computation time, but still be able to retrieve the column names. I honestly have never benchmarked how long this takes on a large data frame, so this could be a moot point.

ttrodrigz · March 3, 2021, 7:37pm

I think I found a clue: tidyselect::eval_select()
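A minimal sketch of what that might look like, assuming the tidyselect and rlang packages are installed (the helpers are namespace-qualified here so nothing needs to be attached). As far as I can tell, eval_select() returns a named integer vector of column positions rather than a subset of the data:

```r
# Evaluate a selection against mtcars without subsetting it
loc <- tidyselect::eval_select(
  rlang::expr(c(tidyselect::matches("^d|t$"), tidyselect::last_col())),
  data = mtcars
)

# Names of the matched columns
names(loc)
#> [1] "disp" "drat" "wt"   "carb"

# Their positions in the data frame
unname(loc)
#> [1]  3  5  6 11
```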

joels · March 3, 2021, 7:54pm

select uses eval_select, so I'm not sure you're really avoiding select with that approach.

Here's the code for the select function:

select.data.frame = function (.data, ...) {
    loc <- tidyselect::eval_select(expr(c(...)), .data)
    loc <- ensure_group_vars(loc, .data, notify = TRUE)
    dplyr_col_select(.data, loc, names(loc))
}

We can use the first line of this function in a new function that only calls tidyselect::eval_select. In the code below, I time several approaches on a large data frame (50 columns of one million rows each). As you can see, eval_select and select take about the same amount of time, with medians of roughly 1.3 and 1.4 ms (which is to be expected, since eval_select is doing most of the work of select). Both take much longer than indexing with names (median about 0.017 ms), but they are still very fast in an absolute sense. Furthermore, select doesn't take longer on the large data frame than on a small one: the medians for x_sel and mtcars_sel are nearly identical.

library(microbenchmark)
library(tidyverse)

fnc2 = function (.data, ...) {
  loc <- tidyselect::eval_select(expr(c(...)), .data)
  names(.data)[loc]
}

set.seed(3)
x = replicate(50, rnorm(1e6)) %>% as.data.frame()

microbenchmark(
  x_sel = select(x, matches("1|3"), last_col()),
  x_eval_sel = fnc2(x, matches("1|3"), last_col()),
  x_names = names(x)[c(grep("1|3", names(x)), length(names(x)))],
  mtcars_sel = select(mtcars, matches("^d|t$"), last_col()),
  unit = "ms"
)
#> Unit: milliseconds
#>        expr      min       lq       mean   median        uq       max neval cld
#>       x_sel 1.199423 1.286958 1.52163405 1.435523 1.5678035  7.866493   100   b
#>  x_eval_sel 1.117132 1.181079 1.39159259 1.307535 1.4240925  4.542373   100   b
#>     x_names 0.011266 0.015231 0.01721326 0.016686 0.0191415  0.035172   100  a 
#>  mtcars_sel 1.150405 1.242889 1.64926607 1.420123 1.4656155 24.811602   100   b

Created on 2021-03-03 by the reprex package (v1.0.0)
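As a sanity check, here is a small sketch confirming that an eval_select-based helper returns the same names as select() on mtcars. The name select_names is hypothetical (it mirrors fnc2 above), and the sketch assumes dplyr, tidyselect and rlang are installed. It also shows one caveat: if the selection renames a column, names(loc) carries the new name, while names(.data)[loc] returns the original one.

```r
library(dplyr)

# Hypothetical helper mirroring fnc2: resolve the selection, return original names
select_names <- function(.data, ...) {
  loc <- tidyselect::eval_select(rlang::expr(c(...)), .data)
  names(.data)[loc]
}

# Same result as selecting first and then asking for names
via_helper <- select_names(mtcars, matches("^d|t$"), last_col())
via_select <- names(select(mtcars, matches("^d|t$"), last_col()))
identical(via_helper, via_select)
#> [1] TRUE

# Caveat: a renaming selection gives different answers depending on what you read
loc <- tidyselect::eval_select(rlang::expr(c(new = mpg)), mtcars)
names(loc)          # name after renaming
#> [1] "new"
names(mtcars)[loc]  # original column name
#> [1] "mpg"
```

Which of the two you want depends on whether the goal is the names as they exist in the data or the names the selection would produce.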

ttrodrigz · March 3, 2021, 8:01pm

Makes sense, thanks!
