Quick answer for devs who might see this
The long and the short of it is that I believe `rlang:::env_bind_impl()` is slow because it runs an R-level for loop over the column names of the 3/4 million columns. It has to be a for loop because it uses `base::assign()` inside, which is not vectorized, but perhaps it could be rewritten in C++ to be much faster?
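As a rough sketch of what a vectorized alternative could look like from R: base R already has `list2env()`, which performs the whole batch of bindings in a single C-level call rather than one `assign()` per name. This is purely illustrative (the names and sizes here are made up), not a claim about how rlang will fix it:

```r
# Illustrative only: bind 25k name -> value pairs into an environment
# in one call instead of looping over base::assign().
data <- as.list(seq_len(25000))
names(data) <- paste0("col", seq_len(25000))

env_loop <- new.env()
for (nm in names(data)) {
  assign(nm, data[[nm]], envir = env_loop)   # what env_bind_impl() does
}

env_vec <- list2env(data, envir = new.env()) # single C-level call

# Both environments end up with identical bindings:
identical(mget(names(data), envir = env_loop),
          mget(names(data), envir = env_vec))
#> [1] TRUE
```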
More detail
I took a good long look at this. I believe the issue runs deeper than you might expect. Rather than dig through the source code to see what kind of search is used (I think `select.cpp` actually calls a base R `match()` from C++ in there somewhere), I first ran a performance test to identify where the problem lies. The results were surprising.
Ignore the horrific example here. The point is that you have a 25k-column data frame with 1 row. The first column is named `first`.
x <- 1:25000
list_x <- purrr::map(rev(x), ~.x)
names(list_x) <- x
tbl_x <- tibble::as_tibble(list_x)
names(tbl_x)[1] <- "first"
tbl_x[1, 1:5]
## A tibble: 1 x 5
##   first   `2`   `3`   `4`   `5`
##   <int> <int> <int> <int> <int>
## 1 25000 24999 24998 24997 24996
I used `profvis` to get a flame graph of running `select(tbl_x, first)`. Note that at this moment I am running the development version of dplyr. If you are using 0.7.4, your results will look slightly different, since it does not use `tidyselect` (it instead has the code that was later extracted out of dplyr into tidyselect), but `rlang:::env_bind_impl()` is still there and is still the cause of the slowdown (I tried it on both). These results are with only 25k columns, and it took around 2 seconds. I tried running 250k columns, but gave up as it took too much time (this slowdown doesn't scale linearly).
You can see from the flame graph how dplyr handles your `select()` call, passing it up the chain to `tidyselect` and then into `rlang` at `env_bury()`. The code for `env_bind_impl()` looks like this:
> rlang:::env_bind_impl
function (env, data)
{
    stopifnot(is_vector(data))
    stopifnot(!length(data) || is_named(data))
    nms <- names(data)
    env_ <- get_env(env)
    for (i in seq_along(data)) {
        nm <- nms[[i]]
        base::assign(nm, data[[nm]], envir = env_)
    }
    env
}
At some point that for loop receives a 25k-element named list whose names correspond to the column names of your data frame, and where each element holds the column position of its name (i.e. the first element of the list is named `first` and holds `1`). The loop goes through each element and assigns it into an environment that is later used to look up our selection of the `first` column (as far as I can tell).
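To make that concrete, here is a sketch of the kind of name-to-position list the loop receives (illustrative, built by hand from the `tbl_x` above rather than taken from rlang's internals):

```r
# Illustrative only: a named list mapping column names to positions,
# like the one env_bind_impl() loops over.
positions <- as.list(seq_along(tbl_x))
names(positions) <- names(tbl_x)

positions[["first"]]
#> [1] 1
positions[["2"]]
#> [1] 2
```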
This loop is what makes things seriously slow; maybe it's just temporary while they iron out the details of the `rlang` paradigm. I assume once all is said and done it will be rewritten in C++, or optimized in some other vectorized way. The problem looks to be `base::assign()`, since it isn't vectorized and, as far as I can tell, can only assign one item at a time into the environment `env_`.
This was a good question, thanks for bringing it up! I hope they can speed things up, as this seems to be a major problem.