Column name search algorithm used by select?

Does select use a linear search for column names?

A select() of one column against a tibble with nearly 3/4 of a million columns has been running -- CPU bound -- for a few hours and shows no sign of finishing. If the lookup were a binary search, it should take O(log 750,000) time, which means that even if each access took a full second, the search should have terminated well within a minute.

I'm curious enough that I went to look at the C (& C++) code in the dplyr repository, and quickly discovered that what I really need to do is import the project into a compiled-code debugger so I can trace things. So I tried opening the .Rproj file in RStudio, but didn't see any way of tracing into the compiled code with the debugger.

So, I'm just going to ask: What search algorithm does select use to look up column names in tibbles?

PS: I'm quite aware that there is "absolutely no excuse" to be doing anything with a tibble of this many columns. If you would be so kind, and can resist the admittedly irresistible temptation, please save yourself the tibble of informing me of this.


Quick answer for devs who might see this

The long and the short of this is that I believe rlang:::env_bind_impl() is slow because it runs an R-level for loop over the names of the 3/4 million columns. It has to be a for loop because it uses base::assign() inside, which is not vectorized, but perhaps it could be rewritten in C++ to be much faster?
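
This is only a sketch (list2env() is not what rlang actually uses), but it illustrates why an R-level assign loop is a plausible culprit: base R can bind a whole named list into an environment in one C-level call, while the loop pays per-name overhead:

nms  <- paste0("col", seq_len(25000))
vals <- setNames(as.list(seq_len(25000)), nms)

# one base::assign() per name, mirroring the loop in env_bind_impl()
e1 <- new.env()
system.time(
  for (nm in nms) assign(nm, vals[[nm]], envir = e1)
)

# the whole named list bound to an environment in a single C-level call
e2 <- new.env()
system.time(
  list2env(vals, envir = e2)
)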

More detail

I took a good long look at this, and I believe the issue runs deeper than you might expect. Rather than dig through the source code to see what kind of search is used (I think select.cpp actually calls base R's match() from C++ somewhere in there), I did a performance test first to identify where the problem lies. The results were surprising.

Ignore the horrific example here. The point is that you end up with a 25k-column data frame with 1 row, whose first column is named first.

x <- 1:25000
# a 25,000-element list holding 25000, 24999, ..., 1
list_x <- purrr::map(rev(x), ~.x)
# name the elements "1" through "25000"
names(list_x) <- x
tbl_x <- tibble::as_tibble(list_x)
# rename the first column so we have something easy to select()
names(tbl_x)[1] <- "first"

tbl_x[1, 1:5]
# A tibble: 1 x 5
# first   `2`   `3`   `4`   `5`
#  <int> <int> <int> <int> <int>
#1 25000 24999 24998 24997 24996

I used profvis to get a flame graph of running select(tbl_x, first). Note that I am currently running the development version of dplyr. If you are using 0.7.4 your results will look slightly different, since that version does not use tidyselect (it still contains the code that was later extracted out of dplyr into tidyselect), but rlang:::env_bind_impl() is still there and is still the cause of the slowdown (I tried both). These results are with only 25k columns, and the call took around 2 seconds. I tried 250k columns but gave up because it took too long (the slowdown does not scale linearly).
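
For reference, the same flame graph can be generated programmatically, assuming the profvis package is installed:

profvis::profvis(
  dplyr::select(tbl_x, first)
)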

You can see from the flame graph how dplyr handles your select() call, passing it up the chain to tidyselect and then into rlang at env_bury(). The code for env_bind_impl() looks like this:

> rlang:::env_bind_impl
function (env, data) 
{
    stopifnot(is_vector(data))
    stopifnot(!length(data) || is_named(data))
    nms <- names(data)
    env_ <- get_env(env)
    for (i in seq_along(data)) {
        nm <- nms[[i]]
        base::assign(nm, data[[nm]], envir = env_)
    }
    env
}

At some point that for loop receives a 25k-element named list whose names are the column names of your data frame and whose elements are just the column positions of those names (i.e. the first element of the list is named first and holds 1). It loops through each element and assigns it into an environment that is later used to look up our selection of the first column (as far as I can tell).
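
In other words, the list handed to env_bind_impl() looks roughly like this (my reconstruction for illustration, not the exact object tidyselect builds):

lookup <- setNames(as.list(seq_along(tbl_x)), names(tbl_x))
lookup[["first"]]
# [1] 1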

This loop is what makes things seriously slow. Maybe it's just temporary while they iron out the details of the rlang paradigm; I assume that once all is said and done it will be rewritten in C++ or optimized in some other vectorized way. The problem looks to be base::assign(), since it isn't vectorized and, as far as I can tell, can only assign one item at a time to the environment env_.

This was a good question, thanks for bringing it up! I hope they can speed things up, as this seems to be a major problem.


Thanks for your work and detailed response. I was just about to delete my question out of embarrassment thinking I must have made some newbie error beyond an unforgivably large number of columns. This isn't to say I didn't -- but at least it's good to know I may have advanced the rlang effort, even if inadvertently.

I do have an additional question, however:

In the event that my curiosity gets the better of me again in the future, what development environment did you use to generate the flame graph, and how does one debug C++ code in a GitHub repository that has an .Rproj file?

No need for embarrassment! The community here will always support a curious mind.

Flame graph

This is done completely within RStudio. There are two easy, similar ways to do it.

  1. In the script editor, highlight the lines you want to profile and then select Profile -> Profile Selected Line(s).

  2. Run Profile -> Start Profiling, then run the lines of code that you want to profile. When you are done, there should be a red stop button that says Stop Profiling; click that and it should take you to the flame graph.

Debugging compiled C++ code from RStudio

I haven't found a great way to do this. Honestly, I haven't been able to do it successfully myself. You definitely can't do it right from RStudio; you'll need to use the terminal at some point. I don't think it's easy, but there are a few resources; here are two.

  1. A small section in R Packages by Hadley. http://r-pkgs.had.co.nz/src.html#src-debugging

  2. A video on how to use the gdb debugger with an R process. https://vimeo.com/11937905
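
For what it's worth, the workflow both resources describe boils down to starting R under the debugger from a terminal. Roughly like this (the breakpoint symbol below is just a placeholder for whatever compiled function you want to trace):

# in a terminal, not the RStudio console
R -d gdb                        # start R under gdb (lldb on macOS works similarly)
(gdb) run                       # launches the R session
# ... in R, load the package and get to just before the slow call ...
# press Ctrl+C to drop back into gdb, then:
(gdb) break some_cpp_function   # placeholder for the C/C++ symbol of interest
(gdb) continue                  # return to R and trigger the call to hit the breakpoint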


Great question! I've been hit by some tidyverse performance problems in my code this week and had to drop back to native R to get it to work well enough. (profvis was very handy, though I hadn't realised RStudio had it built in like that; thanks @davis.)

For me the slow method was rename(), but I guess that was more down to it copying all the values than to the lookup time. My dataset had only about 5 columns and 500 rows, but the rename gets called about 15,000 times, so it all added up. I got my runtime down from 30 minutes to 4 minutes by reverting to base R in the hot bits of the code.
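
In case it's useful to anyone else, the kind of base R substitution I mean looks something like this (a sketch of the idea, with placeholder column names, not my actual code):

# dplyr version, called ~15,000 times in the hot path
df <- dplyr::rename(df, new_name = old_name)

# base R equivalent, which skips the tidy-eval overhead on each call
names(df)[names(df) == "old_name"] <- "new_name"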