unique() performance

aampohl · October 8, 2018, 10:47pm

I have noticed a performance difference between R 3.4.4 and 3.5.0 in the base R function unique(). In 3.5.0, unique() seems to run much faster if the data frame doesn't have a lot of factor columns, and much slower if it does. Below is an example R script that should be reproducible in rocker/tidyverse Docker containers:

suppressMessages(install.packages('microbenchmark', repos='https://cran.rstudio.com/', quiet=TRUE))
suppressPackageStartupMessages({
    library(dplyr)
    library(microbenchmark)
})

args = commandArgs(trailingOnly=TRUE)
rver <- args[1]

# I want a dataset (data.frame) with repeated rows, with or without factorized columns
get_dat <- function(dataset, multiplier=3, factorize = FALSE) {
    dat <- as_tibble(dataset)
    if (factorize) {
        dat <- mutate_if(dat, function(x) is.character(x) || is.integer(x), as.factor)
    }
    n_dat <- nrow(dat)
    n <- multiplier * n_dat
    dat <- sample_n(dat, n, replace=TRUE)
    dat <- as.data.frame(dat)
    dat
}

# this is where the unique() operation is tested
do_microbenchmark <- function(dataset, multiplier, factorize, msg) {
    dat <- get_dat(dataset, multiplier, factorize)
    mbdat <- microbenchmark(unique(dat), unit="ms", times=2000L)
    cat(paste0(msg, ':\n'))
    print(mbdat)
}

cat(paste0('\nUsing R version ', rver, '\n======================\n'))
do_microbenchmark(starwars, 5, TRUE, 'starwars dataset, converted to factors')
do_microbenchmark(starwars, 5, FALSE, 'starwars dataset, not converted to factors')

and running it I get:

$ for rver in 3.4.4 3.5.0; do docker run --rm -v $(pwd):/scratch -w /scratch rocker/tidyverse:$rver Rscript microbench-unique.R $rver; done
##
##Using R version 3.4.4
##======================
##starwars dataset, converted to factors:
##Unit: milliseconds
##        expr    min     lq     mean  median     uq     max neval
## unique(dat) 5.4102 5.7713 6.144108 6.00405 6.3163 45.8087  2000
##starwars dataset, not converted to factors:
##Unit: milliseconds
##        expr   min      lq     mean  median     uq     max neval
## unique(dat) 5.983 6.44155 6.781395 6.70885 6.9872 11.0138  2000
##
##Using R version 3.5.0
##======================
##starwars dataset, converted to factors:
##Unit: milliseconds
##        expr    min      lq     mean  median      uq      max neval
## unique(dat) 16.282 28.0554 36.48758 35.8414 43.9043 111.0781  2000
##starwars dataset, not converted to factors:
##Unit: milliseconds
##        expr    min     lq     mean  median     uq     max neval
## unique(dat) 2.0359 2.4058 2.706818 2.54465 2.7514 16.8791  2000
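
For a rough sense of scale, the medians reported above can be compared directly. This is only arithmetic on the posted numbers, and the variable names (medians, r344, r350, ratio) are made up for illustration:

```r
# Median timings in milliseconds, copied from the benchmark output above.
medians <- data.frame(
    case = c("converted to factors", "not converted to factors"),
    r344 = c(6.00405, 6.70885),   # R 3.4.4
    r350 = c(35.8414, 2.54465)    # R 3.5.0
)

# Ratio of the 3.5.0 median to the 3.4.4 median:
# above 1 means 3.5.0 is slower, below 1 means it is faster.
medians$ratio <- round(medians$r350 / medians$r344, 2)

print(medians)
```

On these numbers the factor case is roughly 6 times slower in 3.5.0, while the non-factor case takes well under half the time.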

I know about distinct() from dplyr and other ways (e.g. data.table's unique() implementation), but what I'm interested in is information about the change to the base-R unique() implementation. unique() actually is built on duplicated(), and I saw that there was a bugfix to duplicated()/unique() that maybe arrived in R 3.5.0, but I don't know if this performance issue I'm seeing is related to that or something else.
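
To illustrate what "built on duplicated()" means for data frames, here is a minimal sketch on a tiny made-up data frame (dat, via_unique and via_duplicated are illustrative names, not part of the benchmark script). It only shows that unique() returns the rows that duplicated() does not flag; it says nothing about how duplicated() compares rows internally, which is where the timing difference would have to come from:

```r
# Small data frame with a factor column and repeated rows (illustrative data).
dat <- data.frame(
    name  = factor(c("a", "b", "a", "c", "b")),
    value = c(1L, 2L, 1L, 3L, 2L)
)

# unique() on a data frame keeps the rows that duplicated() does not flag.
via_unique     <- unique(dat)
via_duplicated <- dat[!duplicated(dat), , drop = FALSE]

print(duplicated(dat))                        # one logical value per row
print(identical(via_unique, via_duplicated))  # do both routes give the same result?
```

So any benchmark of unique() on a data frame is, in practice, mostly a benchmark of duplicated() on that data frame.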

Does anyone know anything about this? Thanks,
Andy

aampohl · October 11, 2018, 3:39pm

It seems I haven't started a very interesting topic. I'm a little surprised, because even though there are alternatives to base R's unique(), it is probably in widespread use in the CRAN packageverse.

I traced the change of behavior to a specific revision (74133) of the R source code and contacted the R core developer about it. He described the revision (perhaps jokingly) as "very dubious". But I don't know if my observations are going to lead to any R code changes. I think the changes were meant to fix an accuracy issue and to simplify the code, as opposed to improving performance. And sometimes unique() actually speeds up instead of slowing down. If any discussion of the topic starts soon on the r-devel mailing list, I'll reply here with the thread just in case anyone's interested.

hughparsonage · October 12, 2018, 4:15am

Did you post on R-devel? It seems appropriate.

jcblum · October 12, 2018, 4:36am

Two thoughts on this:

1. It’s good to keep in mind that most people will only ever see the title of your topic before deciding whether to read further. Very general titles (like the current one) tend to be less compelling, I think, compared to, say, “Why is unique() up to 6x slower on R 3.5 vs R 3.4?”
2. The population of people who follow R core development closely enough to have an idea about the answer is… not that many people, I suspect. I don’t know how many of those keep up with this forum often enough to notice this topic go by (and this kinda gets back to point 1 again).

If you do take this to R-devel, I look forward to hearing the outcome!

aampohl · October 19, 2018, 4:24pm

I haven't seen any discussion there yet, but I've discussed it by e-mail with Martin Mächler. He has reproduced the issue and is familiar with that code. The revision he made that caused the slower behavior (in certain circumstances) did manage to fix two bugs. It's not a very high priority, but I imagine it'll be remembered the next time the unique()/duplicated() code is changed.