trying to understand why moving code outside a filter statement makes it so much faster

natekratzer · June 11, 2019, 2:16pm

Part of an analysis script at work was running extremely slowly, and after trying a few things we found a huge speed boost from just moving a small piece of code outside of the filter statement. The goal is just to filter out a few brands from that part of the analysis. I've used the midwest dataset below to make a reprex, with counties standing in for brands. The problem is solved in the sense that it runs fast enough for us to use now, but I'm very curious why computing the same thing and saving it as a separate list before the filter statement gives such a dramatic speed boost.

library(tidyverse)
#> Warning: package 'tibble' was built under R version 3.5.3
#> Warning: package 'purrr' was built under R version 3.5.2
#> Warning: package 'dplyr' was built under R version 3.5.3
library(bench)
#> Warning: package 'bench' was built under R version 3.5.3

#load midwest dataset from tidyverse
all_county_df <- midwest

#define some of the counties we want to exclude later
some_county_df <- all_county_df %>%
  filter(county %in% unique(all_county_df$county[1:40]))

#exclude counties with a unique call inside the filter statement
inside_filter <- function() {
  df_filtered <- all_county_df %>%
    filter(!(county %in% unique(some_county_df$county)))
}

#exclude counties with a unique call outside the filter statement
outside_filter <- function() {
  some_county_list <- unique(some_county_df$county)
  df_filtered <- all_county_df %>%
    filter(!(county %in% some_county_list))
}

times <- bench::mark(
  inside_filter(),
  outside_filter()
)

#total time for inside_filter is 1.01 s
#total time for outside_filter is 450.05 ms

Created on 2019-06-11 by the [reprex package](https://reprex.tidyverse.org) (v0.3.0)
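
Before comparing timings, it can help to confirm that the variants really return the same rows. The sketch below is an editorial aside, not part of the original reprex: it reuses the object names from the post and adds a third variant, `no_unique` (a name introduced here for illustration), that drops `unique()` entirely. Since `%in%` only tests membership, duplicates in the lookup vector should not change which rows are kept. It assumes dplyr and ggplot2 (which ships `midwest`) are installed.

```r
library(dplyr)

# midwest ships with ggplot2, which library(tidyverse) attaches in the reprex
all_county_df <- ggplot2::midwest

# counties we want to exclude later, built the same way as in the reprex
some_county_df <- all_county_df %>%
  filter(county %in% unique(all_county_df$county[1:40]))

# variant 1: unique() evaluated inside filter()
inside <- all_county_df %>%
  filter(!(county %in% unique(some_county_df$county)))

# variant 2: lookup vector computed once, before filter()
some_county_list <- unique(some_county_df$county)
outside <- all_county_df %>%
  filter(!(county %in% some_county_list))

# variant 3 (illustrative addition): no unique() at all
no_unique <- all_county_df %>%
  filter(!(county %in% some_county_df$county))

# compare the three results
identical(inside, outside)
identical(inside, no_unique)
```

If both comparisons come back `TRUE`, any timing gap between the variants reflects how the expression is evaluated rather than a difference in what is computed.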

mara · June 11, 2019, 2:27pm

Hmm, I ran your code using the development version of dplyr and I get a much smaller difference.

library(tidyverse)
library(bench)


#load midwest dataset from tidyverse
all_county_df <- midwest

#define some of the counties we want to exclude later
some_county_df <- all_county_df %>%
  filter(county %in% unique(all_county_df$county[1:40]))

#exclude counties with a unique call inside the filter statement
inside_filter <- function() {
  df_filtered <- all_county_df %>%
    filter(!(county %in% unique(some_county_df$county)))
}

#exclude counties with a unique call outside the filter statement
outside_filter <- function() {
  some_county_list <- unique(some_county_df$county)
  df_filtered <- all_county_df %>%
    filter(!(county %in% some_county_list))
}

times <- bench::mark(
  inside_filter(),
  outside_filter()
)
times
#> # A tibble: 2 x 6
#>   expression            min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>       <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 inside_filter()     310µs    347µs     2684.    82.4KB     29.3
#> 2 outside_filter()    307µs    331µs     2879.    82.4KB     29.6

Created on 2019-06-11 by the reprex package (v0.3.0)
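
One way to investigate a gap like the one reported for the older release is to count how many times the lookup expression is actually evaluated. The sketch below is a minimal base R illustration of that diagnostic, not an explanation of what any dplyr version does internally. The wrapper `counting_unique()`, the counter `n_calls`, and the toy data are all made up for this example.

```r
# counter incremented every time the wrapped unique() is evaluated
n_calls <- 0
counting_unique <- function(x) {
  n_calls <<- n_calls + 1
  unique(x)
}

# small made-up data standing in for the county column and the exclusion list
df <- data.frame(
  county = c("ADAMS", "BOND", "CASS", "ADAMS", "DEKALB"),
  stringsAsFactors = FALSE
)
exclude <- c("ADAMS", "ADAMS", "CASS")

# base R subsetting: the right-hand side of %in% is evaluated once
# for the whole column, not once per row
kept <- df[!(df$county %in% counting_unique(exclude)), , drop = FALSE]

n_calls       # number of times unique() ran
kept$county   # rows that survived the exclusion
```

The same wrapper could be dropped into a `filter()` call in place of `unique()` to see how often a given dplyr version evaluates it. A count above one would point at repeated evaluation, while a count of one would suggest the overhead lies elsewhere.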

natekratzer · June 11, 2019, 3:14pm

Hi Mara,

Thanks for running this. I installed the dev version of dplyr and also saw the time difference go away (if anything, inside_filter() may now be slightly faster).

I had originally used dplyr 0.7.8 because of issues in the group_by() function (tidyverse/dplyr issue #4334, "group_by memory efficiency regression") that are due to be fixed in dplyr 0.9, according to the latest comment on that issue thread.

Since I don't have a workaround for the group_by() issue, I'll have to continue using 0.7.8 and outside_filter() in my analysis code. But it is good to know that the speed difference is some sort of quirk of an older version of dplyr.
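
For a script that has to run on more than one dplyr version, a version check can document why the hoisted form is kept. This is only a sketch under stated assumptions: the helper name `needs_hoisting()` is hypothetical, and the cutoff of 0.7.8 is simply the version reported as slow in this thread. The thread does not establish which release changed the behavior.

```r
# hypothetical helper: TRUE when the given dplyr version is at or below
# 0.7.8, the release reported as slow in this thread (cutoff is an assumption)
needs_hoisting <- function(version) {
  as.numeric_version(version) <= as.numeric_version("0.7.8")
}

needs_hoisting("0.7.8")
needs_hoisting("0.8.1.9000")

# check the installed dplyr, if there is one
if (requireNamespace("dplyr", quietly = TRUE)) {
  installed <- utils::packageVersion("dplyr")
  message(
    "dplyr ", as.character(installed),
    if (needs_hoisting(installed)) ": compute the lookup vector before filter()"
    else ": either form should perform similarly"
  )
}
```

Hoisting the lookup vector is harmless on newer versions too, so the check mainly serves as a note to future readers about why the code is written that way.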
