Apply function to dataframe in chunks

dplyr
purrr

#1

Sometimes I want to apply a function to a dataframe that involves increasing the number of rows hugely before immediately being reduced by another operation (aggregation, filtering, etc.).

Often this situation arises when I'm trying to keep my data pipeline tidy, rather than using a wide format.

I've come to use a workaround: splitting the dataframe's into chunks and applying the function to each chunk individually. However, this makes for harder to read code and overall feels dirty.

Has anyone come up with a better solution?


Slightly contrived reprex to show what I mean:

suppressPackageStartupMessages(library(tidyverse))

# Generate some data
n <- 500
data1 <- tibble(x = 1:n, f = runif(n))

reps <- 1000
data2 <- tibble(x = rep(1:n, reps), y = runif(n * reps))

# What I want to do: 
# However, this can take a long time when both `n` and `reps` are large
# Or worse, it can use up all my available RAM during the join
data1 %>% 
  left_join(data2, by = "x") %>%
  group_by(x) %>% 
  summarise(z = sum(f * x))
#> # A tibble: 500 x 2
#>        x     z
#>    <int> <dbl>
#>  1     1  918.
#>  2     2  502.
#>  3     3 2726.
#>  4     4 3087.
#>  5     5  775.
#>  6     6 2440.
#>  7     7 5605.
#>  8     8 5928.
#>  9     9 1802.
#> 10    10 6045.
#> # ... with 490 more rows

# My workaround: Split data into groups and apply to groups separately,
# before combining again

n_groups <- 100
split_data <- split(data1, sample.int(n_groups, n, replace = TRUE))

map_dfr(
  split_data, 
   . %>% 
    left_join(data2, by = "x") %>%
    group_by(x) %>% 
    summarise(z = sum(f * x))
)
#> # A tibble: 500 x 2
#>        x       z
#>    <int>   <dbl>
#>  1    26  10721.
#>  2    47  32598.
#>  3   193  27723.
#>  4    36   6901.
#>  5    66  22052.
#>  6   216 169838.
#>  7   367 292813.
#>  8   376  93363.
#>  9   427 113632.
#> 10    64  31491.
#> # ... with 490 more rows

#2

There's a blog post by MikeFC from a couple months back that goes through various alternatives to split/apply/combine that might be helpful:


#3

For your particular example, you can achieve the same desired result without joining the two tibbles:

data1 %>%
  group_by(x) %>%
  summarize(z = sum(f * x) * reps)

In general, you may find data.table faster when working with data sets that have large numbers of groups. In the majority of scenarios I have encountered, the execution speed of data.table is not worth forgoing the other benefits of using the tidyverse. But, in cases where the execution speed is worth pursuing, you can use data.table's [ operator to join the tables:

library(data.table)
d1 <- as.data.table(data1)
d2 <- as.data.table(data2)

setkey(d1, "x")
setkey(d2, "x")

d1[d2, ] %>%
  group_by(x) %>%
  summarize(z = sum(f * x))

We can use microbenchmark to compare the speeds of joining using left_join versus the data.table [ operator:

microbenchmark::microbenchmark(
    e1 = data1 %>% dplyr::left_join(data2, by = "x"),
    e2 = d1[d2, ],
    times = 20L
)
#> Unit: milliseconds
#>  expr       min        lq      mean    median       uq      max neval
#>    e1 24.028803 26.777833 30.765806 30.163442 34.67949 38.39312    20
#>    e2  4.023605  4.964478  7.360335  6.503305  9.42847 13.61232    20

#4

Thanks, Mara. This looks really interesting (and some R/tidyverse history thrown in as well!). Seems like split %>% map_dfr is the way to go. It gets a bit more messy if you want to add a progress bar but I believe that's being worked on(?).

Yes, I've looked at data.table before and come to the same conclusion as you. I think the design choices basically just aren't for me. Perhaps it's worth taking another look. Thanks.


#5

Just using data.table for the join and then applying dplyr syntax is a bit strange in my eyes, particularly as it's the group_by operation which is very slow for large numbers of groups in dplyr.

The data.table code would simply be:
d1[d2, .(z = sum(f * x)), x]