Sometimes I want to apply a function to a dataframe that hugely increases the number of rows, which are then immediately reduced again by another operation (aggregation, filtering, etc.).
Often this situation arises when I'm trying to keep my data pipeline tidy, rather than using a wide format.
I've come to use a workaround: splitting the dataframe into chunks and applying the function to each chunk individually. However, this makes for harder-to-read code and overall feels dirty.
Has anyone come up with a better solution?
Slightly contrived reprex to show what I mean:
suppressPackageStartupMessages(library(tidyverse))
# Generate some data
n <- 500
data1 <- tibble(x = 1:n, f = runif(n))
reps <- 1000
data2 <- tibble(x = rep(1:n, reps), y = runif(n * reps))
# What I want to do:
# However, this can take a long time when both `n` and `reps` are large
# Or worse, it can use up all my available RAM during the join
data1 %>%
  left_join(data2, by = "x") %>%
  group_by(x) %>%
  summarise(z = sum(f * x))
#> # A tibble: 500 x 2
#>        x     z
#>    <int> <dbl>
#>  1     1  918.
#>  2     2  502.
#>  3     3 2726.
#>  4     4 3087.
#>  5     5  775.
#>  6     6 2440.
#>  7     7 5605.
#>  8     8 5928.
#>  9     9 1802.
#> 10    10 6045.
#> # ... with 490 more rows
# My workaround: Split data into groups and apply to groups separately,
# before combining again
n_groups <- 100
split_data <- split(data1, sample.int(n_groups, n, replace = TRUE))
map_dfr(
  split_data,
  . %>%
    left_join(data2, by = "x") %>%
    group_by(x) %>%
    summarise(z = sum(f * x))
)
#> # A tibble: 500 x 2
#>        x       z
#>    <int>   <dbl>
#>  1    26  10721.
#>  2    47  32598.
#>  3   193  27723.
#>  4    36   6901.
#>  5    66  22052.
#>  6   216 169838.
#>  7   367 292813.
#>  8   376  93363.
#>  9   427 113632.
#> 10    64  31491.
#> # ... with 490 more rows
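One way I've tried to make the workaround feel less dirty is to wrap the split/apply/combine dance in a small helper, so the pipeline step itself stays in one readable place. This is just a sketch — `map_by_chunk` and its arguments are names I made up, not from any package:

```r
suppressPackageStartupMessages(library(tidyverse))

# Hypothetical helper: split `data` into random chunks, apply `f` to each
# chunk, and row-bind the results. Peak memory is bounded by the largest
# chunk's intermediate result rather than the full join.
map_by_chunk <- function(data, f, n_groups = 100) {
  data %>%
    split(sample.int(n_groups, nrow(data), replace = TRUE)) %>%
    map_dfr(f)
}

# Usage with the reprex above:
# data1 %>%
#   map_by_chunk(. %>%
#     left_join(data2, by = "x") %>%
#     group_by(x) %>%
#     summarise(z = sum(f * x)))
```

It's still the same workaround underneath, and the output rows come back in chunk order rather than sorted by `x`, so it doesn't really answer my question — but at least the chunking machinery doesn't leak into every pipeline.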