Fitting many small, simple lm() models per row of a data frame using purrr::map and running out of memory

(Note: I have already received feedback on the statistical flaws of the approach below and have taken on board that a time series model might be more appropriate. For the here and now, though, I would still like to get my code block to run successfully, so this is more a computation question than a stats or modelling question.)

I'm seeking suggestions or tips on how to overcome a memory issue. My backup plan would be to write to rds in chunks via a nested loop, with purrr::map nested inside a regular for loop. But I wondered if there's a better way?

The code block below creates two new data frames, mydf and mydf_wide:

# Set up
library(tidyverse)
library(lubridate)
library(foreach)

# Create data
mydf <- data.frame(
  cohort = seq(ymd('2019-01-01'), ymd('2019-12-31'), by = '1 days'),
  n = rnorm(365, 1000, 50) %>% round,
  cohort_cost = rnorm(365, 800, 50)
) %>% 
  crossing(tenure_days = 0:365) %>% 
  mutate(activity_date = cohort + days(tenure_days)) %>% 
  mutate(daily_revenue = rnorm(nrow(.), 20, 1)) %>% 
  group_by(cohort) %>% 
  arrange(activity_date) %>% 
  mutate(cumulative_revenue = cumsum(daily_revenue)) %>% 
  arrange(cohort, activity_date) %>% 
  mutate(payback_velocity = round(cumulative_revenue / cohort_cost, 2)) %>% 
  select(cohort, n, cohort_cost, activity_date, tenure_days, everything())

## wider data
mydf_wide <- mydf %>% 
  select(cohort, n, cohort_cost, tenure_days, payback_velocity) %>% 
  group_by(cohort, n, cohort_cost) %>% 
  pivot_wider(names_from = tenure_days, values_from = payback_velocity, names_prefix = 'velocity_day_')

Object mydf_wide is a data frame of app install cohorts: the cost 'cohort_cost' we spent acquiring those installs, and then wide columns from day 0 to day 365 giving the % of revenue we have received back from each cohort by each day of tenure since the install (cohort) date.

mydf_wide %>% head
# A tibble: 6 x 369
# Groups:   cohort, n, cohort_cost [6]
  cohort         n cohort_cost velocity_day_0 velocity_day_1 velocity_day_2 velocity_day_3 velocity_day_4 velocity_day_5 velocity_day_6 velocity_day_7 velocity_day_8
  <date>     <dbl>       <dbl>          <dbl>          <dbl>          <dbl>          <dbl>          <dbl>          <dbl>          <dbl>          <dbl>          <dbl>
1 2019-01-01  1018        856.           0.02           0.05           0.07           0.09           0.12           0.14           0.16           0.18           0.21
2 2019-01-02  1028        873.           0.02           0.05           0.07           0.09           0.12           0.14           0.16           0.19           0.21
3 2019-01-03   965        813.           0.02           0.05           0.07           0.1            0.12           0.15           0.17           0.2            0.22
4 2019-01-04  1026        704.           0.03           0.06           0.09           0.11           0.14           0.17           0.2            0.23           0.25
5 2019-01-05  1036        772.           0.03           0.05           0.08           0.1            0.13           0.16           0.19           0.21           0.24
6 2019-01-06  1089        878.           0.02           0.04           0.07           0.09           0.11           0.14           0.16           0.18           0.21

I need to create linear models to predict day n payback velocity based on the payback velocity on an earlier day.

So, for every combination of day 1 to day 365, I need to predict this % 'velocity' payback.

models <- data.frame(
  from = mydf$tenure_days %>% unique,
  to = mydf$tenure_days %>% unique
) %>% 
  expand.grid %>% 
  filter(to > from) %>% 
  filter(from > 0) %>% 
  arrange(from) %>% 
  mutate(mod_formula = paste0('velocity_day_', to, ' ~ velocity_day_', from))

models %>% head
  from to                     mod_formula
1    1  2 velocity_day_2 ~ velocity_day_1
2    1  3 velocity_day_3 ~ velocity_day_1
3    1  4 velocity_day_4 ~ velocity_day_1
4    1  5 velocity_day_5 ~ velocity_day_1
5    1  6 velocity_day_6 ~ velocity_day_1
6    1  7 velocity_day_7 ~ velocity_day_1

Data frame 'models' has a model formula for each combination of days 1:365, to predict payback velocity on day m from payback velocity on an earlier day n, e.g. predict day 10 payback velocity based on day 2 payback velocity, day 100 based on day 56, and so on.

I would now like to mutate a new list column onto models containing the fitted linear model for each model definition in mod_formula:

models <- models %>% 
    mutate(model = purrr::map(mod_formula, ~lm(.x, data = mydf_wide)))

With the example data above, this does eventually complete in my instance of R. However, it's slow. My real data are longer than the example data above, and when I run the same code my session terminates with a message along the lines of 'session unexpectedly closed' (not the exact wording).

Running the individual models themselves seems to take no time at all. I guess it's just that R is holding the whole data frame in memory and that an lm object per row takes up a lot of RAM? I'm unsure.
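For what it's worth, a quick way to check that suspicion would be something like the following (just a sketch; I haven't run it on the full data):

# Size of one fitted model on the example data
single_mod <- lm(velocity_day_10 ~ velocity_day_2, data = mydf_wide)
print(object.size(single_mod), units = "auto")

# Rough total if every row of 'models' held an object of that size
print(object.size(single_mod) * nrow(models), units = "auto")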

Based on my example data and approach above, are there more efficient ways of doing what I am doing? The backup option is to break the data frame into chunks, save each chunk to rds as I go, then read the pieces back into a single data frame with e.g. do.call().

Is there a better way?

Yes, don't do that at all. That is going to require 66,430 models.

If this is what is being asked of you, you should push back, because unless there is a very good reason that I'm not thinking of, all you're going to come up with is a bunch of noise masquerading as information.

Now, as to how you could fit this many models without running out of memory: I don't know how much RAM a single lm() object requires; it will depend on your data, since a copy of the data is stored in the output by default, so all of these models will take up a ton of space.

I'd probably just do it in batches with nested for() loops: e.g. fit 100 of them, store the results in a list, write that batch list to an RData (or rds) file, then remove the object from the workspace before continuing.
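Something like the following sketch, using the models and mydf_wide objects from your question (the batch size, directory name, and use of saveRDS() rather than save() are just placeholder choices):

# Fit models in batches, save each batch to disk, then free the memory
dir.create("model_batches", showWarnings = FALSE)
batch_size <- 100
batches <- split(models$mod_formula, ceiling(seq_along(models$mod_formula) / batch_size))

for (i in seq_along(batches)) {
  fits <- lapply(batches[[i]], lm, data = mydf_wide)
  saveRDS(fits, file.path("model_batches", paste0("batch_", i, ".rds")))
  rm(fits)
  gc() # release memory before moving on to the next batch
}

# Later, read the batches back in and combine into a single list
# all_fits <- do.call(c, lapply(list.files("model_batches", full.names = TRUE), readRDS))

That said, the bigger win is shrinking what you store per model, which is what the rest of this answer works through.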

library(microbenchmark) # we will use this later to time our results.
n <- 100
p <- 10
set.seed(123)
df <- data.frame(matrix(runif(n * p), nrow = n))
formulas <- lapply(combn(names(df),
                         2,
                         simplify = FALSE),
                   paste,
                   collapse = " ~ ")


lm_out <- lm(formulas[[1]], df)
model_size <- object.size(lm_out)
print(model_size, standard = "SI", unit = "auto") # single model
#> 39.5 kB

I will multiply the model size by 66430 throughout as that is the number of models you will be creating.

print(model_size * 66430, standard = "SI", unit = "auto")
#> 2.6 GB

We can shrink that a bit with the argument model = FALSE:

lm_out <- lm(formulas[[1]], df, model = FALSE)
model_size <- object.size(lm_out) # smaller without the stored model frame
print(model_size * 66430, standard = "SI", unit = "auto") 
#> 2.2 GB

But, it's still huge. We can get a bit smaller by only storing the summary of the lm object. But remember, this data only has 100 observations...

lm_out <- summary(lm(formulas[[1]], df, model = FALSE))
lm_out
#> Call:
#> lm(formula = formulas[[1]], data = df, model = FALSE)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -0.47603 -0.25100 -0.05307  0.26631  0.49798 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  0.54705    0.06277   8.715 7.39e-14 ***
#> X2          -0.09429    0.10873  -0.867    0.388    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.2854 on 98 degrees of freedom
#> Multiple R-squared:  0.007616,   Adjusted R-squared:  -0.00251 
#> F-statistic: 0.7521 on 1 and 98 DF,  p-value: 0.3879

We can see the size of the summary of the lm() object is much smaller.

(model_size <- object.size(lm_out))
#> 16184 bytes
print(model_size * 66430, standard = "SI", unit = "auto")
#> 1.1 GB

Let's look at the size of the objects inside a summary.lm() object.

sapply(lm_out, object.size)
#>          call         terms     residuals  coefficients       aliased 
#>           896          3736          7408           968           352 
#>         sigma            df     r.squared adj.r.squared    fstatistic 
#>            56            64            56            56           440 
#>  cov.unscaled 
#>           792

Much of that stuff you can likely do away with for what you appear to be trying to do, and you can always re-create the individual models if you need to. You certainly don't need more than 66 thousand copies of the data!
So, let's drop everything larger than 400 bytes other than the coefficients. (You need `aliased` to be able to print a somewhat meaningful summary.)

drops <- setdiff(names(lm_out)[sapply(lm_out, object.size) > 400], "coefficients")
lm_out[drops] <- NULL
lm_out
#> Call:
#> NULL
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#>     NA     NA     NA     NA     NA 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  0.54705    0.06277   8.715 7.39e-14 ***
#> X2          -0.09429    0.10873  -0.867    0.388    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.2854 on 98 degrees of freedom
(model_size <- object.size(lm_out))
#> 2448 bytes
print(model_size * 66430, standard = "SI", unit = "auto")
#> 162.6 MB

But, you probably only need the coefficients, sigma, and r.squared for what you appear to be doing. Then, even with very large data, you are unlikely to run into memory constraints.

lm_out <- summary(lm(formulas[[1]], df, model = FALSE))[c("coefficients", "sigma","r.squared")]
lm_out
#> $coefficients
#>               Estimate Std. Error    t value     Pr(>|t|)
#> (Intercept)  0.5470461 0.06277036  8.7150387 7.392230e-14
#> X2          -0.0942916 0.10872549 -0.8672447 3.879265e-01
#> 
#> $sigma
#> [1] 0.2853527
#> 
#> $r.squared
#> [1] 0.007616174

So, for simplicity, let's turn this into a function and see how much space our 45 models here take up.

lm_lite <- function(formula, data) {
  summary(lm(formula, data, model = FALSE))[c("coefficients", "sigma","r.squared")]
}

many_lm <- lapply(formulas, lm_lite, df)
print(object.size(many_lm), standard = "SI", unit = "auto")
#> 69.5 kB

Great. Now, let's make a function to generate your formulas. We'll have it take a vector of variable names rather than an entire data.frame, as that makes it easier to specify which variables you want to combine and to order them in the event they're not already ordered.

make_formulas <- function(vars) {
  apply(combn(vars, 2), 2, paste, collapse = " ~ ")
}

And now, let's also give ourselves a bit bigger data to work with...

n <- 5000
p <- 500
set.seed(123)
df <- data.frame(matrix(runif(n * p), nrow = n))
print(object.size(df), standard = "SI", unit = "auto")
#> 20.1 MB
formulas <- make_formulas(names(df))
length(formulas)
#> [1] 124750

lm_out <- lm(formulas[[1]], df)

Let's look at how much space all of our models will take up if we use the base lm() function.

model_size <- object.size(lm_out)
print(model_size, standard = "SI", unit = "auto")
#> 1.3 MB
print(model_size * length(formulas), standard = "SI", unit = "auto")
#> 161.4 GB

We can then compare that to the expected size from our lm_lite() function.

lm_out <- lm_lite(formulas[[1]], df)
model_size <- object.size(lm_out)
print(model_size, standard = "SI", unit = "auto")
#> 1.5 kB
print(model_size * length(formulas), standard = "SI", unit = "auto")
#> 191.6 MB

Amazingly, we've realized a nearly 1,000-fold reduction in size!

Now, let's see how long all of this is going to take. We can start by repeatedly benchmarking a subset of 100 models.

f <- formulas[1:100]
# One-hundred models take (on my home PC):
mbr100 <- microbenchmark(
  lm_lite = lapply(f, lm_lite, df),
  times = 100,
  unit = "s"
)
mbr100
#> Unit: seconds
#>     expr       min        lq      mean    median       uq       max neval
#>  lm_lite 0.2345575 0.2430504 0.2485337 0.2464993 0.250418 0.3265849   100

So, in theory, as long as I didn't have memory issues, all ~125,000 of them would take about

mean_lm_lite <- summary(mbr100)[["mean"]]
cat(round(mean_lm_lite * length(formulas) / 100 / 60, 1), "minutes.")
#> 5.2 minutes.

NOTE: This result is perhaps 10% slower than what I got when I ran the code myself; reprex introduces some overhead.

Now, let's try to really speed things up by introducing some parallel processing and compare. I'm increasing the batch from 100 formulas to 1,000 to give the parallelization a fair shot.

My home PC has an Intel 7800X processor with 6 physical cores and 12 logical ones. I'll start with 6 cores in the cluster; if I were doing this for real and it was going to take much, much longer, I would fiddle with the number of cores until I found what works best on my machine for this task. (I'd also make sure I didn't have a lot of other tasks running at the same time, which I did not bother to do here.)

library(parallel)

cl <- makeCluster(mc <- getOption("cl.cores", 6))
clusterExport(cl, "df")

par_mbr <- microbenchmark(
  par_lm_lite = parLapply(cl, formulas[1:1000], lm_lite, df),
  times = 100,
  unit = "s"
)
par_mbr
#> Unit: seconds
#>         expr      min        lq      mean    median        uq       max neval
#>  par_lm_lite 0.578859 0.6112421 0.6527206 0.6275443 0.6545273 0.8827877   100

So again, in theory, all ~125,000 of them would take about

mean_par_lm_lite <- summary(par_mbr)[["mean"]]
cat(round(mean_par_lm_lite * length(formulas) / 1000 / 60, 1), "minutes.")
#> 1.4 minutes.

A speedup of about 4 times, which isn't terrible for using 6 cores and not trying to optimize the parallelization.

Finally, let's run this again, but fit all the models, just for fun, and verify the size of our object.

lml_out <- parLapply(cl, formulas, lm_lite, df)
model_size <- object.size(lml_out)
print(model_size, standard = "SI", unit = "auto")
#> 192.6 MB

I can attest that this did, in fact, take under two minutes to run. Also, note that the size of the final object is almost exactly what we predicted above.
Created on 2020-09-03 by the reprex package (v0.3.0)
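
If you want to plug this back into your original purrr::map() workflow, a sketch (reusing your models and mydf_wide objects and the lm_lite() function above) might look like this; just remember to shut the cluster down when you're done:

# Store only the slim summaries in the list column
models <- models %>% 
  mutate(model = purrr::map(mod_formula, ~lm_lite(.x, data = mydf_wide)))

# Release the parallel workers once you're finished with them
stopCluster(cl)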

I hope that satisfactorily guides you through how you might tackle something like this.

This is really great! Thanks a lot.
