library(microbenchmark) # we will use this later to time our results.
n <- 100
p <- 10
set.seed(123)
df <- data.frame(matrix(runif(n * p), nrow = n))
formulas <- lapply(combn(names(df), 2, simplify = FALSE),
                   paste, collapse = " ~ ")
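Before fitting anything, a quick sanity check on what we just built (using only the objects defined above): combn() gives every unordered pair of column names, so with 10 columns we expect choose(10, 2) = 45 formulas.

```r
length(formulas)       # choose(10, 2) pairwise formulas
#> [1] 45
unlist(formulas[1:3])  # each entry is just a formula string
#> [1] "X1 ~ X2" "X1 ~ X3" "X1 ~ X4"
```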
lm_out <- lm(formulas[[1]], df)
model_size <- object.size(lm_out)
print(model_size, standard = "SI", unit = "auto") # single model
#> 39.5 kB
I'll multiply the model size by 66,430 throughout, since that is the number of models you will be creating.
print(model_size * 66430, standard = "SI", unit = "auto")
#> 2.6 GB
We can shrink that a bit with the argument model = FALSE:
lm_out <- lm(formulas[[1]], df, model = FALSE)
model_size <- object.size(lm_out) # ~ 33 kB without the stored model frame
print(model_size * 66430, standard = "SI", unit = "auto")
#> 2.2 GB
But it's still huge. We can get a bit smaller by storing only the summary of the lm object. But remember, this data has only 100 observations...
lm_out <- summary(lm(formulas[[1]], df, model = FALSE))
lm_out
#> Call:
#> lm(formula = formulas[[1]], data = df, model = FALSE)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.47603 -0.25100 -0.05307 0.26631 0.49798
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 0.54705 0.06277 8.715 7.39e-14 ***
#> X2 -0.09429 0.10873 -0.867 0.388
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.2854 on 98 degrees of freedom
#> Multiple R-squared: 0.007616, Adjusted R-squared: -0.00251
#> F-statistic: 0.7521 on 1 and 98 DF, p-value: 0.3879
We can see the size of the summary of the lm() object is much smaller.
(model_size <- object.size(lm_out))
#> 16184 bytes
print(model_size * 66430, standard = "SI", unit = "auto")
#> 1.1 GB
Let's look at the size of the objects inside a summary.lm() object.
sapply(lm_out, object.size)
#> call terms residuals coefficients aliased
#> 896 3736 7408 968 352
#> sigma df r.squared adj.r.squared fstatistic
#> 56 64 56 56 440
#> cov.unscaled
#> 792
Much of that you can likely do away with for what you appear to be trying to do, and you can always re-create the individual models if you need to. You certainly don't need more than 66 thousand copies of the data!
So, let's drop everything larger than 400 bytes other than the coefficients. (You need `aliased` to be able to print a somewhat meaningful summary.)
drops <- setdiff(names(lm_out)[sapply(lm_out, object.size) > 400], "coefficients")
lm_out[drops] <- NULL
lm_out
#> Call:
#> NULL
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> NA NA NA NA NA
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 0.54705 0.06277 8.715 7.39e-14 ***
#> X2 -0.09429 0.10873 -0.867 0.388
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.2854 on 98 degrees of freedom
(model_size <- object.size(lm_out))
#> 2448 bytes
print(model_size * 66430, standard = "SI", unit = "auto")
#> 162.6 MB
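To see exactly what survived the pruning, we can list the remaining elements (everything under 400 bytes, plus the coefficients we kept deliberately):

```r
# Confirm what survived the pruning: coefficients, aliased, sigma,
# df, r.squared, and adj.r.squared
names(lm_out)
```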
But, you probably only need the coefficients, sigma, and r.squared for what you appear to be doing. Then, even with very large data, you are unlikely to run into memory constraints.
lm_out <- summary(lm(formulas[[1]], df, model = FALSE))[c("coefficients", "sigma","r.squared")]
lm_out
#> $coefficients
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 0.5470461 0.06277036 8.7150387 7.392230e-14
#> X2 -0.0942916 0.10872549 -0.8672447 3.879265e-01
#>
#> $sigma
#> [1] 0.2853527
#>
#> $r.squared
#> [1] 0.007616174
So, for simplicity, let's turn this into a function and see how much space our 45 models here take up.
lm_lite <- function(formula, data) {
summary(lm(formula, data, model = FALSE))[c("coefficients", "sigma","r.squared")]
}
many_lm <- lapply(formulas, lm_lite, df)
print(object.size(many_lm), standard = "SI", unit = "auto")
#> 69.5 kB
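A side benefit of this approach: since each element of many_lm is a plain named list, it's easy to pull results across all the models at once. For instance, a quick ranking by R-squared (just a sketch using the objects defined above):

```r
# Extract r.squared from each stripped-down model and rank the 45 fits
r2 <- sapply(many_lm, `[[`, "r.squared")
names(r2) <- unlist(formulas)
head(sort(r2, decreasing = TRUE), 3)
```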
Great. Now let's make a function to generate your formulas. We'll have it take a vector of variable names rather than an entire data.frame: that makes it easier to specify which variables you want to combine, and lets you order them more easily if they aren't already ordered.
make_formulas <- function(vars) {
apply(combn(vars, 2), 2, paste, collapse = " ~ ")
}
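A quick illustration with some made-up variable names (the order of the input vector determines which side of the ~ each variable lands on):

```r
make_formulas(c("y", "a", "b"))
#> [1] "y ~ a" "y ~ b" "a ~ b"
```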
And now, let's also give ourselves a bit bigger data to work with...
n <- 5000
p <- 500
set.seed(123)
df <- data.frame(matrix(runif(n * p), nrow = n))
print(object.size(df), standard = "SI", unit = "auto")
#> 20.1 MB
formulas <- make_formulas(names(df))
length(formulas)
#> [1] 124750
lm_out <- lm(formulas[[1]], df)
Let's look at how much space all of our models will take up if we use the base lm() function.
model_size <- object.size(lm_out)
print(model_size, standard = "SI", unit = "auto")
#> 1.3 MB
print(model_size * length(formulas), standard = "SI", unit = "auto")
#> 161.4 GB
We can then compare that to the expected size from our "lm_lite()" function.
lm_out <- lm_lite(formulas[[1]], df)
model_size <- object.size(lm_out)
print(model_size, standard = "SI", unit = "auto")
#> 1.5 kB
print(model_size * length(formulas), standard = "SI", unit = "auto")
#> 191.6 MB
Amazingly, we've realized a nearly 1,000-fold reduction in size!
Now, let's see how long all of this is going to take. We can start by repeatedly benchmarking a subset of 100 models.
f <- formulas[1:100]
# One-hundred models take (on my home PC):
mbr100 <- microbenchmark(
lm_lite = lapply(f, lm_lite, df),
times = 100,
unit = "s"
)
mbr100
#> Unit: seconds
#> expr min lq mean median uq max neval
#> lm_lite 0.2345575 0.2430504 0.2485337 0.2464993 0.250418 0.3265849 100
So, in theory, as long as I didn't have memory issues, all ~125,000 of them would take about
mean_lm_lite <- summary(mbr100)[["mean"]]
cat(round(mean_lm_lite * length(formulas) / 100 / 60, 1), "minutes.")
#> 5.2 minutes.
NOTE: This result is perhaps 10% slower than what I got when I ran the code myself; reprex introduces some overhead.
Now, let's try to really speed things up by introducing some parallel processing and compare. I'm increasing the batch from 100 formulas to 1,000 to give the parallelization a fair shot.
My home PC has an Intel 7800X processor with 6 physical cores and 12 logical ones. I'll start with 6 cores in the cluster. If I were doing this for real and it was going to take much, much longer, I would fiddle with the number of cores until I found what works best on my machine for this task. (I'd also make sure I didn't have a lot of other tasks running at the same time, which I did not do here.)
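If you don't know your machine's core counts offhand, the parallel package can report them (shown here without output, since it varies by machine):

```r
library(parallel)
detectCores(logical = FALSE)  # physical cores
detectCores()                 # logical cores, hyper-threads included
```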
library(parallel)
cl <- makeCluster(mc <- getOption("cl.cores", 6))
clusterExport(cl, "df")
par_mbr <- microbenchmark(
par_lm_lite = parLapply(cl, formulas[1:1000], lm_lite, df),
times = 100,
unit = "s"
)
par_mbr
#> Unit: seconds
#> expr min lq mean median uq max neval
#> par_lm_lite 0.578859 0.6112421 0.6527206 0.6275443 0.6545273 0.8827877 100
So again, in theory, all ~125,000 of them would take about
mean_par_lm_lite <- summary(par_mbr)[["mean"]]
cat(round(mean_par_lm_lite * length(formulas) / 1000 / 60, 1), "minutes.")
#> 1.4 minutes.
A speedup of about 4 times, which isn't terrible for using 6 cores and not trying to optimize the parallelization.
Finally, just for fun, let's run this again with all the models and verify the size of our object.
lml_out <- parLapply(cl, formulas, lm_lite, df)
model_size <- object.size(lml_out)
print(model_size, standard = "SI", unit = "auto")
#> 192.6 MB
I can attest that this did, in fact, take under two minutes to run. Also, note that the size of the final object is almost exactly what we predicted above.
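One bit of housekeeping the code above omits: when you're finished, shut the cluster down so the worker processes don't linger in the background.

```r
stopCluster(cl)  # `cl` is the cluster created with makeCluster() above
```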
Created on 2020-09-03 by the reprex package (v0.3.0)
I hope that satisfactorily guides you through how you might tackle something like this.