(Note: I have received feedback that the approach below is flawed in a statistical sense, and I have taken on board that time series methods might be more appropriate. For the here and now, though, I would still like to get my code block to run successfully, so this is more a computational question than a stats or modeling question.)
I'm seeking suggestions or tips on how to overcome a memory issue. My backup plan would be to write to rds in chunks, with purrr::map nested within a regular for loop. But I wondered if there's a better way?
The code block below creates two new data frames, mydf and mydf_wide:
# Set up
library(tidyverse)
library(lubridate)
library(foreach)
# Create data
mydf <- data.frame(
cohort = seq(ymd('2019-01-01'), ymd('2019-12-31'), by = '1 days'),
n = rnorm(365, 1000, 50) %>% round,
cohort_cost = rnorm(365, 800, 50)
) %>%
crossing(tenure_days = 0:365) %>%
mutate(activity_date = cohort + days(tenure_days)) %>%
mutate(daily_revenue = rnorm(nrow(.), 20, 1)) %>%
group_by(cohort) %>%
arrange(activity_date) %>%
mutate(cumulative_revenue = cumsum(daily_revenue)) %>%
arrange(cohort, activity_date) %>%
mutate(payback_velocity = round(cumulative_revenue / cohort_cost, 2)) %>%
select(cohort, n, cohort_cost, activity_date, tenure_days, everything())
## wider data
mydf_wide <- mydf %>%
select(cohort, n, cohort_cost, tenure_days, payback_velocity) %>%
group_by(cohort, n, cohort_cost) %>%
pivot_wider(names_from = tenure_days, values_from = payback_velocity, names_prefix = 'velocity_day_')
Object mydf_wide is a data frame of app install cohorts: the cost (cohort_cost) we spent obtaining these installs, followed by wide columns from day 0 to day 365 giving the % of revenue we have received back from each cohort by each day of tenure since the install date (cohort date).
mydf_wide %>% head
# A tibble: 6 x 369
# Groups: cohort, n, cohort_cost [6]
cohort n cohort_cost velocity_day_0 velocity_day_1 velocity_day_2 velocity_day_3 velocity_day_4 velocity_day_5 velocity_day_6 velocity_day_7 velocity_day_8
<date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2019-01-01 1018 856. 0.02 0.05 0.07 0.09 0.12 0.14 0.16 0.18 0.21
2 2019-01-02 1028 873. 0.02 0.05 0.07 0.09 0.12 0.14 0.16 0.19 0.21
3 2019-01-03 965 813. 0.02 0.05 0.07 0.1 0.12 0.15 0.17 0.2 0.22
4 2019-01-04 1026 704. 0.03 0.06 0.09 0.11 0.14 0.17 0.2 0.23 0.25
5 2019-01-05 1036 772. 0.03 0.05 0.08 0.1 0.13 0.16 0.19 0.21 0.24
6 2019-01-06 1089 878. 0.02 0.04 0.07 0.09 0.11 0.14 0.16 0.18 0.21
I need to create linear models to predict day-n payback velocity based on the payback velocity observed on an earlier day. So, for every combination of day 1 to day 365, I need to predict this % velocity payback.
models <- data.frame(
  from = mydf$tenure_days %>% unique,
  to = mydf$tenure_days %>% unique
) %>%
  expand.grid %>%
  filter(to > from) %>%
  filter(from > 0) %>%
  arrange(from) %>%
  mutate(mod_formula = paste0('velocity_day_', to, ' ~ velocity_day_', from))
models %>% head
from to mod_formula
1 1 2 velocity_day_2 ~ velocity_day_1
2 1 3 velocity_day_3 ~ velocity_day_1
3 1 4 velocity_day_4 ~ velocity_day_1
4 1 5 velocity_day_5 ~ velocity_day_1
5 1 6 velocity_day_6 ~ velocity_day_1
6 1 7 velocity_day_7 ~ velocity_day_1
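For scale, the grid above works out to choose(365, 2) model fits. A self-contained check (base R only, rebuilding the same filters on a throwaway grid named grid here for illustration):

```r
# Rebuild the (from, to) grid with the same filters to count the models
grid <- expand.grid(from = 0:365, to = 0:365)
grid <- subset(grid, to > from & from > 0)
nrow(grid)        # 66430
choose(365, 2)    # 66430: all unordered day pairs drawn from 1:365
```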
Data frame 'models' has a model formula for each combination of days 1:365 to predict velocity from day n to day m, e.g. predict day 10 payback velocity based on day 2 payback velocity, or day 100 payback velocity based on day 56 payback velocity, etc.
I would now like to mutate a new column onto models that holds a fitted linear model for each definition in field mod_formula:
models <- models %>%
mutate(model = purrr::map(mod_formula, ~lm(.x, data = mydf_wide)))
With the example data above, this does eventually complete in my instance of R. However, it's slow. My real data are longer than the example above, and when I run the same code my session terminates with a message like 'unexpected closed session' (not exact wording, but along those lines).
Running the individual models themselves takes no time, it seems. I guess it's just that R is holding the whole data frame in memory and that an lm object per row takes up a lot of RAM? I'm unsure.
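One way I could sanity-check that hypothesis (a sketch using mtcars as a stand-in for my data, base R only): an lm object stores the model frame, residuals, fitted values, and a terms environment, so it is far larger than the two coefficients I actually need per model.

```r
# Compare the size of a full lm object with just its coefficients
fit <- lm(mpg ~ wt, data = mtcars)
print(object.size(fit), units = "Kb")       # full lm object
print(object.size(coef(fit)), units = "B")  # just the two coefficients
```

If that difference holds on the real data, storing only coef(.) per row instead of the whole lm object might keep the models frame small enough to fit in RAM.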
Based on my example data and approach above, are there more efficient ways of doing what I am doing? My backup option is to break the data frame into chunks, saving each chunk to rds per iteration, before reading the pieces back into a single data frame with e.g. do.call().
Is there a better way?
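For reference, the chunked-rds backup plan I have in mind would look roughly like this. It's a sketch with toy stand-ins for models and mydf_wide so it runs on its own, base lapply in place of purrr::map, and tempdir() as a hypothetical output location:

```r
# Toy stand-ins -- the real code would use the models and mydf_wide built above
mydf_wide <- data.frame(velocity_day_1 = runif(20), velocity_day_2 = runif(20))
models    <- data.frame(mod_formula = rep("velocity_day_2 ~ velocity_day_1", 5),
                        stringsAsFactors = FALSE)

chunk_size <- 2
out_dir    <- tempdir()
idx <- split(seq_len(nrow(models)), ceiling(seq_len(nrow(models)) / chunk_size))

for (i in seq_along(idx)) {
  chunk <- models[idx[[i]], , drop = FALSE]
  chunk$model <- lapply(chunk$mod_formula,
                        function(f) lm(as.formula(f), data = mydf_wide))
  saveRDS(chunk, file.path(out_dir, sprintf("models_chunk_%03d.rds", i)))
  rm(chunk); gc()  # free the chunk before fitting the next one
}

# Read the pieces back into one data frame
files <- list.files(out_dir, pattern = "^models_chunk_", full.names = TRUE)
models_full <- do.call(rbind, lapply(files, readRDS))
```

The zero-padded chunk number keeps list.files() returning the pieces in fitting order, so the reassembled frame lines up with the original row order.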