(Note: I have received feedback that the approach below is flawed from a statistical perspective and that time series methods might be more appropriate. For the here and now, though, I would still like to get my code block to run successfully, so this is more a question about computational resources than about stats or modeling.)
I'm seeking suggestions or tips on how to overcome a memory issue. My backup plan would be to write to rds in chunks, using purrr::map nested within a regular for loop. But I wondered if there's a better way?
The code block below creates two data frames:
# Set up
library(tidyverse)
library(lubridate)
library(foreach)

# Create data
mydf <- data.frame(
  cohort = seq(ymd('2019-01-01'), ymd('2019-12-31'), by = '1 days'),
  n = rnorm(365, 1000, 50) %>% round,
  cohort_cost = rnorm(365, 800, 50)
) %>%
  crossing(tenure_days = 0:365) %>%
  mutate(activity_date = cohort + days(tenure_days)) %>%
  mutate(daily_revenue = rnorm(nrow(.), 20, 1)) %>%
  group_by(cohort) %>%
  arrange(activity_date) %>%
  mutate(cumulative_revenue = cumsum(daily_revenue)) %>%
  arrange(cohort, activity_date) %>%
  mutate(payback_velocity = round(cumulative_revenue / cohort_cost, 2)) %>%
  select(cohort, n, cohort_cost, activity_date, tenure_days, everything())

## wider data
mydf_wide <- mydf %>%
  select(cohort, n, cohort_cost, tenure_days, payback_velocity) %>%
  group_by(cohort, n, cohort_cost) %>%
  pivot_wider(names_from = tenure_days,
              values_from = payback_velocity,
              names_prefix = 'velocity_day_')
mydf_wide is a data frame of app install cohorts: the cost 'cohort_cost' we spent acquiring those installs, and then wide columns from day 0 to day 365 giving the % of revenue we have received back from each cohort by each day of tenure since the install date (cohort date).
mydf_wide %>% head
# A tibble: 6 x 369
# Groups:   cohort, n, cohort_cost
  cohort         n cohort_cost velocity_day_0 velocity_day_1 velocity_day_2 velocity_day_3 velocity_day_4 velocity_day_5 velocity_day_6 velocity_day_7 velocity_day_8
  <date>     <dbl>       <dbl>          <dbl>          <dbl>          <dbl>          <dbl>          <dbl>          <dbl>          <dbl>          <dbl>          <dbl>
1 2019-01-01  1018        856.           0.02           0.05           0.07           0.09           0.12           0.14           0.16           0.18           0.21
2 2019-01-02  1028        873.           0.02           0.05           0.07           0.09           0.12           0.14           0.16           0.19           0.21
3 2019-01-03   965        813.           0.02           0.05           0.07           0.1            0.12           0.15           0.17           0.2            0.22
4 2019-01-04  1026        704.           0.03           0.06           0.09           0.11           0.14           0.17           0.2            0.23           0.25
5 2019-01-05  1036        772.           0.03           0.05           0.08           0.1            0.13           0.16           0.19           0.21           0.24
6 2019-01-06  1089        878.           0.02           0.04           0.07           0.09           0.11           0.14           0.16           0.18           0.21
I need to create linear models to predict day n payback velocity based on the payback velocity on an earlier day.
So, for every pair of days from day 1 to day 365, I need to predict this % 'velocity' payback.
models <- data.frame(
  from = mydf$tenure_days %>% unique,
  to = mydf$tenure_days %>% unique
) %>%
  expand.grid %>%
  filter(to > from) %>%
  filter(from > 0) %>%
  arrange(from) %>%
  mutate(mod_formula = paste0('velocity_day_', to, ' ~ velocity_day_', from))

models %>% head
  from to                     mod_formula
1    1  2 velocity_day_2 ~ velocity_day_1
2    1  3 velocity_day_3 ~ velocity_day_1
3    1  4 velocity_day_4 ~ velocity_day_1
4    1  5 velocity_day_5 ~ velocity_day_1
5    1  6 velocity_day_6 ~ velocity_day_1
6    1  7 velocity_day_7 ~ velocity_day_1
Data frame 'models' has a model formula for each pair of days in 1:365, predicting day m velocity from day n velocity, e.g. predict day 10 payback velocity based on day 2 payback velocity; predict day 100 payback velocity based on day 56 payback velocity; etc.
I would now like to mutate a new column onto models that holds the fitted linear model for the formula in the mod_formula field:
models <- models %>% mutate(model = purrr::map(mod_formula, ~lm(.x, data = mydf_wide)))
With the example data above, this does eventually complete in my instance of R, but it's slow. My real data are longer than the example above, and when I run the same code my session terminates with a message along the lines of 'session unexpectedly closed' (not the exact wording).
Running the individual models themselves takes no time, it seems. I guess it's just that R is holding the whole data frame in memory and that an lm object per row takes up a lot of RAM? I'm unsure.
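To sanity-check that, one illustrative comparison (separate from the code above) is the size of a single full lm fit versus just its coefficients; with roughly 66,000 from/to pairs in models, whichever per-row object is kept gets multiplied by that count:

one_fit <- lm(velocity_day_2 ~ velocity_day_1, data = mydf_wide)
print(object.size(one_fit), units = "Kb")       # full lm object: model frame, qr, residuals, terms
print(object.size(coef(one_fit)), units = "b")  # just the intercept and slope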
Based on my example data and approach above, are there more efficient ways of doing what I am doing? The backup option is to break the df into chunks, saving each chunk to rds before reading the pieces back into a single df, e.g. along the lines of the sketch below.
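A rough sketch of that fallback (the chunk size and file names are illustrative choices, not something I have settled on):

library(tidyverse)

# Fit the models in batches, writing each batch to an .rds file so that only
# one batch of fits is held in memory at a time.
chunk_size <- 1000  # illustrative chunk size
chunks <- split(models, ceiling(seq_len(nrow(models)) / chunk_size))

for (i in seq_along(chunks)) {
  fitted_chunk <- chunks[[i]] %>%
    mutate(model = purrr::map(mod_formula, ~lm(.x, data = mydf_wide)))
  saveRDS(fitted_chunk, paste0('models_chunk_', i, '.rds'))
  rm(fitted_chunk)
  gc()  # encourage R to release memory between batches
}

# Reading every chunk straight back in would need roughly the same memory as
# before, so in practice I would read them one at a time or keep only the
# pieces of each fit that are needed downstream.
models_fitted <- list.files(pattern = '^models_chunk_.*\\.rds$') %>%
  purrr::map_dfr(readRDS)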
Is there a better way?
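Or would it make sense to keep only a small tidy summary per model instead of the full lm object? A minimal sketch, assuming broom is available (broom::tidy() is a substitution for illustration, not part of the approach above):

library(broom)  # assumed available for tidy()

# Fit each model but retain only a small tibble of coefficients, letting the
# full lm object (model frame, qr decomposition, residuals) be discarded.
models_light <- models %>%
  mutate(tidy_fit = purrr::map(mod_formula, function(f) {
    fit <- lm(as.formula(f), data = mydf_wide)
    broom::tidy(fit)  # term, estimate, std.error, statistic, p.value
  }))

That way memory would grow with the number of day pairs times a few coefficients, rather than times a full model object.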