One single model as opposed to multiple models - how to structure model formula

dougfir · August 31, 2020, 8:16pm

I have been given a task from a stakeholder where they have created multiple models. I am trying to figure out if there's a clever way to use one single model and add relevant predictors to get a similar result. There is afterall one single dataframe where the various models come from.

Some data:

library(tidyverse)
library(lubridate)

df <- data.frame(
  cohort = seq(ymd('2019-01-01'), ymd('2019-12-31'), by = '1 days'),
  n = rnorm(365, 1000, 50) %>% round,
  cohort_cost = rnorm(365, 800, 50)
) %>% 
  crossing(tenure_days = 0:365) %>% 
  mutate(activity_date = cohort + days(tenure_days)) %>% 
  mutate(daily_revenue = rnorm(nrow(.), 20, 1)) %>% 
  group_by(cohort) %>% 
  arrange(activity_date) %>% 
  mutate(cumulative_revenue = cumsum(daily_revenue)) %>% 
  arrange(cohort, activity_date) %>% 
  mutate(payback_velocity = round(cumulative_revenue / cohort_cost, 2)) %>% 
  select(cohort, n, cohort_cost, activity_date, tenure_days, everything())

This creates a data frame that looks like this:

> df %>% glimpse
Rows: 133,590
Columns: 8
Groups: cohort [365]
$ cohort             <date> 2019-01-01, 2019-01-01, 2019-01-01, 2019-01-01, 2019-01-01, 2019-01-01, 2019-01-01, 2019-01-01, 2019-01-01, 2019-01-01, 2019-01-01, 2019-01-01, 2019-01-01, 2019-…
$ n                  <dbl> 1041, 1041, 1041, 1041, 1041, 1041, 1041, 1041, 1041, 1041, 1041, 1041, 1041, 1041, 1041, 1041, 1041, 1041, 1041, 1041, 1041, 1041, 1041, 1041, 1041, 1041, 1041, …
$ cohort_cost        <dbl> 760.2979, 760.2979, 760.2979, 760.2979, 760.2979, 760.2979, 760.2979, 760.2979, 760.2979, 760.2979, 760.2979, 760.2979, 760.2979, 760.2979, 760.2979, 760.2979, 76…
$ activity_date      <date> 2019-01-01, 2019-01-02, 2019-01-03, 2019-01-04, 2019-01-05, 2019-01-06, 2019-01-07, 2019-01-08, 2019-01-09, 2019-01-10, 2019-01-11, 2019-01-12, 2019-01-13, 2019-…
$ tenure_days        <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, …
$ daily_revenue      <dbl> 20.89379, 18.78257, 20.78661, 20.12642, 19.95088, 20.33005, 21.38181, 22.32194, 17.86027, 21.31141, 21.73687, 20.09523, 19.01769, 20.79425, 19.96847, 20.86313, 19…
$ cumulative_revenue <dbl> 20.89379, 39.67636, 60.46297, 80.58940, 100.54028, 120.87033, 142.25214, 164.57408, 182.43435, 203.74576, 225.48263, 245.57785, 264.59555, 285.38980, 305.35826, 3…
$ payback_velocity   <dbl> 0.03, 0.05, 0.08, 0.11, 0.13, 0.16, 0.19, 0.22, 0.24, 0.27, 0.30, 0.32, 0.35, 0.38, 0.40, 0.43, 0.45, 0.48, 0.51, 0.53, 0.56, 0.59, 0.62, 0.64, 0.67, 0.70, 0.72, …
> df %>% head
# A tibble: 6 x 8
# Groups:   cohort [1]
  cohort         n cohort_cost activity_date tenure_days daily_revenue cumulative_revenue payback_velocity
  <date>     <dbl>       <dbl> <date>              <int>         <dbl>              <dbl>            <dbl>
1 2019-01-01  1041        760. 2019-01-01              0          20.9               20.9             0.03
2 2019-01-01  1041        760. 2019-01-02              1          18.8               39.7             0.05
3 2019-01-01  1041        760. 2019-01-03              2          20.8               60.5             0.08
4 2019-01-01  1041        760. 2019-01-04              3          20.1               80.6             0.11
5 2019-01-01  1041        760. 2019-01-05              4          20.0              101.              0.13
6 2019-01-01  1041        760. 2019-01-06              5          20.3              121.              0.16

Context is app installs. The goal is to predict velocity payback. That is, for each day of tenure, how much revenue have we received in total for the group since install as a % of how much we spent on advertising to obtain them.

E.g. the last block above showing the head(df) of the df. The last row of the head shows that we spent 760 on installs that occurred on 2019-01-01 and that by day 6, 2019-01-06 we had received back 121 in revenue for that cohort which is 16% of our spend.

The stakeholder gave me an spreadsheet that has a series of simple models along the lines:

day14_payback_velocity ~ day7_payback_velocity
day30_payback_velocity ~ day14_payback_velocity
day60_payback_velocity ~ day30_payback_velocity
day180_payback_velocity ~ day60_payback_velocity

So they have a simple linear model that predicts payback_velocity where the input variable is the previous arbitrary velocity threshold.

I could make a single model along the lines:

payback_velocity ~ tenure_days

This would have the benefit of having a single model that can make velocity_payback predictions for any horizon just by passing the tenure_days predictor for each of 14, 30, 60, 180 etc.

However, it would miss the predictive input from what the actual payback_velocity was for the previous threshold. E.g. predicting day180_payback_velocity with the existing method would have a single model that benefits from knowing what day60 payback velocity was for that specific cohort. I cannot see a way to provide this in a dynamic manner in the new way that I am thinking off.

[Edit to original post with hopefully clearer wording...]

Suppose that I would like to predict payback velocity for each unique combination of tenure and payback_velocity time horizon.

E.g.

Predict day 180 payback velocity based on day2 payback velocity.
Predict day 10 payback velocity based on day3 payback velocity.
Predict day 100 payback velocity based on day20 payback velocity.
Predict day 100 payback velocity based on day50 payback velocity.
Predict day 100 payback velocity based on day75 payback velocity.

Basically I would like a prediction for every combination of target and input with the condition that target > input. E.g. it's nonsensical to 'predict' day 5 velocity based on day 10 velocity in this context.

Nuclear option would be to create a model for each combination using crossing()/expand.grid() for each combination along with purrr::map functions to create a model for each. Rather than doing that I was hoping that there might be a way to get all of this into a single model?

technocrat · August 31, 2020, 10:25pm

Error in crossing(., tenure_days = 0:365) : 
  could not find function "crossing"

Package?

dougfir · August 31, 2020, 10:28pm

I have tidyverse loaded but I think that crossing comes might come from tidyr

technocrat · August 31, 2020, 11:53pm

With tidyr::crossing(tenure_days = 0:365) the resulting df is as shown. df is a {base} function name, renamed here as x.

f(x) = (y), where x = x, y = velocity payback. For y = y_i ... y_n and y_j ... y_k for x_1 ... x_n where target > x_i

Find f., a matter of composing function objects over data objects. Doable.

dougfir · September 1, 2020, 3:24pm

Thanks for your feedback, I have rewritten a new post that I hope is clearer and am requesting the mods delete this post.

system · September 22, 2020, 3:47pm

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.