I have a regression model of the form log(GrowthRate) ~ log(day), where GrowthRate is the day-to-day revenue growth from a cohort of customers and day is the number of days since the cohort installed the app.
For example, on Jan 1st 2020 a hundred people install our app and spend $10 that day. On Jan 2nd 2020 (day 2), this same group of people (the cohort that installed on Jan 1st) spends $9. On day 3 the cohort spends $8, and by day 365 it might spend only $0.01.
My data then looks like this:
I've shown the first 3 days of each cohort, but in reality I have years of historical data, and many cohorts have history going back to day 730 or more.
I'm happy with the model. It does well on our evaluation metric, and the relationship between log(GrowthRate) and log(day) is pretty linear.
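To make the setup concrete, here is a minimal sketch of the kind of log-log fit described above. The data is synthetic and every number in it (cohort count, day range, intercept, slope, noise level) is an assumption for illustration only, as is the helper name predict_growth_rate; none of it comes from the real data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (all values hypothetical): 5 cohorts, days 2..30.
n_cohorts = 5
days = np.arange(2, 31)
true_intercept, true_slope = -0.1, -0.5

cohort = np.repeat(np.arange(n_cohorts), days.size)  # cohort id for each row
day = np.tile(days, n_cohorts)                       # days since install for each row
log_growth = (
    true_intercept
    + true_slope * np.log(day)
    + rng.normal(0.0, 0.05, day.size)
)

# Ordinary least squares fit of log(GrowthRate) on log(day).
# np.polyfit returns coefficients from highest degree to lowest.
slope, intercept = np.polyfit(np.log(day), log_growth, deg=1)


def predict_growth_rate(d):
    """Predicted GrowthRate on the original scale for day(s) d (hypothetical helper)."""
    return np.exp(intercept + slope * np.log(d))


print(f"slope={slope:.3f}, intercept={intercept:.3f}")
```

Note that the cohort column plays no role in the fit: every cohort-day row is treated as a separate observation, which is exactly where the independence concern comes from.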
But my observations are not independent. Each row is a combination of cohort and day, so cohort acts like a group. Or does it? I don't want to use cohort as a variable, as my research suggested, because my use case is to predict GrowthRate for new cohorts, and those new cohorts would not exist in the training data.
My observations are not independent, since I'm monitoring the same cohort over time (day) alongside other cohort-day combinations. I'm otherwise happy with the model, which predicts well on new unseen data. So what? Why does this matter, and what should I do, if anything?
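For reference, this is roughly what "predicts well on new unseen data" would mean when the unseen data consists of whole cohorts. It is a sketch under stated assumptions, not my actual pipeline: the data is synthetic, the per-cohort offset is a made-up way of making rows within a cohort correlated, and the split sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in data (all values hypothetical): 20 cohorts, days 2..60.
n_cohorts = 20
days = np.arange(2, 61)
cohort = np.repeat(np.arange(n_cohorts), days.size)
day = np.tile(days, n_cohorts)

# Assumption: each cohort has its own offset, so rows from the same
# cohort are correlated (i.e. not independent).
cohort_offset = rng.normal(0.0, 0.1, n_cohorts)
log_growth = (
    -0.1
    - 0.5 * np.log(day)
    + cohort_offset[cohort]
    + rng.normal(0.0, 0.05, day.size)
)

# Hold out entire cohorts, so no test cohort appears in the training rows.
test_cohorts = rng.choice(n_cohorts, size=5, replace=False)
is_test = np.isin(cohort, test_cohorts)

# Fit on training cohorts only, without using cohort as a predictor.
slope, intercept = np.polyfit(np.log(day[~is_test]), log_growth[~is_test], deg=1)

# Evaluate on the held-out cohorts, on the log scale.
pred = intercept + slope * np.log(day[is_test])
rmse = np.sqrt(np.mean((log_growth[is_test] - pred) ** 2))
print(f"held-out cohort RMSE (log scale): {rmse:.3f}")
```

Splitting by cohort rather than by row keeps all of a cohort's days on one side of the split, which matches the use case of predicting for cohorts that were never in the training data.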