If I were ever tasked with this task, here's how I'd approach it after the tranquilizers kicked in.
- Measure twice, cut once. In this context that means thorough data cleansing and data preservation. I'd approach this through checking in advance that all series are:
a. Numeric
b. Have the same length
c. Have the same start date and end date (or set up process for handling differing time periods)
d. Have no Nulls, NAs, negative values
e. Have no missing values (else, decide on imputation)
f. Have reasonable values‐ideally, a separate source for monthly, quarterly or annual values to check against. This will avoid data transcription errors such as 1 for 100 or 10000 for 100.
Once cleansed, consider storing it in a SQL or other database with audit capability. This may also help with the next consideration. Also, see the archivist package
2. Build to scale. If it takes 10 seconds to run one model, it will take about 28 hours to run 10^5. Some models take longer. Consider error-trapping routines to shuttle off problem series for exception handling. Consider a containerize and dispatch approach and design to that. Think about how to preserve the model output for diagnostics and judgments about outcome.
-
Think functionally: f(x) = y. The argument x should be named identically and the return value y similarly. Differentiating the series and results should be encapsulated completely through the job id. Parameterize everything within the model.
-
Consider the fpp3 package infrastructure,E which has facilities for batch processing models specifically geared to time series.
-
Consider whether to use time series count methods. See Time series count data. I haven't made my own mind up about the necessity for this.
-
Exploratory data analysis with a random sample of say, 20 series.
a. Aggregate the daily data to weekly or monthly, because a forecast horizon of six months is h=180 for daily, which will show unusably wide prediction intervals, possibly showing negative forecasts. Weekly, h=26 is a push, but monthly, h=6 should be tractable.
b. Look at 2020 and pre-2020 separately because of the possibility of a cyclic trend last year.
c. For each series: plot the series, plot the ACF and test for stationarity.
d. Run base naive, mean, random walk with drift and seasonal naive models.
e. Choose a metric, e.g., RMSE or MAE.
f. Divide the series into upto 200 training set observations and run predict with h=6.
g. Test the residuals for autocorrleation of the residuals; consider differencing or transformation if required.
h. See which model is superior. Assess whether it is consistently superior. If it varies by series, decide on a single model as the benchmark. The ARIMA and other models will have to beat this bogey.
- Design and pilot the workflow.
I'll pray for you.