Hi,
I have two data sets df1 and df2 as shown below. While forecasting the sales in these two datasets, I have tried to adjust the actual sales for past few months during the period of COVID-19, as otherwise the forecasting is not optimistic. I am getting reasonable results per our expectations for years 2022-2025. However, the results of 2021 don't seem to be realistic. What is the best way to handle values during pandemic assuming things will be normal soon.
df1 results are fine because sales has picked up back in July with only 3 months of COVID affect. But for df2, the sales is still not near where it was before COVID and thus, adjusting actuals is little confusing for this data with longer COVID affect. Reprex data below shows values from 2019 - 2021 for df2. But real data has values from 2015 - 2025
# Data df2 where sales is still going through COVID affect
df2 <- data.frame(
month = c("2019 Jan","2019 Feb","2019 Mar",
"2019 Apr","2019 May","2019 Jun","2019 Jul","2019 Aug",
"2019 Sep","2019 Oct","2019 Nov","2019 Dec","2020 Jan",
"2020 Feb","2020 Mar","2020 Apr","2020 May","2020 Jun",
"2020 Jul","2020 Aug","2020 Sep","2020 Oct","2020 Nov",
"2020 Dec","2021 Jan","2021 Feb","2021 Mar","2021 Apr",
"2021 May","2021 Jun","2021 Jul","2021 Aug","2021 Sep",
"2021 Oct","2021 Nov","2021 Dec"),
sales = c(1277256,980282,853458,1007229.58,
1246084,1193005,1503203.64,1451290,1343771,1582470,
1233360.75588,1271090.30412,1337158,1024617,969186,580039,
745976,588006,594815,549379.8195,NA,NA,NA,NA,NA,NA,
NA,NA,NA,NA,NA,NA,NA,NA,NA,NA),
A = c(575506.6,563908.6,574201,587270.1,
599428.1,593574,609182.3,607961.9,570176.4,613627.6,
627545.9,647015.4,609026,595426.8,598968.2,545378.5,
557292,553829.3,608986.5,NA,NA,NA,NA,NA,NA,NA,NA,NA,
NA,NA,NA,NA,NA,NA,NA,NA),
B = c(6.5,6.5,6.5,6.5,6.5,6.5,6.5,6,
5.5,5,5,4.5,4.5,4.25,3.75,3.75,3,2.25,2.25,2.125,
2,2.075,2.075,2.075,2.45,2.45,2.45,2.45,2.45,2.45,
2.45,2.45,2.45,2.475,2.7,2.885),
C = c(0.941225,0.895468,0.877436,
0.861329,0.849061,0.828872,0.7482,0.760066,0.737631,0.714138,
0.674789,0.660361,0.606975,0.65002,0.618654,0.561411,
0.452552,0.390509,0.351658,0.342956,0.311385,0.297796,
0.270697,0.249496,0.221222,0.207001,0.204105,0.196503,
0.181137,0.16973,0.156495,0.155258,0.142849,0.138568,
0.127877,0.119766),
D = c(191209,190273,199974,221723,
234599,213995,232861,231522,224014,241786,231590,252037,
184681,193299,156304,51463,56705,122795,163424,173822,
NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,
NA)
)
df2 <- df2%>%
mutate(month = yearmonth(month))
Here is my approach, lots of hard coding, any suggestion on improving it would be great also! Showing example for df2 here.
Actuals
# Actuals Until Feb
act <- df2%>%
slice(1:14)%>%
mutate(month = as.character(month))
#### Adjusting Sales for months affected by COVID - March - July with Forecast
act <- act%>%
mutate(month = yearmonth(month))%>%
as_tsibble(index = month)
covid_sales <- act%>%
gather(key = "variables", value = "value", -month)%>%
model(
arima = ARIMA(value)
)%>%
forecast(h = 5)
covid_sales <- covid_sales%>%
rename("val" = .mean)%>%
select(month, val)%>%
rename("value" = val)%>%
spread(variables, value)%>%
select(month, sales, A, B, C, D)
covid_sales <- covid_sales%>%
select(month, sales)
act_sales <- act%>%
select(month, sales)
# Actuals from August onwards
act_aug_on <- df2%>%
slice(20:36)%>%
select(month, sales)
# Final Actual Sales
act_sales <- bind_rows(act_sales, covid_sales, act_aug_on)
Actuals - Independent Variables
# Actuals Until Feb
act_var <- act%>%
slice(1:14)%>%
mutate(month = yearmonth(month))%>%
as_tsibble(index = month)
act_var <- act_var%>%
select(everything(), -sales)
# Adjusting Actuals for COVID - March - July with forecasts
covid_var <- act_var%>%
gather(key = "variables", value = "value", -month)%>%
model(
arima = ARIMA(value)
)%>%
forecast(h = 5)
covid_var <- covid_var%>%
rename("val" = .mean)%>%
select(month, variables, val)%>%
rename("value" = val)%>%
spread(variables, value)%>%
select(month, A, B, C, D)
# Actuals from August onwards
act_aug_on_var <- df2%>%
slice(20:36)%>%
select(everything(), -sales)
# Final Actuals for Independent Variables
act_var <- bind_rows(act_var, covid_var, act_aug_on_var)
Final Actual Data
df <- act_sales %>% inner_join(act_var)
Forecasting Independent Variables for modeling
#Forecasts for A
A <- df%>%
select(month, A)%>%
slice(1:19)
A_fc <- A%>%
model(A_model = ARIMA(A)) %>%
forecast(h=17)
A_fc <- A_fc%>%
rename("A1" = .mean)%>%
select(month, A1)%>%
rename("A" = A1)
#Forecasts for D
D <- df%>%
select(month, D)%>%
slice(1:19)
D_fc <- D%>%
model(D_model = ARIMA(D)) %>%
forecast(h=17)
D_fc <- D_fc%>%
rename("D1" = .mean)%>%
select(month, D1)%>%
rename("D" = D1)
Final Data
# Combine Independent Variable Forecasts to original data
df <- df %>%
rows_update(A_fc, by = "month") %>%
rows_update(D_fc, by = "month")
# Final Actuals and Forecasted Independent Variables
act <- df%>%
slice(1:20)
factors <- df%>%
slice(21:36)
With this as final data, I built model with just one independent variable C where it performed the best among all the combinations.
Forecast results look great for 2022 - 2025 (not in this reprex data), but values for 2021 doesn't seem to be realistic. What is the best way to handle such data where we are trying to adjust actuals for the COVID period. In order to get more accurate results for 2021,what should be the approach. I am thinking of building a separate model to show values only for 2020 and 2021 with better results and keeping the results of the rest of the years as is. Also, I am thinking there is a more efficient way of achieving same results with less of hard coding and more of functions and loops maybe. Any help with this is much appreciated.
Thanks for your help!