Forecasting with fable and with multiple independent variables

ksingh19 · August 20, 2020, 7:53pm

Hi,

Can you please help in using fable forecasting with multiple independent variables to predict the dependent variable "Sales". A small data set is provided below. Thanks!

# Sample Data
df <- data.frame(
             stringsAsFactors = FALSE,
                  check.names = FALSE,
                            Date = c("2012-01-01","2012-02-01","2012-03-01",
                                     "2012-04-01","2012-05-01","2012-06-01",
                                     "2012-07-01","2012-08-01","2012-09-01",
                                     "2012-10-01"),
                        Category = c("Imports","Imports","Imports","Imports",
                                     "Imports","Imports","Imports","Imports",
                                     "Imports","Imports"),
                           Sales = c("1117719","1015562","900476","433667",
                                     "811027","919458","1002769","1139498",
                                     "868219","1046144"),
                 `Interest Rate` = c("10.5","10.5","9.75","9","8.5","8.5","8",
                                     "7.5","7.5","7.25"),
                       Inflation = c("5.6000000000000008E-3",
                                     "4.5000000000000005E-3","2.0999999999999999E-3",
                                     "6.4000000000000003E-3","3.5999999999999999E-3",
                                     "8.0000000000000004E-4","4.3E-3",
                                     "4.0999999999999995E-3","5.6999999999999993E-3",
                                     "5.8999999999999999E-3"),
     `FX USD BRL
(Monthly Avg)` = c("1.7694449999999999","1.7945499999999999",
                                     "1.8409772727272726","1.7467999999999999",
                                     "1.7174","1.8268","1.9083000000000001",
                                     "2.0226999999999999","2.0093999999999999",
                                     "2.0569000000000002")
   )

Creating tsibble object

df <- df %>%
  as_tsibble(key = `Category`, index = Date)

robjhyndman · August 20, 2020, 11:33pm

First you will need to convert the types of data you have to be numeric, and the date to a month.

library(tidyverse)
library(tsibble)
library(fable)

# Convert date to month and other variables to numeric. Then create tsibble
df <- df %>%
  janitor::clean_names() %>%
  mutate(
    date = yearmonth(date),
    sales = as.numeric(sales),
    interest_rate = as.numeric(interest_rate),
    inflation = as.numeric(inflation),
    fx_usd_brl_monthly_avg = as.numeric(fx_usd_brl_monthly_avg)
  ) %>%
  as_tsibble(key = category, index = date)

Now we can fit a regression model. But then you will need to "know" the future values of the predictors when it comes to forecasting. So let's set them at some assumed values for the next 3 months.

# Assumed future values of predictor variables
newx <- new_data(df, 3) %>%
  mutate(
    inflation = 0.003,
    interest_rate = 9,
    fx_usd_brl_monthly_avg = 2
  )

# Fit regression model and forecasts
df %>%
  model(
    fit = TSLM(sales ~ inflation + interest_rate + fx_usd_brl_monthly_avg)
  ) %>%
  forecast(new_data = newx)

#> # A fable: 3 x 8 [1M]
#> # Key:     category, .model [1]
#>   category .model     date               sales  .mean inflation interest_rate
#>   <chr>    <chr>     <mth>              <dist>  <dbl>     <dbl>         <dbl>
#> 1 Imports  fit    2012 Nov N(1233555, 4.8e+10) 1.23e6     0.003             9
#> 2 Imports  fit    2012 Dec N(1233555, 4.8e+10) 1.23e6     0.003             9
#> 3 Imports  fit    2013 Jan N(1233555, 4.8e+10) 1.23e6     0.003             9
#> # … with 1 more variable: fx_usd_brl_monthly_avg <dbl>

^{Created on 2020-08-21 by the reprex package (v0.3.0)}

A regression model with ARIMA errors can be estimated similarly, but replace TSLM with ARIMA.

ksingh19 · August 21, 2020, 2:38am

Thank you so much @robjhyndman!

I do have two more concerns for now:

In the sample data above, I have only 3 numeric variables. But in real data, I have several (23 to be specific apart from category, date, and sales). Some of these variables have future data and some don't. How should I handle the ones that don't have any future data points. Shall I remove them or shall I impute them with median or mean? or I need to provide the last date in each of those variables that don't have future dates.
Because there are several independent variables. What would be the best way to include all the variables and then find the best few among them. Can I use Random Forest or PCA or something else to tackle that part. I have used these two algorithms before with regular data which was not time series data. But I have never used it with R and Time series specifically. How shall I proceed with this?

Thank you!

robjhyndman · August 21, 2020, 2:52am

You will need to either (1) omit variables for which you cannot provide future values, (2) forecast them separately, or (3) provide scenario values for them. See https://otexts.com/fpp3/forecasting-regression.html. Using a mean or median of the historical data is one possible way of forecasting them, but it may not be the best.
Minimize the AIC is best. PCA can also be effective but does not guarantee you the best predictive variables.

ksingh19 · August 23, 2020, 7:03pm

Thanks @robjhyndman!

I read your article which was really helpful in understanding this concept and now am trying to forecast these variables separately, but because many of these stop data input at different dates, I am struggling quite a bit to forecast them. I have created a new reprex to include example of variability in date inputs for different variables, but the original data is huge with several variables and observations.

Here I am trying to find the max date first before forecasting them and getting the error at filter(!is.na(value)) in the second last line of my code below :

Error: Problem with filter() input ..1. x object 'value' not found i Input ..1 is !is.na(value). i The error occured in group 1: ind_variables = "fx".

Am I going in the right direction or is there a better way to forecast them separately? Also, while forecasting these variables separately, I would still need my category as dependent variable, I believe. Please correct me if I am wrong.

Thanks for your help!

# Sample Data
df <- data.frame(
                                               date = c("2018-01-01",
                                                        "2018-02-01",
                                                        "2018-03-01","2018-04-01",
                                                        "2018-05-01","2018-06-01",
                                                        "2018-07-01",
                                                        "2018-08-01","2018-09-01",
                                                        "2018-10-01","2018-11-01",
                                                        "2018-12-01","2019-01-01",
                                                        "2019-02-01",
                                                        "2019-03-01","2019-04-01",
                                                        "2019-05-01","2019-06-01",
                                                        "2019-07-01",
                                                        "2019-08-01","2019-09-01",
                                                        "2019-10-01","2019-11-01",
                                                        "2019-12-01","2020-01-01",
                                                        "2020-02-01",
                                                        "2020-03-01","2020-04-01",
                                                        "2020-05-01","2020-06-01",
                                                        "2020-07-01",
                                                        "2020-08-01","2020-09-01",
                                                        "2020-10-01","2020-11-01",
                                                        "2020-12-01"),
                                          category1 = c(2347748,
                                                        2184392,2471622,2184902,
                                                        1517822,2613641,2060101,
                                                        2498327,2580796,
                                                        2186383,2606230,2669831,
                                                        2089040,2131080,2204262,
                                                        2050067,2226354,
                                                        2292815,2179322,2373398,
                                                        2382017,2497947,2782331,
                                                        2563736,2061292,
                                                        2087140,2136628,449335,
                                                        1105069,1535344,NA,NA,
                                                        NA,NA,NA,NA),
                                          category2 = c(1284719,
                                                        825255,1028516,1125614,
                                                        1311307,1221256,
                                                        1109260.65,1196302,1018945,
                                                        1200231,1148146,913217,
                                                        1277256,980282,853458,
                                                        1007229.58,1246084,
                                                        1193005,1503203.64,
                                                        1451290,1343771,1582470,
                                                        1233360.75588,
                                                        1271090.30412,1337158,1024617,
                                                        969186,580039,745976,
                                                        588006,NA,NA,NA,NA,NA,
                                                        NA),
                                           interest = c(7,6.75,6.5,
                                                        6.5,6.5,6.5,6.5,6.5,
                                                        6.5,6.5,6.5,6.5,
                                                        6.5,6.5,6.5,6.5,6.5,
                                                        6.5,6.5,6,5.5,5,5,
                                                        4.5,4.5,4.25,3.75,
                                                        3.75,3,2.25,2.25,2.25,
                                                        2.25,2.25,2.25,2.25),
                                          inflation = c(0.0029,
                                                        0.0032,9e-04,0.0022,0.004,
                                                        0.0126,0.0033,-9e-04,
                                                        0.0048,0.0045,-0.0021,
                                                        0.0015,0.0032,0.0043,
                                                        0.0075,0.0057,0.0013,
                                                        0.00013234236394144,
                                                        0.0019,0.0011,-4e-04,
                                                        0.001,0.0051,0.0115,
                                                        0.0021,0.0025,7e-04,
                                                        -0.0031,-0.0038,0.0026,
                                                        NA,NA,NA,NA,NA,NA),
                                                 fx = c(3.2099,
                                                        3.2409,3.2786,3.4069,
                                                        3.6355,3.7726,3.8281,
                                                        3.9292,4.1159,3.7578,
                                                        3.786665,3.885055,
                                                        3.74168181818182,3.723625,
                                                        3.8459,3.8956,4.0009,
                                                        3.8582,3.7787,4.0194,
                                                        4.1209,4.0864,4.1547,
                                                        4.1089,4.1489,4.3404,4.88,
                                                        5.325,5.6429,5.196,
                                                        5.35104166666667,
                                                        5.23063333333333,5.254225,
                                                        5.29031666666667,
                                                        5.32640833333333,5.35),
                                                gdp = c(555675.6,
                                                        528921.5,560120.6,
                                                        559332.8,547016.5,580697.8,
                                                        583000.6,582691.2,
                                                        550474.3,587272.8,590537.5,
                                                        601844.5,575506.6,
                                                        563908.6,574201,587270.1,
                                                        599428.1,593574,
                                                        609182.3,607961.9,570176.4,
                                                        613627.6,627545.9,
                                                        647015.4,608395.1,
                                                        595037.7,599988.2,546313.8,
                                                        555930.3,NA,NA,NA,NA,
                                                        NA,NA,NA),
                                              unemp = c(12.2,12.6,
                                                        13.1,12.9,12.7,12.4,
                                                        12.3,12.1,11.9,11.7,
                                                        11.6,11.6,12,12.4,
                                                        12.7,12.5,12.3,12,11.8,
                                                        11.8,11.8,
                                                        11.552647894735,11.2,11,11.2,
                                                        11.6,12.2,12.6,12.9,NA,
                                                        NA,NA,NA,NA,NA,NA),
                                              trade = c(2.824516098,
                                                        2.998715922,6.419975433,
                                                        5.920627359,
                                                        6.064274705,5.789203554,
                                                        3.873510234,2.774624983,
                                                        5.071292182,5.791589075,
                                                        4.07682386,6.428270414,
                                                        1.697146724,3.275887912,
                                                        4.561136714,5.794189064,
                                                        5.686735734,
                                                        5.020100064,2.093569536,
                                                        3.094216555,3.764325507,
                                                        2.555100864,3.515693495,
                                                        5.599341675,-1.67206317,
                                                        2.337179577,3.859122154,
                                                        6.063049758,
                                                        4.272492579,7.463275482,NA,NA,
                                                        NA,NA,NA,NA),
                                           register = c(5295,4823,
                                                        6868,6963,6455,6466,
                                                        8298,8837,8022,9336,
                                                        8908,8868,8338,8118,
                                                        8956,9922,10572,8978,
                                                        10543,11211,10651,
                                                        11326,10514,10018,8299,
                                                        NA,NA,NA,NA,NA,NA,
                                                        NA,NA,NA,NA,NA)
                                       )

# Assigning Present and Forecast Dates
present <- max(df$date[which(!is.na(df[, "category1"]))])
next_fc <- min(df$date[which(is.na(df[, "category1"]))])
last_fc <- max(df$date)

# Finding Max Date of each variable
ind_var <- df%>%
  select(-category1, -category2)%>%
  gather(key = "ind_variables", value = "Value", -date)%>%
  group_by(ind_variables)%>%
  filter(!is.na(value))%>% 
  summarize(max_date = max(date))

robjhyndman · August 23, 2020, 10:42pm

There is no need to do that. You want forecasts for the future values regardless of where the data for the predictor values finishes. So just forecast the period you want using a model that handles missing values. Here is an example for gdp.

library(tidyverse)
library(tsibble)
library(fable)

df %>%
  mutate(month = yearmonth(date)) %>%
  as_tsibble(index=month) %>%
  model(gdp_model = ARIMA(gdp)) %>%
  forecast(h="1 year")
#> # A fable: 12 x 4 [1M]
#> # Key:     .model [1]
#>    .model       month                gdp   .mean
#>    <chr>        <mth>             <dist>   <dbl>
#>  1 gdp_model 2021 Jan N(627775, 4.7e+08) 627775.
#>  2 gdp_model 2021 Feb N(614417, 4.7e+08) 614417.
#>  3 gdp_model 2021 Mar N(619368, 4.7e+08) 619368.
#>  4 gdp_model 2021 Apr N(565694, 4.7e+08) 565694.
#>  5 gdp_model 2021 May N(575310, 4.7e+08) 575310.
#>  6 gdp_model 2021 Jun N(618830, 7.8e+08) 618830.
#>  7 gdp_model 2021 Jul N(647942, 9.5e+08) 647942.
#>  8 gdp_model 2021 Aug N(646721, 9.5e+08) 646721.
#>  9 gdp_model 2021 Sep N(608936, 9.5e+08) 608936.
#> 10 gdp_model 2021 Oct N(652387, 9.5e+08) 652387.
#> 11 gdp_model 2021 Nov N(666305, 9.5e+08) 666305.
#> 12 gdp_model 2021 Dec N(685775, 9.5e+08) 685775.

^{Created on 2020-08-24 by the reprex package (v0.3.0)}

ksingh19 · August 24, 2020, 7:57pm

Thank you @robjhyndman !

How can I add all the variables at once? I was trying do the following and do get the result, but not by variable. I was expecting it to show the result of all the variables in separate columns or rows:

cols <- c(3:9)

df %>%
  mutate(month = yearmonth(date)) %>%
  as_tsibble(index=month) %>%
  model(model = ARIMA(cols)) %>%
  forecast(h="1 year")

robjhyndman · August 24, 2020, 11:58pm

You need to model one variable at a time.

ksingh19 · August 25, 2020, 1:32am

Okay. Is there any other way to handle several variables as in this data, I have 23 or so, but in future if I have way more than that, then modeling one at a time doesn't seem to fit.

robjhyndman · August 25, 2020, 1:47am

If you have a forecasting model that relies on 23 or more other forecasting models, you are likely to get bad results because some of those predictor models will almost certainly give poor results. In addition, your estimate of uncertainty for your main model will not take account of the uncertainty in all the predictor forecasts, and then you will have no idea of the uncertainty in your final forecast.

I would think again about what is the best way to forecast the variable you are interested in. You might find a purely time series model (with no predictors), or a model with a much smaller number of predictors is better.

ksingh19 · August 25, 2020, 1:56am

I see! In that case, I will try to apply random forest or PCA to find few best variables first before proceeding further.

ksingh19 · August 28, 2020, 4:54pm

Hello @robjhyndman!

In your opinion, what should be the maximum number of independent variables to consider while forecasting because of uncertainty as you described earlier.

Thanks for your help!

robjhyndman · August 30, 2020, 1:04am

The issue is whether you can forecast the predictors relatively accurately. You can check if a variable is useful as a predictor by comparing the accuracy of the forecasts (on a test set) when that variable is included in the model, and when it is excluded.

system · September 9, 2020, 12:55pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.