Which Predictive analytics model to use to predict client volume in R

Hello,

I am new to predictive analytics. I am trying to predict client volume (how many clients will enter a particular education program) over the next 5 years based on client volume data that I have for the past 2 years (2021, 2022).

Here are the data I have available - Client ID, Client Start Date (format: 2021-12-26), Client End Date(same format as start date), Education Location, Education Provider (there are 3 education providers), Start fiscal year, Start fiscal quarter

I have done some research and the seasonally naive model was one of the first one that jumped out but I wanted to ask your expert advice on how to go about doing this kind of work. Based on the data I have available, how can I go about doing the predictive analytics and which model would be the best to use?

Thank you

Review Hyndman & Athanasopoulos for an introduction to forecast techniques. A forecast extrapolates past data assuming that future patterns will follow those past patterns. A prediction assumes some change to those patterns.

Most forecasts amount to

the near future will be like the recent past only a little more or a little less so

The statistical nature of forecasts of non-deterministic events is that they are subject to random variation making forecasts also subject to confidence intervals that widen as the forecast horizon is extended. As a consequence, depending on the data, there may be surprises such as a lower confidence level becoming negative within the forecast horizon or the upper bound becoming positive in excess of some hard ceiling.

Depending on data and purpose, initial modeling should usually include four baseline models: mean, naive, seasonally naive and random walk. These would be run on a 75/25 split into training/test sets and [pre]selecting a metric, such as RMSE to select a benchmark against which ARIMA and other models will be judged.

@technocrat thank you for providing the link. I would also like to predict duration of the program (from client start date to client end date) the problem with the mean data is that since the current durations can range from 2 days to 600 days, I have to take a look at the median data instead of the mean data. I might have to look at the median client volume instead of the mean since Out of the 3 education providers, one of them has had significantly less clients over the past two fiscal years. Based on your expertise would it be possible to look at the median, naive, seasonally naive and random walk?
So in terms of modelling, would you suggest the ARIMA model to be a good place to start? I would really appreciate your feedback. Thank you

if someone could please recommend some models to look into, I would really appreciate that. Please let me know if you would like to view a sample of the data

Sample data (concrete is easier than abstract sometimes)

would aggregate of the data work help? For example having the data in this format:
image

please let me know and I appreciate the help :slight_smile:

f(x) = y

where

x is the data frame with variables Provider, Start_month, Start_qr and client_volume

y is the

over some forecast horizon.

f is the function or composite function to convert x to y

The first requirement for f is to calculate the duration of each completed program (A,B,C\dots) from the variables contained in x.

x is clearly insufficient to make that calculation because the only date information provided is a variable with an open ended range—there is a start date but no end date.

The purpose of a reprex. See the FAQ is to provide information sufficient to reflect the problem. The data doesn't have to be all the data at hand, all of the variables, or even the same data. Representative data from a built-in data set is fine or even synthetic data. But it must be complete to the extent needed to set up the problem. Here it is not.

@technocrat thank you, I have used the reprex function to reproduce my full data. Right now, I am trying to forecast client volume only based on start date. Here is the full aggregated client volume data:

s_data <-tibble::tribble(
  ~provider_name, ~start_month, ~volume,
             "A",   "2021 Jan",    122L,
             "A",   "2021 Feb",    222L,
             "A",   "2021 Mar",    300L,
             "A",   "2021 Apr",    216L,
             "A",   "2021 May",    173L,
             "A",   "2021 Jun",    177L,
             "A",   "2021 Jul",    174L,
             "A",   "2021 Aug",    196L,
             "A",   "2021 Sep",    201L,
             "A",   "2021 Oct",    214L,
             "A",   "2021 Nov",    227L,
             "A",   "2021 Dec",    218L,
             "A",   "2022 Jan",    211L,
             "A",   "2022 Feb",    214L,
             "A",   "2022 Mar",    269L,
             "A",   "2022 Apr",    228L,
             "A",   "2022 May",    288L,
             "A",   "2022 Jun",    319L,
             "A",   "2022 Jul",    296L,
             "A",   "2022 Aug",    323L,
             "A",   "2022 Sep",    320L,
             "A",   "2022 Oct",    331L,
             "A",   "2022 Nov",    295L,
             "A",   "2022 Dec",    237L,
             "A",   "2023 Jan",    325L,
             "A",   "2023 Feb",    298L,
             "B",   "2021 Jan",     52L,
             "B",   "2021 Feb",     74L,
             "B",   "2021 Mar",     69L,
             "B",   "2021 Apr",     46L,
             "B",   "2021 May",     44L,
             "B",   "2021 Jun",     49L,
             "B",   "2021 Jul",     45L,
             "B",   "2021 Aug",     52L,
             "B",   "2021 Sep",     68L,
             "B",   "2021 Oct",     79L,
             "B",   "2021 Nov",     61L,
             "B",   "2021 Dec",     56L,
             "B",   "2022 Jan",     64L,
             "B",   "2022 Feb",     58L,
             "B",   "2022 Mar",     75L,
             "B",   "2022 Apr",     47L,
             "B",   "2022 May",     63L,
             "B",   "2022 Jun",     78L,
             "B",   "2022 Jul",     64L,
             "B",   "2022 Aug",     93L,
             "B",   "2022 Sep",     68L,
             "B",   "2022 Oct",     54L,
             "B",   "2022 Nov",     78L,
             "B",   "2022 Dec",     55L,
             "B",   "2023 Jan",    106L,
             "B",   "2023 Feb",     85L,
             "C",   "2021 Jan",    135L,
             "C",   "2021 Feb",    231L,
             "C",   "2021 Mar",    278L,
             "C",   "2021 Apr",    246L,
             "C",   "2021 May",    289L,
             "C",   "2021 Jun",    404L,
             "C",   "2021 Jul",    304L,
             "C",   "2021 Aug",    288L,
             "C",   "2021 Sep",    270L,
             "C",   "2021 Oct",    298L,
             "C",   "2021 Nov",    334L,
             "C",   "2021 Dec",    274L,
             "C",   "2022 Jan",    362L,
             "C",   "2022 Feb",    278L,
             "C",   "2022 Mar",    315L,
             "C",   "2022 Apr",    310L,
             "C",   "2022 May",    328L,
             "C",   "2022 Jun",    413L,
             "C",   "2022 Jul",    369L,
             "C",   "2022 Aug",    430L,
             "C",   "2022 Sep",    442L,
             "C",   "2022 Oct",    429L,
             "C",   "2022 Nov",    439L,
             "C",   "2022 Dec",    336L,
             "C",   "2023 Jan",    413L,
             "C",   "2023 Feb",    383L
  )

I have 3 providers and I am trying to forecast the client volume for the next 1 year. I have used these forecast model and used this function to forecast client volume for the next 1 year, however, I am interested in knowing how you would forecast the data since I am having doubts about the accuracy of the model and I am getting errors about missing data:

data_stretch <- s_data %>%
  stretch_tsibble(.init = 12, .step = 1) %>%
  filter(.id !=max(.id))


fit <- data_stretch %>%
  model(
    #seasonal_naive = SNAIVE(volume),
    naive_a = NAIVE(volume),
    drift_a = RW(volume ~ drift()),
    #rw_dr = RW(log(volume) ~ drift()),
    #mean_a = MEAN(volume),
    tslm_a = TSLM(volume ~ trend()), 
    ets = ETS(volume),
    arima = ARIMA(volume)
  )  %>%
  forecast(h = "1 year")

fit %>%
  accuracy(s_data)

This is a good reprex except s_data should be a tsibble rather than a tibble. You would have converted s_data$start_month from typeof character to typeof some time type. And you cut and paste the output of dput(s_data) so that there's the tsibble to continue following along.

@technocrat can you tell me how I can convert it in reprex? because when I look at the tibble in R for s_data, start_month is shown as mth:
image

At some point s_data must have been a tsibble object, because otherwise stretch_tsibble() would have failed like this

library(tsibble)
#> 
#> Attaching package: 'tsibble'
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, union
s_data <-tibble::tribble(
  ~provider_name, ~start_month, ~volume,
  "A",   "2021 Jan",    122,
  "A",   "2021 Feb",    222,
  "A",   "2021 Mar",    300,
  "A",   "2021 Apr",    216,
  "A",   "2021 May",    173,
  "A",   "2021 Jun",    177,
  "A",   "2021 Jul",    174,
  "A",   "2021 Aug",    196,
  "A",   "2021 Sep",    201,
  "A",   "2021 Oct",    214,
  "A",   "2021 Nov",    227,
  "A",   "2021 Dec",    218,
  "A",   "2022 Jan",    211,
  "A",   "2022 Feb",    214,
  "A",   "2022 Mar",    269,
  "A",   "2022 Apr",    228,
  "A",   "2022 May",    288,
  "A",   "2022 Jun",    319,
  "A",   "2022 Jul",    296,
  "A",   "2022 Aug",    323,
  "A",   "2022 Sep",    320,
  "A",   "2022 Oct",    331,
  "A",   "2022 Nov",    295,
  "A",   "2022 Dec",    237,
  "A",   "2023 Jan",    325,
  "A",   "2023 Feb",    298,
  "B",   "2021 Jan",     52,
  "B",   "2021 Feb",     74,
  "B",   "2021 Mar",     69,
  "B",   "2021 Apr",     46,
  "B",   "2021 May",     44,
  "B",   "2021 Jun",     49,
  "B",   "2021 Jul",     45,
  "B",   "2021 Aug",     52,
  "B",   "2021 Sep",     68,
  "B",   "2021 Oct",     79,
  "B",   "2021 Nov",     61,
  "B",   "2021 Dec",     56,
  "B",   "2022 Jan",     64,
  "B",   "2022 Feb",     58,
  "B",   "2022 Mar",     75,
  "B",   "2022 Apr",     47,
  "B",   "2022 May",     63,
  "B",   "2022 Jun",     78,
  "B",   "2022 Jul",     64,
  "B",   "2022 Aug",     93,
  "B",   "2022 Sep",     68,
  "B",   "2022 Oct",     54,
  "B",   "2022 Nov",     78,
  "B",   "2022 Dec",     55,
  "B",   "2023 Jan",    106,
  "B",   "2023 Feb",     85,
  "C",   "2021 Jan",    135,
  "C",   "2021 Feb",    231,
  "C",   "2021 Mar",    278,
  "C",   "2021 Apr",    246,
  "C",   "2021 May",    289,
  "C",   "2021 Jun",    404,
  "C",   "2021 Jul",    304,
  "C",   "2021 Aug",    288,
  "C",   "2021 Sep",    270,
  "C",   "2021 Oct",    298,
  "C",   "2021 Nov",    334,
  "C",   "2021 Dec",    274,
  "C",   "2022 Jan",    362,
  "C",   "2022 Feb",    278,
  "C",   "2022 Mar",    315,
  "C",   "2022 Apr",    310,
  "C",   "2022 May",    328,
  "C",   "2022 Jun",    413,
  "C",   "2022 Jul",    369,
  "C",   "2022 Aug",    430,
  "C",   "2022 Sep",    442,
  "C",   "2022 Oct",    429,
  "C",   "2022 Nov",    439,
  "C",   "2022 Dec",    336,
  "C",   "2023 Jan",    413,
  "C",   "2023 Feb",    383
)


s_data |> tsibble::stretch_tsibble(.init = 12, .step = 2) 
#> Error in UseMethod("key_data"): no applicable method for 'key_data' applied to an object of class "c('tbl_df', 'tbl', 'data.frame')"

Created on 2023-03-31 with reprex v2.0.2

Run the following command

dput(s_data)

and cut-and-paste the console output.

@technocrat thank you so much.

I ran this code below first to turn it into a tsibble so that it is easier for you to copy and paste, please let me know if this works for you

s_data %>%
  mutate(start_month = tsibble::yearmonth(start_month)) %>%
  as_tsibble(index = start_month,
             key = provider_name) 
> dput(s_data)
structure(list(provider_name = c("A", "A", "A", "A", "A", "A", 
"A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", 
"A", "A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B", 
"B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", 
"B", "B", "B", "B", "B", "B", "B", "C", "C", "C", "C", "C", "C", 
"C", "C", "C", "C", "C", "C", "C", "C", "C", "C", "C", "C", "C", 
"C", "C", "C", "C", "C", "C", "C"), start_month = c("2021 Jan", 
"2021 Feb", "2021 Mar", "2021 Apr", "2021 May", "2021 Jun", "2021 Jul", 
"2021 Aug", "2021 Sep", "2021 Oct", "2021 Nov", "2021 Dec", "2022 Jan", 
"2022 Feb", "2022 Mar", "2022 Apr", "2022 May", "2022 Jun", "2022 Jul", 
"2022 Aug", "2022 Sep", "2022 Oct", "2022 Nov", "2022 Dec", "2023 Jan", 
"2023 Feb", "2021 Jan", "2021 Feb", "2021 Mar", "2021 Apr", "2021 May", 
"2021 Jun", "2021 Jul", "2021 Aug", "2021 Sep", "2021 Oct", "2021 Nov", 
"2021 Dec", "2022 Jan", "2022 Feb", "2022 Mar", "2022 Apr", "2022 May", 
"2022 Jun", "2022 Jul", "2022 Aug", "2022 Sep", "2022 Oct", "2022 Nov", 
"2022 Dec", "2023 Jan", "2023 Feb", "2021 Jan", "2021 Feb", "2021 Mar", 
"2021 Apr", "2021 May", "2021 Jun", "2021 Jul", "2021 Aug", "2021 Sep", 
"2021 Oct", "2021 Nov", "2021 Dec", "2022 Jan", "2022 Feb", "2022 Mar", 
"2022 Apr", "2022 May", "2022 Jun", "2022 Jul", "2022 Aug", "2022 Sep", 
"2022 Oct", "2022 Nov", "2022 Dec", "2023 Jan", "2023 Feb"), 
    volume = c(122L, 222L, 300L, 216L, 173L, 177L, 174L, 196L, 
    201L, 214L, 227L, 218L, 211L, 214L, 269L, 228L, 288L, 319L, 
    296L, 323L, 320L, 331L, 295L, 237L, 325L, 298L, 52L, 74L, 
    69L, 46L, 44L, 49L, 45L, 52L, 68L, 79L, 61L, 56L, 64L, 58L, 
    75L, 47L, 63L, 78L, 64L, 93L, 68L, 54L, 78L, 55L, 106L, 85L, 
    135L, 231L, 278L, 246L, 289L, 404L, 304L, 288L, 270L, 298L, 
    334L, 274L, 362L, 278L, 315L, 310L, 328L, 413L, 369L, 430L, 
    442L, 429L, 439L, 336L, 413L, 383L)), class = c("tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -78L))
1 Like

wondering if anyone could help me with this analysis, I would appreciate the help :slight_smile:

some googling around tells me quantreg package has function dynrq() for quantile regression over time-series

This topic was automatically closed 42 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.