Which predictive analytics model to use to predict client volume in R

Hello,

I am new to predictive analytics. I am trying to predict client volume (how many clients will enter a particular education program) over the next 5 years based on client volume data that I have for the past 2 years (2021, 2022).

Here are the data I have available: Client ID, Client Start Date (format: 2021-12-26), Client End Date (same format as start date), Education Location, Education Provider (there are 3 education providers), Start fiscal year, Start fiscal quarter.

I have done some research, and the seasonally naive model was one of the first ones that jumped out, but I wanted to ask your expert advice on how to go about doing this kind of work. Based on the data I have available, how should I approach the predictive analytics, and which model would be the best to use?

Thank you

Review Hyndman & Athanasopoulos for an introduction to forecast techniques. A forecast extrapolates past data assuming that future patterns will follow those past patterns. A prediction assumes some change to those patterns.

Most forecasts amount to: "the near future will be like the recent past, only a little more or a little less so."

Forecasts of non-deterministic events are subject to random variation, so they come with prediction intervals that widen as the forecast horizon is extended. As a consequence, depending on the data, there may be surprises, such as the lower bound becoming negative within the forecast horizon or the upper bound exceeding some hard ceiling.
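To make the widening concrete, here is a minimal base R sketch for the naive method, where the h-step forecast standard deviation is roughly the residual standard deviation times sqrt(h). The `clients` vector is made-up illustrative data, not your data, and the interval assumes approximately normal, uncorrelated residuals, which may not hold for small counts.

```r
# Hypothetical monthly client counts (made-up numbers for illustration only).
clients <- c(12, 15, 11, 9, 14, 10, 8, 13, 9, 7, 11, 8)

# Naive forecast: every future month is forecast at the last observed value.
point <- tail(clients, 1)

# Residuals of the naive method are the month-to-month changes;
# their standard deviation is used here as a simple estimate of sigma.
sigma <- sd(diff(clients))

# For the naive method, the h-step forecast standard deviation is
# approximately sigma * sqrt(h), so the interval widens with the horizon.
h <- 1:12
z <- qnorm(0.975)  # multiplier for a 95% interval
lower <- point - z * sigma * sqrt(h)
upper <- point + z * sigma * sqrt(h)

intervals <- data.frame(
  h = h,
  point = point,
  lower = round(lower, 1),
  upper = round(upper, 1)
)
print(intervals)

# First horizon at which the lower bound drops below zero,
# which is impossible for a count of clients.
first_negative <- h[which(lower < 0)[1]]
print(first_negative)
```

With a short, noisy series like this, the lower bound goes negative within the first few months, which is the kind of surprise to expect when stretching 2 years of history over a 5-year horizon.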

Depending on data and purpose, initial modeling should usually include four baseline models: mean, naive, seasonally naive and random walk with drift (the naive method is itself a random walk forecast). Fit these on the training portion of a 75/25 training/test split and choose a metric, such as RMSE, in advance. The best-scoring baseline then becomes the benchmark against which ARIMA and other models are judged.
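The baseline comparison can be sketched in base R as follows. This is only an illustration under stated assumptions: the `volume` series is simulated, the data are assumed to be aggregated to monthly counts with a 12-month season, and the fourth baseline is the drift method (a random walk with drift) as in Hyndman & Athanasopoulos's simple methods.

```r
set.seed(42)

# Hypothetical data: 24 months of client counts with a yearly pattern
# (simulated for illustration, not real client data).
volume <- round(20 + 5 * sin(2 * pi * (1:24) / 12) + rnorm(24, sd = 2))

# 75/25 split in time order: first 18 months train, last 6 months test.
n_train <- floor(0.75 * length(volume))
train <- volume[1:n_train]
test <- volume[(n_train + 1):length(volume)]
h <- length(test)

# Mean: forecast every future month at the training average.
f_mean <- rep(mean(train), h)

# Naive: forecast every future month at the last training value.
f_naive <- rep(train[n_train], h)

# Seasonal naive: reuse the value from the same month one year earlier.
period <- 12
last_year <- train[(n_train - period + 1):n_train]
f_snaive <- last_year[((seq_len(h) - 1) %% period) + 1]

# Drift: extend the straight line joining the first and last training values.
slope <- (train[n_train] - train[1]) / (n_train - 1)
f_drift <- train[n_train] + slope * seq_len(h)

# Root mean squared error on the held-out test months.
rmse <- function(actual, forecast) sqrt(mean((actual - forecast)^2))

scores <- c(
  mean = rmse(test, f_mean),
  naive = rmse(test, f_naive),
  snaive = rmse(test, f_snaive),
  drift = rmse(test, f_drift)
)
print(sort(scores))

# The lowest-RMSE baseline becomes the benchmark for ARIMA and other models.
benchmark <- names(which.min(scores))
print(benchmark)
```

In practice you would probably not hand-roll these: the forecast package provides `meanf()`, `naive()`, `snaive()` and `rwf(drift = TRUE)`, and fable provides `MEAN()`, `NAIVE()`, `SNAIVE()` and `RW()` with a drift term. Note that with only 24 monthly observations, the seasonal naive method sees a single full year in training, so its test score rests on very little evidence.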

@technocrat thank you for providing the link. I would also like to predict the duration of the program (from client start date to client end date). The problem with the mean is that current durations range from 2 days to 600 days, so I have to look at the median instead. I might also have to look at the median client volume rather than the mean, since one of the 3 education providers has had significantly fewer clients over the past two fiscal years. Based on your expertise, would it be possible to look at the median, naive, seasonally naive and random walk?
In terms of modelling, would you suggest the ARIMA model as a good place to start? I would really appreciate your feedback. Thank you