So if the partitions training + testing = all the data I have (historical + present, ordered chronologically ascending), what would I feed into the param newdata of function stats::predict() when doing classifications or regressions? Should I use one left-out, last dataset of allData for newdata?

Since this is a theoretical question, there is no real code for that, Max. I already asked Julia the question here, but her answer ("You would use the new data you get moving forward") goes in another direction.

I don't want 'to move'. I have a finite amount of data.

I'm searching for use cases of the param newdata in function stats::predict(): when to use it, with which input, and when not to use it (regressions/classifications).

I believe I already know a use case for forecasts, where I feed the param newdata with my last known dataset as input, to forecast one step ahead.
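A minimal sketch of that forecast use case, assuming a simple lag-1 regression on made-up data (allData and its column names here are my own invention for illustration):

```r
# Hedged sketch: one-step-ahead forecast via a lag-1 regression on toy data.
set.seed(42)
allData <- data.frame(y = cumsum(rnorm(100)))   # made-up ascending series
allData$y_lag1 <- c(NA, head(allData$y, -1))    # yesterday's value as predictor

fit <- lm(y ~ y_lag1, data = allData)           # lm() drops the leading NA row

# The last known observation becomes the predictor for the next, unseen step
one_step <- predict(fit, newdata = data.frame(y_lag1 = tail(allData$y, 1)))
```

Here newdata is a one-row data frame holding the last known value, so predict() returns a single number: the forecast for the step after the data ends.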

My assumption is that there are similar use cases for regressions and classifications.
I guess the projection of the outcome would then go into the present, not into the future.

If you have new data that has not been classified, and you want to classify it with your model, then you would use the newdata param of stats::predict() to achieve that.
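For instance, a hedged sketch using the built-in iris data, pretending the last 5 rows were never labeled (the versicolor-vs-rest framing is just for illustration):

```r
# Treat the last 5 iris rows as "new data that has not been classified"
labeled   <- iris[1:145, ]
unlabeled <- iris[146:150, setdiff(names(iris), "Species")]

# Simple logistic model: is this flower versicolor, or not?
fit <- glm(I(Species == "versicolor") ~ Petal.Length + Petal.Width,
           data = labeled, family = binomial)

# newdata carries the unclassified rows through the fitted model
probs <- predict(fit, newdata = unlabeled, type = "response")
```

The result is one probability per unlabeled row; thresholding those probabilities gives the classifications.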

It would be very helpful for you to mock this up with some code so that we can better understand the problem.

Thank you for the answer. Regarding your classification example:
my guess is, if all the data I have consists of "old data plus new data",
I would fit on "all data minus new data" and
predict/classify with newdata = new data... is this correct?
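A small sketch of that split as I picture it, with invented toy data (column names are assumptions):

```r
set.seed(1)
allData <- data.frame(x = 1:100, y = 2 * (1:100) + rnorm(100))

old_data <- allData[1:95, ]    # "all data minus new data" -> fit on this
new_data <- allData[96:100, ]  # the chronologically newest rows

fit   <- lm(y ~ x, data = old_data)
preds <- predict(fit, newdata = new_data)  # one prediction per new row
```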

Everything depends on context:
the context of your purpose, and the scenario you are in.
Are you in a context that requires you to update a model constantly with every new bit of info out in the world that might be available to update it with? Or are you in a context where you have to build a model 'once' and then use it, for a period of time, on new data?

Probably the latter is the more common setup; the former is rarer because it's relatively expensive, and not often worth the extra expense.

In the latter setup, i.e. the typical setup, you won't have 'new data' when you are fitting your model, because if you have it, it's not 'new', it's just your data. Whatever data gets gathered after you have done the work of fitting/building a model will be considered 'new'.

A caveat/expansion on what I've said:
it's typical, when developing a model, to have training/testing/validation data.
You might consider the validation data, which you have when you set out to model, to be 'new data': from the point of view of your model build process, which should consider only the training/testing datasets, the validation data is left aside, and so is 'new' to the build process.

I know, let's stay at stats::predict(newdata = ???).

Well, this would mean that in my case there is no new data, and I would not use the param newdata of stats::predict() at all?

My use case is the first one you mentioned. My data is momentum-related with lots of concept drift.

I guess so; if you don't need it, because you have what you need without it, then... you don't need it.
I'll just underline that unless you are Netflix, Amazon, Google or something of that nature, I would have really thought that the sort of ephemeral, always-being-updated algorithm is quite rare. I work in the financial industry, and models have to stick around long enough to be version controlled, auditable, etc. They very much do have well-defined build data, and new data will be streaming into the organisation without being part of the build process, which by necessity of practicality and industry practices has to draw a line on what is in the build / out of the build, and be fully documented etc.

I have features/predictors which may or may not even exist between build processes. So it's definitely the 'rare case'.

I hope there are some more people out there who could contribute to the question: 'What to do with stats::predict(newdata = ???)'

The last comment is: I guess we don't need the param newdata in some cases (especially when all the data consists of new data).

Considering the discussion, we can aggregate the following...

...main question:
Should we use the chronologically last dataset of all the data for the param newdata of stats::predict(), pretending it is new data, to generate a prediction?

We have no rule of thumb or clear answer for now.

If the simple facts are that

  1. you have data for which you lack predictions,
  2. you have a model you could use

Then use predict() with that data as the newdata param, and get those predictions you wanted.

Just for a clear understanding, what do you mean by 'the data'?

The things you want to have classifications of, that you don't already have classifications of.

In these data, is there any replication, or is this the temporally ordered data from one (company, field, clinic, ...)?

If this is (for example) stock prices from 100,000 companies, then you could train the model using 70% of the companies, validate the model with 20%, and test with the final 10%. You could randomly select the companies and look at how randomization influences outcomes. If these data are the customer flow data through the Acme Firecracker plant, then I do not see how to meaningfully randomize the data or break it up into training-validation-testing sets. Just to build a model, you could use the first 70% of the observations to train, the next 20% to validate, and the final 10% to test. However, you might be better off understanding the data through older methods like ANOVA, regression, or mean comparisons.

You say that you will use the model to predict outcomes for a time. Maybe newdata= is a way of defining how long that time should be. The model predicts X in 1 month. At 1 month we observe X2, and with the new data the model expects X3. When the difference between observed and expected exceeds some threshold, we retrain the model and start over.
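That retrain-on-threshold loop could be sketched like this (the toy data, the linear model, and the threshold value are all my own assumptions):

```r
set.seed(2)
history <- data.frame(t = 1:50, y = 0.2 * (1:50) + rnorm(50, sd = 0.3))
fit     <- lm(y ~ t, data = history)
threshold <- 1

observe_next <- function(t) 0.2 * t + rnorm(1, sd = 0.3)  # stand-in data source

for (t_new in 51:60) {
  expected <- predict(fit, newdata = data.frame(t = t_new))  # model's forecast
  observed <- observe_next(t_new)                            # what arrives later
  history  <- rbind(history, data.frame(t = t_new, y = observed))
  if (abs(observed - expected) > threshold) {
    fit <- lm(y ~ t, data = history)   # retrain the model and start over
  }
}
```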

Training, validation and testing are not part of this question.

We have a fitted model and want to know something about param newdata of function stats::predict().

Should we fit on 99% of our data and use the other 1% for param newdata?

Should we fit on 100% of our data and skip the param newdata of function stats::predict()?

We have, let's say, 100 data points, no future data, no nothing.
We trained, validated, tested.
And now we want to make a prediction in production (one step forward, in regressions and/or classifications).
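Under those constraints, a sketch of the production step (toy data; note that when you fit on 100% of the points, newdata is not a slice of your data but the one not-yet-observed step you construct yourself):

```r
set.seed(3)
allData <- data.frame(t = 1:100, y = 0.3 * (1:100) + rnorm(100))

fit <- lm(y ~ t, data = allData)   # fit on all 100 points

# newdata is constructed, not held out: the next chronological step
next_step <- predict(fit, newdata = data.frame(t = 101))
```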