Hopefully some users have encountered this before... or @Max has some advice...
I have a fairly simple goal: predict a numeric value based on 3 numeric variables, namely coordinates (lat, lon) and day of the year (1:365). Simple enough, and
knnreg() is a perfect solution for my needs. It performs great (on the tiny chunks I feed it) and logically makes the most sense for the task (I'd do just that manually if I had a tiny dataset: find the closest neighbors and average their values).
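For context, the "find closest neighbors and average their values" idea can be sketched in a few lines of base R (this is just an illustration of the concept, not what caret::knnreg does internally; the function name knn_predict and the toy data are made up here):

```r
# Minimal base-R sketch of k-NN regression: for each query point,
# average the target values of the k nearest training points.
knn_predict <- function(train_x, train_y, query_x, k = 5) {
  apply(query_x, 1, function(q) {
    # Euclidean distance from this query row to every training row
    d <- sqrt(rowSums(sweep(train_x, 2, q)^2))
    # average the targets of the k nearest rows
    mean(train_y[order(d)[seq_len(k)]])
  })
}

# toy grid of "coordinates" with a simple additive target
train_x <- as.matrix(expand.grid(lat = 1:10, lon = 1:10))
train_y <- train_x[, "lat"] + train_x[, "lon"]
query   <- matrix(c(5, 5), ncol = 2)
knn_predict(train_x, train_y, query, k = 1)  # -> 10 (exact match in the grid)
```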
One problem: I was never able to run it in full.
knnreg() executes, but
predict() can't handle the amount of data.
- my full dataset is 1.9M rows
- 52K data points are missing and require prediction (final goal)
- a 25% test set would be about 480K rows
- the fitted model object shows as (5 elements, 198.7 Mb)
I can run it only on tiny sets of up to 5,000 rows.
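One workaround I've seen for predict() choking on a big newdata (if the bottleneck is the distance matrix it builds between all query rows and all training rows) is to predict in chunks of a size that does fit, and stitch the results back together. A hedged sketch, assuming a fitted model object fit whose predict() works on moderate-sized data frames (predict_chunked is a made-up helper name):

```r
# Sketch: predict a large newdata frame in chunks to cap peak memory.
predict_chunked <- function(fit, newdata, chunk_size = 5000) {
  # split row indices into consecutive chunks of chunk_size
  idx <- split(seq_len(nrow(newdata)),
               ceiling(seq_len(nrow(newdata)) / chunk_size))
  # predict each chunk separately and concatenate in order
  unlist(lapply(idx, function(i) predict(fit, newdata[i, , drop = FALSE])),
         use.names = FALSE)
}

# toy demonstration with lm(), just to show the mechanics
toy  <- data.frame(x = 1:100, y = 2 * (1:100))
fit  <- lm(y ~ x, data = toy)
preds <- predict_chunked(fit, toy, chunk_size = 7)
length(preds)  # -> 100
```

The per-chunk results come back in the original row order, so the stitched vector lines up with newdata.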
So I'm stalled on this first step, and that's before I even get to cross-validation, finding a proper
k, and on top of that I need to predict at least 4 more variables from the same 3 predictors.
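For the k-selection part, here is a self-contained base-R sketch of simple fold-based cross-validation over a grid of k values (again illustrative only; cv_rmse_for_k is a made-up name, and caret's train() would normally handle this):

```r
# Sketch: pick k for a k-NN regressor by 5-fold CV on RMSE, base R only.
# train_x: numeric matrix of predictors; train_y: numeric target vector.
cv_rmse_for_k <- function(train_x, train_y, ks, folds = 5) {
  # randomly assign each row to one of `folds` folds
  fold_id <- sample(rep(seq_len(folds), length.out = nrow(train_x)))
  sapply(ks, function(k) {
    errs <- sapply(seq_len(folds), function(f) {
      hold <- fold_id == f
      # predict each held-out row from the k nearest kept rows
      pred <- apply(train_x[hold, , drop = FALSE], 1, function(q) {
        d <- sqrt(rowSums(sweep(train_x[!hold, , drop = FALSE], 2, q)^2))
        mean(train_y[!hold][order(d)[seq_len(k)]])
      })
      sqrt(mean((pred - train_y[hold])^2))
    })
    mean(errs)  # average RMSE across folds for this k
  })
}

# toy data: target is a smooth function of two predictors
set.seed(42)
train_x <- cbind(runif(200), runif(200))
train_y <- train_x[, 1] + train_x[, 2]
rmse <- cv_rmse_for_k(train_x, train_y, ks = c(1, 3, 5))
```

The k with the smallest averaged RMSE would be the pick. This brute-force version has the same scaling problem as above, so on the full data it would also need to run on a subsample or in chunks.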
Is data size my problem? Should I pick a different algorithm for the job?