Problem with prediction after a randomForest

CoralineM · June 3, 2020, 7:32pm

Back again, but this time with a problem in my prediction following a randomforest analysis.
I know I'm close to the solution, but I can't find anything to fix the problem.

Explanations:
I'm dealing with patchy data (so with lots of NAs everywhere, because otherwise it would be too simple). I have fitted a randomForest which should help me predict whether my specimens are type A or B. On the randomForest side everything goes well, no worries. The trouble starts as soon as I try to run the predictions on a new data sample (called species_to_predict). I feel like the script gets stuck on the NAs rather than making the prediction without complaint.
I don't know if I'm being very clear, but here's the excerpt from the code:

> species.rf <- randomForest(species.imputed[,1:42], species$hyo_ortho)
> predicted = predict(species.rf, newdata = species_to_predict)

For the randomForest, only 42 of the 43 columns are selected, the last one being my famous A or B type (hyo_ortho), so that the dimensions match.
And so, if I run the script, unsurprisingly I get:

Error in predict.randomForest(species.rf, newdata = species_to_predict, : 
  missing values in newdata

How can I get the script to skip the NAs and still run the prediction on the rest?
Thank you in advance.

CoralineM · June 4, 2020, 10:49am

So, I tried to run the prediction on an imputed version of the new dataset as well (I had previously imputed the original dataset used to fit "species.rf").
But the problem is that I imputed the original dataset using my column [43], which is my discriminating character (type A or B). And I can't impute my new data with the same column, because it's precisely the character we are trying to predict. So I tried imputing with another character of my new data, one that has no missing values.

I have this error message :

> predicted = predict(species.rf, newdata = species_to_pred.imputed)
Error in predict.randomForest(species.rf, newdata = species_to_pred.imputed) : 
  variables in the training data missing in newdata

I'm really desperate... I don't understand how to fix my problem and get a prediction on a new dataset...

toryn_stat · June 4, 2020, 2:01pm

Is the only column with NAs the label column in species_to_predict? Does species_to_predict have the same column names as species.imputed[,1:42]?

predict works by matching column names to the names in the model, but if there are no column names, species_to_predict needs to have exactly the same dimensions as the design matrix of the model (42 columns).
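
A quick way to check both points, as a minimal sketch: the data frames below are tiny made-up stand-ins (the column names len, width and wdth are hypothetical, not from your data), so swap in species.imputed[,1:42] and species_to_predict. It lists predictors that the new data lacks, columns the model has never seen, and the number of NAs per column.

```r
# Toy stand-ins for the real data (hypothetical values and column names).
train_x <- data.frame(len = c(1.2, 2.3, 3.1), width = c(0.5, 0.7, 0.9))
species_to_predict <- data.frame(len = c(1.5, NA), wdth = c(0.6, 0.8))

# Predictors used for training that are absent from the new data.
missing_cols <- setdiff(names(train_x), names(species_to_predict))

# Columns of the new data that were not among the training predictors.
extra_cols <- setdiff(names(species_to_predict), names(train_x))

# Number of NAs in each column of the new data.
na_per_col <- colSums(is.na(species_to_predict))

print(missing_cols)
print(extra_cols)
print(na_per_col)
```

If missing_cols is not empty, that would explain a "variables in the training data missing in newdata" error; any non-zero entry in na_per_col would explain "missing values in newdata".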

CoralineM · June 4, 2020, 3:35pm

No, I have a rather patchy matrix in terms of data, since it is observational data. My two matrices species & species_to_predict have exactly the same column names for the columns [1:42]. I even redid the copy-paste just in case...

That's why I think the problem comes mainly from the NAs present in species_to_predict (those in the initial matrix were imputed upstream of the randomForest).

toryn_stat · June 4, 2020, 7:31pm

Yes, it will not predict if you have missing values. How best to handle those missing values is beyond my knowledge. I would recommend building an imputation method that doesn't rely on the class label.
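
One possible label-free approach, offered only as a rough sketch and not as the right answer for your data: fill each NA in the new data with the median (numeric columns) or the most frequent level (factor columns) taken from the training predictors. The helper impute_from_train and the toy columns len and form are made up for illustration; the randomForest package's na.roughfix does a similar median/mode fill, but computed on whatever data it is given.

```r
# Toy stand-ins (hypothetical values); the real data has 42 predictors.
train_x <- data.frame(
  len  = c(1.2, 2.3, 3.1, 2.8, 1.9, 3.4),
  form = factor(c("round", "flat", "flat", "round", "flat", "flat"))
)
train_y <- factor(c("A", "B", "B", "A", "B", "B"))
new_x <- data.frame(
  len  = c(NA, 2.0),
  form = factor(c("round", NA), levels = levels(train_x$form))
)

# Hypothetical helper: fill NAs in `new` with the training median (numeric)
# or the most frequent training level (factor). The class label is never used.
impute_from_train <- function(new, train) {
  for (col in names(train)) {
    gaps <- is.na(new[[col]])
    if (!any(gaps)) next
    if (is.numeric(train[[col]])) {
      new[[col]][gaps] <- median(train[[col]], na.rm = TRUE)
    } else {
      counts <- table(train[[col]])
      new[[col]][gaps] <- names(counts)[which.max(counts)]
    }
  }
  new
}

new_x_filled <- impute_from_train(new_x, train_x)
print(new_x_filled)

# Predict only if the randomForest package is installed.
if (requireNamespace("randomForest", quietly = TRUE)) {
  rf <- randomForest::randomForest(train_x, train_y)
  print(predict(rf, newdata = new_x_filled))
}
```

This is a crude fill, so it is worth checking how much the predictions move when the imputed values change.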

CoralineM · June 4, 2020, 8:04pm

Yes I'll try but I'm not very convinced since imputation is used as a benchmark for splitting the randomForest data.
I hope that someone here knows more about it than we do.

Thanks anyway for taking the time to answer

CoralineM · June 5, 2020, 2:55pm

I'm taking the liberty of bumping this thread.
