Problem with prediction after a randomForest

Back again, but this time with a problem in my prediction following a randomforest analysis.
I know I'm close to the solution, but I can't find anything to fix the problem.

Explanations:
I'm dealing with patchy data (so with lots of NAs everywhere, because otherwise it would be too simple). I have fitted a randomForest that should help me predict whether my specimens are type A or B. On the randomForest side everything goes well, no worries. The trouble starts as soon as I try to run the predictions on a new data sample (called species_to_predict). I get the impression that the script gets stuck on the NAs instead of simply making the prediction.
I don't know if I'm being very clear, but here's the excerpt from the code:

> species.rf <- randomForest(species.imputed[,1:42], species$hyo_ortho)
> predicted = predict(species.rf, newdata = species_to_predict)

For the randomForest, only 42 of the 43 columns are selected, the last one being my famous A or B type (hyo_ortho), in order to respect the dimensions.
And so, if I run the script, unsurprisingly I get:

Error in predict.randomForest(species.rf, newdata = species_to_predict, : 
  missing values in newdata

How can I get the script to skip the NAs and make the prediction from the rest of the data?
Thank you in advance.

So, I tried to do the prediction with an imputed new dataset as well (I previously had to impute my original dataset "species.rf").
But the problem is that I imputed my "species.rf" using column [43], which is my discriminating character (type A or B). And I can't impute my new data with the same column, because it's the character we are trying to predict. So I tried with another character of my new data that has no missing values.

I get this error message:

> predicted = predict(species.rf, newdata = species_to_pred.imputed)
Error in predict.randomForest(species.rf, newdata = species_to_pred.imputed) : 
  variables in the training data missing in newdata

I'm really getting desperate... I don't understand how to fix my problem and get a prediction on a new dataset...

No, I have a rather patchy matrix in terms of data, since it is observational data. My two matrices species & species_to_predict have exactly the same column names for columns [1:42]. I even redid the copy/paste just in case...

That's why I think the problem comes mainly from the NAs present in species_to_predict (those in the initial matrix were imputed upstream of the randomForest).

Yes, I'll try, but I'm not very convinced, since imputation is used as a benchmark for splitting the randomForest data.
I hope that someone here knows more about it than we do.

Thanks anyway for taking the time to answer :wink:

I'm taking the liberty of bumping this thread :crossed_fingers:

Is the only column with NAs the label column in species_to_predict? Does species_to_predict have the same column names as species.imputed[,1:42]?

Predict works by matching column names to the names in the model, but if there are no column names, species_to_predict needs to have exactly the same dimensions as the design matrix of the model (42 columns).
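
One way to check the column matching described above, as a minimal sketch in base R. The toy data frames `train_x` and `new_x` are hypothetical stand-ins for `species.imputed[,1:42]` and `species_to_predict`; their column names are made up for illustration.

```r
# Hypothetical stand-ins for species.imputed[, 1:42] and species_to_predict
train_x <- data.frame(len = c(1.2, 3.4, 2.2), width = c(0.5, 0.7, 0.6))
new_x   <- data.frame(width = c(0.4, NA), len = c(2.0, 1.1), extra = c("x", "y"))

# Predictors the model was trained on that are absent from the new data
missing_in_new <- setdiff(names(train_x), names(new_x))

# Keep only the training predictors, in the training column order
new_x_aligned <- new_x[, names(train_x), drop = FALSE]

# Count NAs per column to see which predictors would block the prediction
na_per_column <- colSums(is.na(new_x_aligned))
```

If `missing_in_new` is not empty, that would be consistent with the "variables in the training data missing in newdata" error; any non-zero entry in `na_per_column` would be consistent with "missing values in newdata".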

Yes, it will not predict if you have missing values. How best to handle those missing values is beyond my knowledge. I would recommend building an imputation method that doesn't rely on the class label.
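
A minimal sketch of one label-free imputation, assuming all predictors are numeric: fill each NA in the new data with the median of the same column in the training predictors. The data frames `train_x` and `new_x` are hypothetical stand-ins for `species.imputed[,1:42]` and `species_to_predict`, and median filling is only one possible choice, not necessarily the best one for this dataset.

```r
# Hypothetical stand-ins: train_x is the already-imputed training predictors,
# new_x is the new sample containing NAs (numeric columns only in this sketch)
train_x <- data.frame(len = c(1, 2, 3, 10), width = c(0.5, 0.7, 0.6, 0.8))
new_x   <- data.frame(len = c(NA, 4), width = c(0.9, NA))

# Column medians learned from the training predictors only (no class label)
train_medians <- vapply(train_x, median, numeric(1), na.rm = TRUE)

# Fill each NA in new_x with the training median of the same column
new_x_imputed <- new_x
for (col in names(train_medians)) {
  gaps <- is.na(new_x_imputed[[col]])
  new_x_imputed[[col]][gaps] <- train_medians[[col]]
}
```

The imputed data could then be passed as `predict(species.rf, newdata = new_x_imputed)`. The randomForest package also provides `na.roughfix()`, which fills NAs with column medians (or modes for factors) computed from the data passed to it, so it does not need the class label either.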