Random Forest: predict() returns only "NA" for my unknown samples

I am using Random Forest for classification. My training data set, "data1", has 1200 samples with 28 variables (elements) and a labeled "classes" column, set as a factor, with four classes (ultramafic, mafic, intermediate, and sediment). A second data set, "data2", has 203 samples with the same 28 variables (elements). The goal is to train a Random Forest model on the 1200 samples in data1 and use it to classify the 203 unknown samples in data2 into the same four groups.

The model, "model4", was built successfully on data1 with "set.seed(1000)", using 70% of the samples for training and 30% for validation. The "predict()" function works well on the validation set and returns an average accuracy of 97.4%.

However, when I use "model4" to predict the unknown samples in data2, all the predicted values are "NA". I used "table()" to show the results, and it also returns "NA".

I checked other solutions and added

levels(data2$classes) <- levels(TrainSet$classes)

to match the levels, but the problem remains. Without this line, it returns "Error in confusionMatrix.default(final_prediction, as.factor(data2$classes)) :
the data cannot have more levels than the reference".
Please see my code and the linked data sets below. Can anyone help me fix my code? Thanks a lot in advance.
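
For reference, here is a minimal base R sketch of what assigning levels does compared with re-creating the factor against the training levels. The vectors "train_classes" and "unknown" are made-up stand-ins, not the real columns, and I have not confirmed that data2$classes behaves the same way:

```r
# Hypothetical data: class labels from training vs. an unlabeled prediction set
train_classes <- factor(c("ultramafic", "mafic", "intermediate", "sediment"))
unknown <- factor(rep("unknown", 3))

# Assigning levels renames by position: the single level "unknown"
# becomes the first training level, whatever that happens to be
relabeled <- unknown
levels(relabeled) <- levels(train_classes)
print(table(relabeled, useNA = "ifany"))

# Re-creating the factor matches by label: any value that is not
# one of the training levels becomes NA
matched <- factor(as.character(unknown), levels = levels(train_classes))
print(table(matched, useNA = "ifany"))
```

So the levels assignment can silence the confusionMatrix() error without making the reference labels in data2 meaningful.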

The datasets in this code can be downloaded from the link:
https://drive.google.com/drive/folders/1Wn-4bXHw1vEYLhpJGQH9PmWC7pvKPSUz?usp=sharing
Here is the code:

# packages providing randomForest() and confusionMatrix()
library(randomForest)
library(caret)

data1 <- read.csv("train_data raw.csv", header = TRUE,
                  sep = ",", stringsAsFactors = FALSE)
data2 <- read.csv("predict_data raw.csv", header = TRUE,
                  sep = ",", stringsAsFactors = FALSE)
data1$classes <- as.factor(data1$classes)
data2$classes <- as.factor(data2$classes)
str(data1)
summary(data1)

#split into train and validation sets
# training set : validation set = 70:30(random)

set.seed(1000)
train <- sample(nrow(data1), 0.7*nrow(data1), replace = FALSE)
TrainSet <- data1[train,]
ValidSet <- data1[-train,]

summary(TrainSet)
summary(ValidSet)
#create a random forest model
model4 <- randomForest(classes ~ .,
                       data = TrainSet,
                       mtry = 4,
                       ntree = 500,
                       importance = TRUE,
                       proximity = TRUE
)
model4
#predict on the validation data set
predValid <- predict(model4, ValidSet, type = "class")
confusionMatrix(predValid, ValidSet$classes)
#check classification accuracy
mean(predValid == ValidSet$classes)

#predict on unknown samples
data2
final_prediction <- predict(model4, data2[, 1:28])
table(final_prediction, data2$classes)
confusionMatrix(final_prediction, as.factor(data2$classes))
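
Some checks that may help narrow this down, written as a minimal base R sketch. As far as I understand, predict() on a randomForest model gives NA for rows that have missing predictor values, so missing values and column types in data2 seem worth ruling out. The data frames "toy_train" and "toy_new" and their columns are invented stand-ins for TrainSet and data2[, 1:28], not the real data:

```r
# Hypothetical stand-ins for the training predictors and the new samples
toy_train <- data.frame(Si = c(45.1, 50.2, 61.0),
                        Mg = c(30.2, 8.4, 1.1))
toy_new <- data.frame(Si = c(44.0, NA, 58.7),
                      Mg = c("29.5", "7.9", "<0.1"),
                      stringsAsFactors = FALSE)

# Count missing values in each predictor column of the new samples
na_per_column <- colSums(is.na(toy_new))
print(na_per_column)

# Row numbers of the samples with at least one missing predictor
incomplete_rows <- which(!complete.cases(toy_new))
print(incomplete_rows)

# Columns whose type differs between the training and new data,
# e.g. a numeric column read as character because of text entries
train_types <- sapply(toy_train, class)
new_types <- sapply(toy_new, class)
type_mismatch <- names(train_types)[train_types != new_types[names(train_types)]]
print(type_mismatch)
```

On the real data, the same calls would be run on data2[, 1:28] and on the matching columns of TrainSet.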
