I am using Random Forest for prediction. I have a training data set, "data1", with 1200 samples and 28 variables (elements), plus a labeled "classes" column that is set as a factor and has four classes (ultramafic, mafic, intermediate, and sediment). I have another data set, "data2", with 203 samples and the same 28 variables (elements), for prediction. The goal of this code is to train a Random Forest model on the 1200 samples in data1 and then classify the 203 unknown samples in data2 into the four groups (ultramafic, mafic, intermediate, and sediment).
The Random Forest model, "model4", was built successfully on data1: with "set.seed(1000)", 70% of the samples were used for training and 30% for validation. The "predict()" function works well on the validation set from data1 and returns an accuracy of 97.4%.
However, when I use "model4" to predict the unknown samples in data2, all the values returned by "predict()" are "NA". I used "table()" to show the results, and it also returns "NA".
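One check that may be relevant here: as far as I understand, "predict()" for a randomForest model returns "NA" for any row with a missing value in a predictor, so an all-"NA" result could mean the predictor columns of data2 were not read the same way as those of data1 (for example, read as character, or containing blanks). The sketch below uses only base R. The helper name `check_predictors` is hypothetical, and the toy data frames and their column names are made-up stand-ins for data1 and data2.

```r
# Hypothetical helper (not part of randomForest or caret): compare the
# predictor columns of a training frame and a new frame before predict().
check_predictors <- function(train, new, predictors) {
  data.frame(
    variable    = predictors,
    train_class = vapply(train[predictors], function(col) class(col)[1], character(1)),
    new_class   = vapply(new[predictors], function(col) class(col)[1], character(1)),
    new_na      = vapply(new[predictors], function(col) sum(is.na(col)), integer(1)),
    row.names   = NULL,
    stringsAsFactors = FALSE
  )
}

# Toy stand-ins for data1 and data2 (made-up values, two predictors only).
toy_train <- data.frame(SiO2 = c(45.1, 50.3, 60.2), MgO = c(30.5, 8.1, 2.4))
toy_new <- data.frame(SiO2 = c(47.0, NA, 58.9), MgO = c("25.0", "7.7", "n.d."),
                      stringsAsFactors = FALSE)

# One row per predictor: its class in each frame and the NA count in the new frame.
report <- check_predictors(toy_train, toy_new, c("SiO2", "MgO"))
print(report)
```

With the real data, and assuming the first 28 columns are the predictors (as in `data2[, 1:28]`), the call would be `check_predictors(data1, data2, names(data1)[1:28])`. Any row where the two classes differ, or where the NA count is above zero, would be worth a closer look.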
I checked other solutions and added
to match the levels; however, the problem remains. If I do not add this line, it returns "Error in confusionMatrix.default(final_prediction, as.factor(data2$classes)) :
the data cannot have more levels than the reference".
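As I understand the message, "confusionMatrix()" stops when the predicted factor has more levels than the reference factor. Since the data2 samples are unknown, the "classes" column there may be empty or hold placeholder values, in which case its levels would not match the four trained classes. A minimal base-R sketch of how the two sets of levels can be compared is below; the vectors and the "unknown" placeholder are made-up assumptions, not the real data.

```r
# Made-up stand-ins: the predictions carry the four trained classes,
# while the reference column of the unknown samples holds a placeholder.
trained_levels <- c("intermediate", "mafic", "sediment", "ultramafic")
toy_pred <- factor(c("mafic", "mafic", "sediment"), levels = trained_levels)
toy_ref <- as.factor(c("unknown", "unknown", "unknown"))

# Levels present in the predictions but absent from the reference.
missing_in_ref <- setdiff(levels(toy_pred), levels(toy_ref))
print(missing_in_ref)

# TRUE when the predictions have more levels than the reference,
# which is the situation the error message describes.
print(nlevels(toy_pred) > nlevels(toy_ref))
```

On the real objects the same comparison would be `setdiff(levels(final_prediction), levels(data2$classes))`.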
Please see my code and the datasets attached. Can anyone help me fix my code? Thanks a lot in advance.
The datasets in this code can be downloaded from the link:
Here is the code:
library(randomForest)  # randomForest()
library(caret)         # confusionMatrix()

data1 <- read.csv("train_data raw.csv", header = TRUE, sep = ",", stringsAsFactors = FALSE)
data2 <- read.csv("predict_data raw.csv", header = TRUE, sep = ",", stringsAsFactors = FALSE)
data1$classes <- as.factor(data1$classes)
data2$classes <- as.factor(data2$classes)
str(data1)
summary(data1)

# split into train and validation sets
# training set : validation set = 70:30 (random)
set.seed(1000)
train <- sample(nrow(data1), 0.7 * nrow(data1), replace = FALSE)
TrainSet <- data1[train, ]
ValidSet <- data1[-train, ]
summary(TrainSet)
summary(ValidSet)

# create a random forest model
model4 <- randomForest(classes ~ ., data = TrainSet, mtry = 4, ntree = 500,
                       importance = TRUE, proximity = TRUE)
model4

# predict on the validation set
predValid <- predict(model4, ValidSet, type = "class")
confusionMatrix(predValid, ValidSet$classes)

# check classification accuracy
mean(predValid == ValidSet$classes)

# predict on the unknown samples in data2
final_prediction <- predict(model4, data2[, 1:28])
table(final_prediction, data2$classes)
confusionMatrix(final_prediction, as.factor(data2$classes))