Random Forest - Variable lengths differ

Hello you all!

I'm trying to fit a random forest and then use the predict function to assess the accuracy of the model.
I have a training database with 7397 rows x 13 features
and a validation database with 2468 rows x 13 features.

I first ran the randomForest function on the training database without any problem, but when I try to predict and assess the accuracy on the validation database I get this error:

Error in model.frame.default(Terms, newdata, na.action = na.omit) : 
  variable lengths differ (found for 'Administrative')
In addition: Warning message:
'newdata' has 2468 rows but the variable found has 7397 rows

So I trained on a subset of the training database, a sample of 2468 rows (the same length as the validation database), but I still got the same error.

n_v <- 2468
train_2 <- sample(1:nrow(online_shoppers_intention_train), n_v)

OSI.ran.forest.3 <- randomForest(Revenue ~ ., data = online_shoppers_intention_train,
                                 subset = train_2, mtry = 12, importance = TRUE)

yhat.OSI <- predict(OSI.ran.forest.3, newdata = validation_db)

Neither database has any missing values; I have already checked.

It's hard to debug/offer advice without a reproducible example / access to your data.

The general pattern you are attempting should work. I'm not sure what I would have to alter in my example data to produce the error you quote.

library(randomForest)
set.seed(71)
n_v<-60
train_2 = sample(1:nrow(iris), n_v)


iris.rf <- randomForest(Species ~ ., data=iris, mtry=3,
                        importance=TRUE,subset = train_2)

validation <- iris[setdiff(1:nrow(iris),train_2),]

(yhat = predict(iris.rf, newdata=validation))
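For what it's worth, here is one way an error like the one quoted can arise. This is only a sketch, and it uses base R's lm rather than randomForest so that it runs without extra packages; the traceback in the question shows predict calling model.frame on newdata, and predict.lm does the same. The data frames, the misnamed column B and the stray workspace vector b are all made up for illustration. When a variable from the formula is not a column of newdata, model.frame looks for it in the environment of the formula instead, and if it finds an object of that name with a different number of elements, the lengths clash.

```r
# Toy training data; every name here is invented for illustration.
train <- data.frame(y = c(1.1, 1.9, 3.2, 3.8, 5.1, 6.0),
                    a = c(1, 2, 3, 4, 5, 6),
                    b = c(2, 1, 4, 3, 6, 5))
fit <- lm(y ~ ., data = train)

# Assumption: a stray vector named like one of the predictors exists in the
# workspace (for example left over from attach() or earlier experiments).
b <- train$b

# Validation data in which the column is called "B" rather than "b".
validation <- data.frame(a = c(7, 8), B = c(8, 7))

# predict() cannot find "b" in newdata, picks up the 6-element workspace
# vector instead, and fails because "a" only has 2 rows.
res <- tryCatch(predict(fit, newdata = validation), error = function(e) e)
print(conditionMessage(res))
```

If this is what is happening, the column reported in the error message is one whose name in the validation data does not exactly match the name used when training.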

Maybe it's because the validation database comes from a separate file (it was given to me by my professor) rather than from the training database? They both come from the same dataset, which the professor split into three: train, validation and test.
Train data: about 60% of the units of the original dataset
Validation data: about 20% of the units of the original dataset
Test data: about 20% of the units of the original dataset
I found that one column name in the validation database was different from the one in the training database. I fixed it, but I still get the same error: now "variable lengths differ" is reported for "Month", and I can't really understand what is going on. This is driving me crazy, I've been trying to fix it for hours.
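Since the error names only one column at a time, it may be quicker to list every mismatch between the two data frames in one go. The sketch below uses two tiny made-up data frames (train_db and validation_db here are hypothetical stand-ins, with a deliberately misnamed "month" column); with the real data you would pass online_shoppers_intention_train and validation_db instead.

```r
# Hypothetical stand-ins for the real training and validation data.
train_db <- data.frame(Administrative = c(0, 1, 2),
                       Month = c("Feb", "Mar", "May"),
                       Revenue = c(TRUE, FALSE, TRUE))
validation_db <- data.frame(Administrative = c(3, 4),
                            month = c("Feb", "Nov"),
                            Revenue = c(FALSE, TRUE))

# Columns the model was trained on that the validation data lacks,
# and columns that only exist in the validation data.
missing_in_validation <- setdiff(names(train_db), names(validation_db))
extra_in_validation <- setdiff(names(validation_db), names(train_db))
print(missing_in_validation)
print(extra_in_validation)

# For columns present in both, check that the classes agree as well.
shared <- intersect(names(train_db), names(validation_db))
train_cls <- vapply(train_db[shared], function(x) class(x)[1], character(1))
valid_cls <- vapply(validation_db[shared], function(x) class(x)[1], character(1))
class_mismatch <- shared[train_cls != valid_cls]
print(class_mismatch)
```

Names are compared exactly, so differences in case, trailing spaces or punctuation all show up as mismatches.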

Could you install skimr:

install.packages("skimr")

then run

skimr::skim(name_of_your_dataset)

on each of your two datasets and share the text output here?
