Random Forest regression on Titanic dataset

Hey! I'm trying to do a random forest regression on the Titanic data set, just a basic one to generate a model. However, this error keeps popping up every time I try to run the model.

setwd("~/Desktop/Commbank")
titanic.train <- read.csv(file = "train.csv", stringsAsFactors = FALSE, header = TRUE)
titanic.test <- read.csv(file = "test.csv", stringsAsFactors = FALSE, header = TRUE)
titanic.test$IsTrainSet <- TRUE
titanic.train$IsTrainSet <- FALSE
titanic.test$Survived <- NA
titanic.full <- rbind(titanic.train, titanic.test)
str(titanic.full$Embarked)

table(is.na(titanic.full$Embarked))
titanic.full[titanic.full$Embarked == ' ', "Embarked"] <- 'S'
titanic.full$Embarked <- as.factor(titanic.full$Embarked)
titanic.full$Pclass <- as.factor(titanic.full$Pclass)
titanic.full$Sex <- as.factor(titanic.full$Sex)
titanic.full$Survived <- as.factor(titanic.full$Survived)
str(titanic.full)
boxplot(titanic.full$Fare)
boxplot.stats(titanic.full$Fare)
upper_bound <- boxplot.stats(titanic.full$Fare)$stats[5]
outlier_removal <- titanic.full$Fare < upper_bound
titanic.full[outlier_removal, ]
fare.equation = "Fare ~ Pclass + Sex + Age + SibSp + Parch + Embarked"
fare.model <- lm(
  formula = fare.equation,
  data = titanic.full[outlier_removal, ]
)

fare.row <- titanic.full[is.na(titanic.full$Fare), c("Pclass", "Sex", "Age", "SibSp", "Parch", "Embarked")]

fare.pred <- predict(fare.model, newdata = fare.row)
titanic.full[is.na(titanic.full$Fare), "Fare"] <- fare.pred
titanic.full[1044, ]
str(titanic.full)
library(rpart)
Agefit <- rpart(Age ~ Pclass + Sex + SibSp + Parch + Fare + Embarked,
                data = titanic.full[!is.na(titanic.full$Age), ],
                method = "anova")

Age.row <- titanic.full[is.na(titanic.full$Age), c("Pclass", "Sex", "Fare", "SibSp", "Parch", "Embarked")]

Age.pred <- predict(Agefit, newdata = Age.row)
titanic.full[is.na(titanic.full$Age), "Age"] <- Age.pred
table(is.na(titanic.full$Age))
titanic.train <- titanic.full[titanic.full$IsTrainSet==TRUE,]
titanic.test <- titanic.full[titanic.full$IsTrainSet==FALSE,]

titanic.train$Survived <- as.factor(titanic.train$Survived)
table(is.na(titanic.train$Age))
titanic.train$Age

survived.equation <- "Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked"
survived.formula <- as.formula(survived.equation)
library(randomForest)
titanic.model <- randomForest(formula = survived.formula, data = titanic.train, ntree = 500, mtry = 3, nodesize = 0.01 * nrow(titanic.test))

The error message reads:
Error in na.fail.default(list(Survived = c(NA_integer_, NA_integer_, NA_integer_, : missing values in object

Would really appreciate any help.
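
One possible cause, judging only from the code posted above: the IsTrainSet flags look swapped. titanic.test is marked TRUE and titanic.train is marked FALSE, so the later subset on IsTrainSet==TRUE would pick up the test rows, whose Survived column was set to NA. That would match the "missing values in object" message. Here is a minimal sketch of the effect with made-up toy data frames (toy.train and toy.test are hypothetical stand-ins, not the real CSV files):

```r
# Hypothetical toy data standing in for train.csv / test.csv
toy.train <- data.frame(Age = c(22, 38, 26), Survived = c(0, 1, 1))
toy.test  <- data.frame(Age = c(34, 47))

# Flags assigned the same way as in the post: test = TRUE, train = FALSE
toy.test$IsTrainSet  <- TRUE
toy.train$IsTrainSet <- FALSE
toy.test$Survived    <- NA
toy.full <- rbind(toy.train, toy.test)  # rbind matches columns by name

# Subsetting on IsTrainSet == TRUE now returns the test rows
swapped.train <- toy.full[toy.full$IsTrainSet == TRUE, ]
swapped.all.na <- all(is.na(swapped.train$Survived))
print(swapped.all.na)

# With the flags the other way round, the response has no missing values
toy.full$IsTrainSet <- !toy.full$IsTrainSet
fixed.train <- toy.full[toy.full$IsTrainSet == TRUE, ]
fixed.has.na <- anyNA(fixed.train$Survived)
print(fixed.has.na)
```

If that is what is happening, setting titanic.train$IsTrainSet <- TRUE and titanic.test$IsTrainSet <- FALSE before the rbind should give randomForest a Survived column without NAs. This is only a guess from reading the post, so it is worth checking table(is.na(titanic.train$Survived)) just before the randomForest call.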
