I am trying to fit a linear regression to predict the sale price of cars from the imports85 dataset (shipped with the randomForest package). My code is as follows:

    library(tidyverse)
    library(rpart)
    library(rpart.plot)
    library(randomForest)   # provides the imports85 data
    library(fastDummies)
    library(naniar)

    data("imports85")
    db <- imports85
    View(db)
    db <- db[, -1]          # drop the first column
    db <- db[, -1]          # drop the next column (now first)

    set.seed(0)
    vis_miss(db)            # visualize missing values
    db <- na.omit(db)       # drop rows containing NAs
    vis_miss(db)

    # One-hot encode the categorical variables
    db2 <- dummy_cols(db,
      select_columns = c("make", "fuelType", "aspiration", "numOfDoors",
                         "bodyStyle", "driveWheels", "engineLocation",
                         "engineType", "numOfCylinders", "fuelSystem"),
      remove_first_dummy = TRUE,
      remove_selected_columns = TRUE)

    # 50/50 train/test split
    ind <- sample(2, nrow(db2), replace = TRUE, prob = c(0.5, 0.5))
    train2 <- db2[ind == 1, ]
    test2  <- db2[ind == 2, ]

    model <- lm(price ~ ., data = train2)
    summary(model)

    classPred2 <- predict(object = model, test2)
    classPred2
My first question is about model <- lm(price ~ ., data = train2). Since the data frame has many columns, I cannot write all of them on the right-hand side of the ~, so I used the dot instead. Does this mean I am using price to predict price? Should I somehow remove it from the right-hand side?
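If my understanding is correct (worth verifying), the dot in a formula expands to all columns of the data except the response, so price only appears on the left-hand side. A minimal check of this, using the built-in mtcars data instead of my own:

```r
# mpg ~ . expands to every column of mtcars except mpg itself
fit <- lm(mpg ~ ., data = mtcars)

# The response does not appear among the predictor terms
attr(terms(fit), "term.labels")   # "mpg" is absent from this vector
```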
My second question is about classPred2 <- predict(object = model, test2). I don't understand how the prediction works here, since test2 still includes the price column, which is exactly the variable I am trying to predict. Should I remove that column before calling predict()?
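My guess is that predict() only looks up the predictor columns in the new data and ignores any extra columns, including the response, but I'm not sure. A quick sketch of what I mean, again on mtcars:

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

# newdata still contains mpg, but predict() should only read wt and hp
preds_full    <- predict(fit, newdata = mtcars)
preds_no_resp <- predict(fit, newdata = mtcars[, c("wt", "hp")])
all.equal(preds_full, preds_no_resp)   # TRUE if mpg is indeed ignored
```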
Any answer is appreciated.