Hello colleagues,
I am trying to run a logistic regression on the well-known Titanic
dataset. The data has already been split into train.data and test.data.
I then fitted the model with the following formula:
logit.mod.2 <- glm(survived~sex,data=train.data,family=binomial)
summary(logit.mod.2)
Call:
glm(formula = survived ~ sex, family = binomial, data = train.data)
Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.6824  -0.6705  -0.6705   0.7459   1.7904
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   1.1371     0.1394   8.159 3.38e-16 ***
sexmale      -2.5151     0.1822 -13.807  < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 992.97 on 731 degrees of freedom
Residual deviance: 764.46 on 730 degrees of freedom
AIC: 768.46
Number of Fisher Scoring iterations: 4
We see that, since glm is a formula-based method, it has automatically created a dummy variable for the predictor sex. Next, we use the model to predict on test.data.
logit.mod.2.preds_test <- predict(logit.mod.2,test.data, type="response")
predicted_label_test_2 <- ifelse(logit.mod.2.preds_test>0.5,1,0)
table(predicted_label_test_2,test.data$survived)
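To illustrate the dummy coding mentioned above, here is a minimal sketch on a made-up toy data frame (the data frame toy and its values are hypothetical, not the Titanic data). It uses model.matrix to show the column that the formula interface derives from the factor sex:

```r
# Toy data frame (made up for illustration; not the Titanic data)
toy <- data.frame(
  survived = c(1, 0, 1, 0, 1, 0),
  sex      = factor(c("female", "male", "female", "male", "male", "female"))
)

# The formula interface expands the factor `sex` into a 0/1 dummy column.
# "female" is the first factor level, so it becomes the reference level
# and only a column for "male" is created.
mm <- model.matrix(~ sex, data = toy)
colnames(mm)

# The same column name shows up among the coefficients of a fitted model
toy_fit <- glm(survived ~ sex, data = toy, family = binomial)
names(coef(toy_fit))
```

So, as far as I understand, sexmale is a name derived from the sex column rather than a column that has to exist in the data itself.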
Now I have two questions about this:
- First, the test data does not have any variable named sexmale. How will the model know which variable to use for prediction?
- Second, would it be helpful to drop all the columns from the test set except sex? In Python, for example, we use train_test_split, where the predictors and the predicted variable are kept separate.
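For the second question, this is a minimal sketch of the comparison I have in mind, again on made-up data (toy_train, toy_test and the extra column age are hypothetical). It predicts once with all columns present and once with only sex, then compares the two results:

```r
# Made-up toy data; `age` is a hypothetical extra column for illustration
toy_train <- data.frame(
  survived = c(1, 0, 1, 0, 1, 0),
  sex      = factor(c("female", "male", "female", "male", "male", "female"))
)
toy_test <- data.frame(
  sex = factor(c("male", "female"), levels = levels(toy_train$sex)),
  age = c(30, 25)
)

fit <- glm(survived ~ sex, data = toy_train, family = binomial)

# Predict with every column present, then with only the `sex` column
p_all_cols <- predict(fit, toy_test, type = "response")
p_sex_only <- predict(fit, toy_test["sex"], type = "response")

# Compare the two sets of predicted probabilities
all.equal(p_all_cols, p_sex_only)
```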
Can I kindly get some feedback? Any help is appreciated. Thanks.