How does predict.glm deal with dummy variables?

Hello colleagues,
I am trying to run a logistic regression on the well-known Titanic dataset. The data has already been split into train.data and test.data, and I fitted the model with the following formula:

logit.mod.2 <- glm(survived~sex,data=train.data,family=binomial)
summary(logit.mod.2)
Call:
glm(formula = survived ~ sex, family = binomial, data = train.data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.6824  -0.6705  -0.6705   0.7459   1.7904  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   1.1371     0.1394   8.159 3.38e-16 ***
sexmale      -2.5151     0.1822 -13.807  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 992.97  on 731  degrees of freedom
Residual deviance: 764.46  on 730  degrees of freedom
AIC: 768.46

Number of Fisher Scoring iterations: 4

Because glm() is a formula-based method, it has automatically created a dummy variable (sexmale) for the predictor sex. Next, I use the model to predict on test.data.

logit.mod.2.preds_test <- predict(logit.mod.2,test.data, type="response")
predicted_label_test_2 <- ifelse(logit.mod.2.preds_test>0.5,1,0)
table(predicted_label_test_2,test.data$survived)

Now I have two questions:

  • First, the test data does not have a variable named sexmale. How does the model know which variable to use for prediction?

  • Second, would it help to drop every column except sex from the test set? In Python, for example, we use a train-test split that keeps the predictors and the predicted variable separate.

Could I get some feedback? Any help is appreciated. Thanks.

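If you call str() on your fitted model object, part of the output looks like this (the example is from my own fit to a different copy of the Titanic data, so the variable names and levels differ from yours):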

str(logit.mod.2)
.
.
.
$ model            :'data.frame':	2201 obs. of  2 variables:
  ..$ Survived: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
  ..$ Sex     : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 2 ...
  ..- attr(*, "terms")=Classes 'terms', 'formula'  language Survived ~ Sex

The names of the factors used in the fit and their levels are stored in the model object (the levels are also kept in its xlevels component), so predict() can apply the same dummy coding to new data later.
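To make this concrete, here is a minimal sketch of where that information lives. The toy_train data frame and its values are made up for illustration and only stand in for train.data; the column names follow the question.

```r
# Toy stand-in for train.data (made-up values, not the real Titanic data)
toy_train <- data.frame(
  survived = c(1, 1, 0, 1, 0, 0, 1, 0),
  sex      = factor(c("female", "female", "female", "female",
                      "male", "male", "male", "male"))
)

fit <- glm(survived ~ sex, data = toy_train, family = binomial)

# Factor levels recorded at fit time; predict() reuses them for new data
fit$xlevels

# The formula refers to the column `sex`, not to a column called `sexmale`
formula(fit)

# `sexmale` exists only as a coefficient name: factor name + non-reference level
names(coef(fit))
```

The reference level ("female" here, because it sorts first) is absorbed into the intercept, which is why only sexmale shows up in the coefficient table.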


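First question: sexmale is the name of a coefficient, not of a column, so the test data does not need a variable with that name. Just as you don't have to explicitly dummy-code categorical columns when creating the model, you also don't have to explicitly dummy-code the data you provide to predict. predict(logit.mod.2, test.data, type="response") looks up the sex column in test.data, applies the same dummy coding that was used in the fit, and returns the predicted probability of survived for each observation (each row) based on whatever the value of sex happens to be in that row.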
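If it helps, this can be checked by hand. The sketch below uses made-up toy data (toy_train and toy_test are hypothetical stand-ins for train.data and test.data) and compares predict() with a manual calculation in which the dummy variable is built explicitly.

```r
# Toy stand-ins for train.data and test.data (made-up values)
toy_train <- data.frame(
  survived = c(1, 1, 0, 1, 0, 0, 1, 0),
  sex      = factor(c("female", "female", "female", "female",
                      "male", "male", "male", "male"))
)
toy_test <- data.frame(
  sex = factor(c("male", "female"), levels = c("female", "male"))
)

fit <- glm(survived ~ sex, data = toy_train, family = binomial)

# predict() does the dummy coding internally from the `sex` column
p_predict <- predict(fit, newdata = toy_test, type = "response")

# Same thing by hand: build the 0/1 dummy and apply the inverse logit
b         <- coef(fit)
sexmale   <- as.numeric(toy_test$sex == "male")
p_manual  <- plogis(b[["(Intercept)"]] + b[["sexmale"]] * sexmale)

# Should be TRUE: both routes give the same probabilities
all.equal(unname(p_predict), p_manual)
```

One thing to watch for: if the test data contains a level of sex that was not present in the training data, predict() stops with an error rather than guessing.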

Second question: No, this is not necessary. predict needs a data frame with the predictor variables used to fit the model, but it's okay if other, unused columns are included as well.
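A small sketch of that behaviour, again on made-up toy data: the age and fare columns below are hypothetical extras that the model never saw, and the predictions come out the same with or without them.

```r
# Toy stand-in for train.data (made-up values)
toy_train <- data.frame(
  survived = c(1, 1, 0, 1, 0, 0, 1, 0),
  sex      = factor(c("female", "female", "female", "female",
                      "male", "male", "male", "male"))
)
fit <- glm(survived ~ sex, data = toy_train, family = binomial)

# Test data with extra, unused columns (hypothetical values)
toy_test_wide <- data.frame(
  sex  = factor(c("male", "female"), levels = c("female", "male")),
  age  = c(22, 38),
  fare = c(7.25, 71.28)
)

p_wide   <- predict(fit, newdata = toy_test_wide, type = "response")
p_narrow <- predict(fit, newdata = toy_test_wide["sex"], type = "response")

# Should be TRUE: the unused columns are simply ignored
all.equal(p_wide, p_narrow)

# Leaving out a predictor the model needs is an error instead
missing_sex <- try(predict(fit, newdata = toy_test_wide["age"]), silent = TRUE)
inherits(missing_sex, "try-error")
```

So there is no need to split predictors and response apart before calling predict(); keeping the response column in test.data is also convenient for the table() comparison afterwards.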


Thank you for the responses, colleagues. They make sense now.

Thanks for this idea. I didn't know earlier that the model object stores this information.
