How does glm.predict deal with dummy variables?

Hello colleagues,
I am trying to run a logistic regression on the well-known Titanic dataset. The data has already been split into training and test sets. We have fitted the model with the following formula:

logit.mod.2 <- glm(survived~sex,,family=binomial)
glm(formula = survived ~ sex, family = binomial, data =

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.6824  -0.6705  -0.6705   0.7459   1.7904  

            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   1.1371     0.1394   8.159 3.38e-16 ***
sexmale      -2.5151     0.1822 -13.807  < 2e-16 ***
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 992.97  on 731  degrees of freedom
Residual deviance: 764.46  on 730  degrees of freedom
AIC: 768.46

Number of Fisher Scoring iterations: 4

We see that, since this is a formula-based method, it has automatically created a dummy variable for the predictor, sex. Next we use the model to predict on the test data:

logit.mod.2.preds_test <- predict(logit.mod.2,, type="response")
predicted_label_test_2 <- ifelse(logit.mod.2.preds_test>0.5,1,0)

Now I have two questions about this:

  • First, the test data does not have any variable named sexmale. How will the model know which variable to use for prediction?

  • Second, would it help to remove all the columns from the test set except sex? For example, in Python we use train_test_split, which separates the predictors from the predicted variable.

Could I kindly get some feedback? Any help is appreciated. Thanks.

If you use the str() function with your fit object, you will find a part of the output that looks like this

$ model            :'data.frame':	2201 obs. of  2 variables:
  ..$ Survived: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
  ..$ Sex     : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 2 ...
  ..- attr(*, "terms")=Classes 'terms', 'formula'  language Survived ~ Sex

The names of the factors used in the fit and their levels are stored and can be used in later predictions.
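As a rough illustration of where that information lives, the sketch below fits the same kind of model on a small made-up data frame (the name train and its values are hypothetical, not the Titanic data) and pulls out the stored pieces directly instead of reading through the str() output.

```r
# Hypothetical toy data standing in for the training set
train <- data.frame(
  survived = c(1, 1, 0, 1, 0, 0, 0, 1, 0, 1),
  sex      = factor(c("female", "female", "female", "female", "male",
                      "male", "male", "male", "male", "female"))
)
fit <- glm(survived ~ sex, data = train, family = binomial)

# Factor levels recorded at fit time; predict() reuses them for new data
print(fit$xlevels)

# Contrast scheme used to dummy-code each factor
print(fit$contrasts)

# Predictor variables that new data must contain
needed <- all.vars(delete.response(terms(fit)))
print(needed)
```

The first level of the factor is the reference level, which is why the coefficient table only shows a row for the other level.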

First question: Just as you don't have to explicitly dummy-code categorical columns when creating the model, you also don't have to explicitly dummy-code the data you provide to predict. The model looks for a column named sex in the new data, not sexmale, and rebuilds the dummy column from it using the factor levels stored in the fit. predict(logit.mod.2,, type="response") provides the predicted probability of survived for each observation (each row) in the data it is given, based on whatever the value of sex happens to be in each row.
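To make the dummy-coding step visible, here is a minimal sketch on made-up data (train, test and their values are hypothetical). It rebuilds the design matrix for the new data by hand, roughly the way predict does internally, and checks that the manual probabilities match what predict returns.

```r
# Hypothetical toy data standing in for the training and test sets
train <- data.frame(
  survived = c(1, 1, 0, 1, 0, 0, 0, 1, 0, 1),
  sex      = factor(c("female", "female", "female", "female", "male",
                      "male", "male", "male", "male", "female"))
)
test <- data.frame(sex = factor(c("male", "female", "male")))

fit <- glm(survived ~ sex, data = train, family = binomial)

# Right-hand side of the formula, without the response
tt <- delete.response(terms(fit))

# Expand sex into the sexmale dummy column using the stored levels
mf    <- model.frame(tt, test, xlev = fit$xlevels)
X_new <- model.matrix(tt, mf, contrasts.arg = fit$contrasts)
print(colnames(X_new))

# Linear predictor pushed through the inverse logit
p_manual  <- plogis(drop(X_new %*% coef(fit)))

# What predict() returns for the same rows
p_predict <- predict(fit, newdata = test, type = "response")
print(all.equal(unname(p_manual), unname(p_predict)))
```

The test data only needs a sex column; the sexmale column exists only inside the design matrix.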

Second question: No, this is not necessary. predict needs a data frame with the predictor variables used to fit the model, but it's okay if other, unused columns are included as well.
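A small check of this, again on made-up data (train, test_full and the extra columns age and fare are hypothetical): predictions from a data frame with extra columns match those from a data frame holding only sex, while dropping sex itself makes predict fail.

```r
# Hypothetical toy data standing in for the training set
train <- data.frame(
  survived = c(1, 1, 0, 1, 0, 0, 0, 1, 0, 1),
  sex      = factor(c("female", "female", "female", "female", "male",
                      "male", "male", "male", "male", "female"))
)
fit <- glm(survived ~ sex, data = train, family = binomial)

# Test data with columns the model never used
test_full <- data.frame(
  sex  = factor(c("male", "female")),
  age  = c(30, 25),
  fare = c(7.25, 71.28)
)

# Extra columns are ignored
p_full <- predict(fit, newdata = test_full, type = "response")
p_sex  <- predict(fit, newdata = test_full["sex"], type = "response")
print(all.equal(p_full, p_sex))

# Removing the predictor itself is what breaks prediction
res <- try(predict(fit, newdata = test_full[c("age", "fare")]), silent = TRUE)
print(inherits(res, "try-error"))
```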

Thank you for the responses, colleagues. They make sense now.

Thanks for this idea. I didn't know about it earlier, regarding the model.

