# How does predict.glm deal with dummy variables?

Hello colleagues,
I am trying to run a logistic regression on the well-known `titanic` dataset. The data has already been split into `train.data` and `test.data`, and I have fitted the model with the following formula:

```
logit.mod.2 <- glm(survived ~ sex, data = train.data, family = binomial)
summary(logit.mod.2)

Call:
glm(formula = survived ~ sex, family = binomial, data = train.data)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.6824  -0.6705  -0.6705   0.7459   1.7904

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   1.1371     0.1394   8.159 3.38e-16 ***
sexmale      -2.5151     0.1822 -13.807  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 992.97  on 731  degrees of freedom
Residual deviance: 764.46  on 730  degrees of freedom
AIC: 768.46

Number of Fisher Scoring iterations: 4
```

We see that, since this is a formula-based method, it has automatically created a dummy variable, `sexmale`, for the predictor `sex`. Next we use the model to predict on `test.data`.

```
logit.mod.2.preds_test <- predict(logit.mod.2, test.data, type = "response")
predicted_label_test_2 <- ifelse(logit.mod.2.preds_test > 0.5, 1, 0)
table(predicted_label_test_2, test.data$survived)
```

Now I have two questions about this:

• First, the test data does not have any variable named `sexmale`. How will the model know which variable to use for prediction?

• Second, would it help to drop all columns from the test set except `sex`? In Python, for example, we use `train_test_split`, which separates the predictors from the predicted variable.

Can I kindly get some feedback? Help is appreciated. Thanks.

If you use the `str()` function on your fit object, you will find a part of the output that looks like this:

```
str(logit.mod.2)
.
.
.
 $ model            :'data.frame':	2201 obs. of  2 variables:
  ..$ Survived: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
  ..$ Sex     : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 2 ...
  ..- attr(*, "terms")=Classes 'terms', 'formula'  language Survived ~ Sex
```

The names of the factors used in the fit and their levels are stored and can be used in later predictions.
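
For a more direct look at what is stored, here is a minimal sketch. It assumes a small simulated data set in place of the real Titanic data (the values and the object name `fit` are made up for illustration); only the column names `sex` and `survived` follow the question.

```r
# Toy stand-in for the Titanic training data (simulated, not the real data).
set.seed(1)
train.data <- data.frame(
  sex = factor(sample(c("female", "male"), 200, replace = TRUE))
)
# Simulate survival with a higher rate for females than for males.
train.data$survived <- rbinom(
  200, 1, ifelse(train.data$sex == "female", 0.75, 0.2)
)

fit <- glm(survived ~ sex, data = train.data, family = binomial)

# Factor levels recorded at fit time; predict() reuses them on new data.
fit$xlevels

# Contrast scheme recorded for each factor (treatment coding by default).
fit$contrasts

# Column names of the design matrix; these match the coefficient names.
colnames(model.matrix(fit))
```

The first level, `female`, is the reference level, which is why only a `sexmale` column appears next to the intercept.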


First question: Just as you don't have to explicitly dummy-code categorical columns when creating the model, you also don't have to explicitly dummy-code the data you provide to `predict`. `predict(logit.mod.2, test.data, type="response")` provides the predicted probability of `survived` for each observation (each row) in `test.data` based on whatever the value of `sex` happens to be in each row.
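
To make this concrete, the sketch below compares `predict` with a manual calculation from the coefficients. It again assumes simulated data rather than the real Titanic data, and the names `fit`, `p_predict` and `p_manual` are only illustrative.

```r
# Toy stand-in for the Titanic training data (simulated, not the real data).
set.seed(1)
train.data <- data.frame(
  sex = factor(sample(c("female", "male"), 200, replace = TRUE))
)
train.data$survived <- rbinom(
  200, 1, ifelse(train.data$sex == "female", 0.75, 0.2)
)
fit <- glm(survived ~ sex, data = train.data, family = binomial)

# New data has a plain `sex` column; there is no `sexmale` column anywhere.
test.data <- data.frame(sex = factor(c("male", "female", "male")))
p_predict <- predict(fit, test.data, type = "response")

# What predict() does internally, written out by hand:
# the dummy is 1 for "male" and 0 for the reference level "female".
b <- coef(fit)
is_male <- as.numeric(test.data$sex == "male")
p_manual <- plogis(b[["(Intercept)"]] + b[["sexmale"]] * is_male)

# Both routes should give the same probabilities.
all.equal(unname(p_predict), p_manual)
```

Applied to the coefficients in the question, the same arithmetic gives `plogis(1.1371)`, roughly 0.76, for females and `plogis(1.1371 - 2.5151)`, roughly 0.20, for males.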

Second question: No, this is not necessary. `predict` needs a data frame with the predictor variables used to fit the model, but it's okay if other, unused columns are included as well.
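
A quick way to check this yourself is sketched below, once more on simulated data (the `age` column and all values are hypothetical and only there to act as an unused column).

```r
# Toy stand-in for the Titanic training data (simulated, not the real data).
set.seed(1)
train.data <- data.frame(
  sex = factor(sample(c("female", "male"), 200, replace = TRUE))
)
train.data$survived <- rbinom(
  200, 1, ifelse(train.data$sex == "female", 0.75, 0.2)
)
fit <- glm(survived ~ sex, data = train.data, family = binomial)

# Test set with extra columns that the model never used.
test.data <- data.frame(
  sex      = factor(c("male", "female")),
  age      = c(30, 25),
  survived = c(0, 1)
)

p_all_columns <- predict(fit, test.data, type = "response")
p_sex_only    <- predict(fit, test.data["sex"], type = "response")

# Extra columns are ignored, so both calls agree.
all.equal(p_all_columns, p_sex_only)

# Leaving out a predictor the model needs is an error instead.
missing_sex <- tryCatch(
  predict(fit, data.frame(age = 30), type = "response"),
  error = function(e) e
)
inherits(missing_sex, "error")
```

Keeping the response column in the test set is convenient, since it is needed afterwards for the confusion table.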


Thank you for the responses, colleagues. They make sense now.

Thanks for this idea. I didn't know earlier that the model object stores this information.
