Multicategorical independent variable in regression: convert to several dummy variables or not?

I have a question that is somewhat off-topic.

I want to do a logit regression with a multicategorical independent variable with N categories.

Am I forced to create N-1 dummy variables, or can I keep the original multicategorical independent variable?

If I have to create the dummy variables, is there any code available?

Thank you for your help.


Most regression functions in R do that automatically for you when the variable is stored as a factor. Here is an example:

levels(iris$Species)
#> [1] "setosa"     "versicolor" "virginica"
glm("Petal.Width ~ .", data = iris)
#> Call:  glm(formula = "Petal.Width ~ .", data = iris)
#> Coefficients:
#>       (Intercept)       Sepal.Length        Sepal.Width       Petal.Length  
#>          -0.47314           -0.09293            0.24220            0.24220  
#> Speciesversicolor   Speciesvirginica  
#>           0.64811            1.04637  
#> Degrees of Freedom: 149 Total (i.e. Null);  144 Residual
#> Null Deviance:       86.57 
#> Residual Deviance: 3.998     AIC: -104.1

Created on 2021-12-09 by the reprex package (v2.0.1)

You can see that two dummy variables were created for the Species column, which has 3 possible values. The first level, setosa, is the reference category, so it gets no coefficient of its own.
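For the logit case in the question, the same mechanism applies with `family = binomial`. Here is a minimal sketch, assuming the built-in `mtcars` data as a stand-in: the choice of `am` as the binary outcome and `cyl` as the categorical predictor is only for illustration and is not from the original question. `model.matrix()` shows the dummy columns R builds, in case you need them explicitly.

```r
# Illustrative data only: mtcars ships with R.
# 'am' (0/1) is the binary outcome; 'cyl' (4, 6, 8) is treated as categorical.
dat <- mtcars
dat$cyl <- factor(dat$cyl)

# Logit regression: glm() expands the factor into N - 1 = 2 dummies by itself.
fit <- glm(am ~ cyl, data = dat, family = binomial)
names(coef(fit))
#> [1] "(Intercept)" "cyl6"        "cyl8"

# The same dummy columns can be built explicitly with model.matrix().
dummies <- model.matrix(~ cyl, data = dat)
colnames(dummies)
#> [1] "(Intercept)" "cyl6"        "cyl8"
```

The level `4` comes first, so it becomes the reference category; `relevel()` can change that if another baseline makes more sense.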

Hope this helps,


You generally want the dummies. The exception is when the categories are ordered and you believe the "distance" between adjacent categories is the same throughout, that is, going from level A to level B has the same effect as going from B to C. In that case you can enter the variable as a single numeric predictor.
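One way to check whether the equal-distance assumption is reasonable is to fit both versions and compare them. This is a rough sketch, again assuming `mtcars` as stand-in data; the column names `cyl_factor` and `cyl_score` are made up for the example.

```r
# Illustrative data only: 'am' is the binary outcome, 'cyl' has 3 ordered levels.
dat <- mtcars
dat$cyl_factor <- factor(dat$cyl)             # dummy coding: one term per non-reference level
dat$cyl_score  <- as.integer(dat$cyl_factor)  # 1, 2, 3: assumes equal steps between levels

fit_dummies <- glm(am ~ cyl_factor, data = dat, family = binomial)
fit_linear  <- glm(am ~ cyl_score,  data = dat, family = binomial)

# The single-slope model is nested in the dummy model, so a likelihood-ratio
# test shows whether the equal-spacing restriction costs a significant amount of fit.
anova(fit_linear, fit_dummies, test = "Chisq")
```

If the test gives no evidence against the restricted model, the single numeric predictor may be enough; otherwise keep the dummies.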


Thank you @pieterjanvc


Thank you for your answer @startz
