Categorical variables, factors, dummy variables


I understand that, in R, the categorical variables in a dataset can and should be converted to factors using the factor() function. Once converted, a categorical variable is stored as integer codes plus a set of level labels, a more efficient data structure that R can work with when performing statistical analysis and creating graphs.
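To make that concrete, here is a minimal sketch of what a factor holds. The vector `x` and its values are made up for illustration; only base R functions are used.

```r
# Hypothetical character vector of categories
x <- c("low", "high", "medium", "low")

# Convert to a factor, stating the level order explicitly
f <- factor(x, levels = c("low", "medium", "high"))

levels(f)      # the level labels, in the order given above
as.integer(f)  # the integer codes that point into those levels
```

Without the `levels` argument, factor() sorts the levels alphabetically, which matters later because the first level becomes the reference category in a model.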

In general, R aside, for categorical variables to be used in a regression or ML model, they must first be converted into dummy variables, following the rule that a variable with N levels needs N-1 dummy variables (columns of 1s and 0s), with the remaining level serving as the baseline.

Does this also apply to factors in R, i.e. do factors need to be converted to dummy variables before being used in a statistical model? For example, Python itself has no built-in factor type, so categorical variables there typically have to be converted to dummy variables explicitly. But factors in R seem like a better version of a plain string categorical variable in Python.

Thank you!

If you are running a regression using lm() (or another formula-based modeling function such as glm()), R internally converts a factor into the necessary dummy variables.
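A small sketch of this behaviour, using a made-up data frame (`df`, `y`, and `group` are hypothetical names, not from any real dataset):

```r
# Hypothetical toy data: a numeric response and a three-level factor
df <- data.frame(
  y     = c(10, 12, 11, 20, 21, 19, 30, 29, 31),
  group = factor(rep(c("a", "b", "c"), each = 3))
)

# Pass the factor directly; no manual dummy coding is needed
fit <- lm(y ~ group, data = df)

# Three levels give two dummy coefficients; "a" is the reference level
names(coef(fit))

# The design matrix that lm() built from the factor
head(model.matrix(fit))
```

The coefficient names show one term per non-reference level, which is the N-1 rule applied automatically under R's default treatment contrasts.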


Wow. Nice that R does that for lm(). When would we instead need to convert to dummy variables? So far, all the models I have created use lm()...

Thank you for the reply.

I think most statistical modeling functions handle this for you, but I'm not sure about the others.
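One case where you would do the conversion yourself is a function that expects a numeric matrix of predictors rather than a formula (glmnet() from the glmnet package is one example, as far as I know). A minimal sketch using base R's model.matrix(), with made-up data (`df`, `size`, and `color` are hypothetical):

```r
# Hypothetical toy data with one numeric column and one factor
df <- data.frame(
  size  = c(1.2, 3.4, 2.2, 5.1),
  color = factor(c("red", "green", "blue", "green"))
)

# Expand the factor into 0/1 columns; [, -1] drops the intercept column
X <- model.matrix(~ size + color, data = df)[, -1]

colnames(X)  # the numeric column plus one dummy per non-reference level
X
```

The result is a plain numeric matrix, so it can be handed to any function that does not understand factors.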