Should a dummy variable be set as a numeric or factor vector? (lme4 function)

Hello

I am relatively new to R and the lme function.

I am running a model which includes some dummy variables (e.g. sex) as predictors. Once I have imported the dataset, these variables come in as numeric vectors.

I can see that the lme function gives two quite different models depending on whether I treat these dummy variables as factor or numeric vectors.

I am thus trying to understand why this happens (how the function treats the two kinds of vector) and how I should decide whether to use a factor or a numeric vector.

Thanks very much

Hi,

I'm not familiar with this specific type of linear model, but I think the difference arises from the encoding of the variables as categories or numbers.

A linear model can only work with numeric values. Categorical data such as factors therefore have to be converted into indicator (one-hot style) variables before they can enter the model; many modelling functions will do that for you. The conversion generates a separate binary variable for each level of the factor, equal to 1 where that level is present and 0 where it is absent (with R's default contrasts, one level is kept as the reference and gets no column of its own). This increases the size of the model and can reduce performance if the number of levels is large and the dataset is small.
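To see what that conversion looks like, here is a minimal sketch using base R's `model.matrix()`, which builds a design matrix from a formula. The `colour` variable and its values are made up for illustration, and the sketch assumes R's default treatment contrasts are in effect.

```r
# Hypothetical toy data: one categorical predictor with three levels.
colour <- factor(c("blue", "red", "green", "red", "blue", "green"),
                 levels = c("blue", "red", "green"))

# With an intercept, the first level ("blue") becomes the reference
# and the other levels each get their own 0/1 indicator column.
X <- model.matrix(~ colour)
print(colnames(X))

# Dropping the intercept gives a full one-hot coding, one column per level.
X_onehot <- model.matrix(~ colour - 1)
print(colnames(X_onehot))
```

A numeric vector, by contrast, would enter the design matrix as a single column holding the raw numbers.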

If you present the variable as numeric values, the model will interpret them as if they were on a linear scale. For example, if you say blue = 1, red = 2 and green = 3, the model thinks that blue < red < green and that the step from blue to red is the same size as the step from red to green. This is nonsensical, and although results will be generated, they do not reflect what is actually going on.
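The difference can be checked on simulated data. The sketch below uses plain `lm()` from base R as a stand-in, on the assumption that the fixed-effects part of a mixed model is coded the same way; all variable names and numbers are invented. It also includes a two-level 0/1 dummy, for which the numeric and factor versions should give the same fit, so large differences for such a variable may point to a different coding (e.g. 1/2) or to more than two levels.

```r
set.seed(1)  # reproducible toy data; every value here is made up
n <- 60
colour_num <- rep(1:3, length.out = n)    # blue = 1, red = 2, green = 3
sex_num    <- rep(c(0, 1), each = n / 2)  # a 0/1 dummy variable

# The true group means (5, 2, 8) are deliberately not on a straight line.
y <- c(5, 2, 8)[colour_num] + 1.5 * sex_num + rnorm(n, sd = 0.5)

# Three levels: numeric coding fits one slope, factor coding fits one
# coefficient per non-reference level, so the two models differ.
fit_num <- lm(y ~ colour_num)
fit_fac <- lm(y ~ factor(colour_num))
print(coef(fit_num))
print(coef(fit_fac))

# Two levels coded 0/1: both codings produce the same design column.
sex_fit_num <- lm(y ~ sex_num)
sex_fit_fac <- lm(y ~ factor(sex_num))
print(all.equal(unname(coef(sex_fit_num)), unname(coef(sex_fit_fac))))
```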

It's therefore important to know what inputs the models you use require, and how to generate the correct ones if needed. Again, some models will automatically convert factors to one-hot variables, but will always interpret numeric values as a continuous variable.
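As a rough sketch of generating the right input, a numeric column can be turned into a factor with `factor()` before fitting. The data frame, the column name `sex` and the mapping of 0 and 1 to "female" and "male" are assumptions for illustration only; check the codebook of your own dataset.

```r
# Hypothetical imported data: sex arrives as a numeric 0/1 column.
dat <- data.frame(sex = c(0, 1, 1, 0, 1))
print(is.numeric(dat$sex))

# Convert to a factor with explicit levels and readable labels.
# The first level listed ("female" here) becomes the reference level.
dat$sex_f <- factor(dat$sex, levels = c(0, 1), labels = c("female", "male"))
print(levels(dat$sex_f))

# Cross-tabulate old and new coding to confirm the mapping.
print(table(dat$sex, dat$sex_f))
```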

Hope this helps,
PJ
