Should I center/scale dummy variables?

Hey all,

As a bit of a follow-up to my previous question, I've seen disagreement online about whether I should center/scale my dummy variables prior to modeling (see the reprexes in that thread for an example). Andrew Gelman seems to say that I shouldn't, but Rob Tibshirani seems to say that I should.

Does anyone have experience with this? Would it differ depending on whether I was using glmnet/lasso vs. keras/neural networks?
(One of my favorite things about tree-based models like xgboost is that I don't have to think about these issues as much :wink:)

Thanks!


Yes, when the model requires the parameters to be on the same scale:

  • regularized models (glmnet and the like) penalize the sum of the (absolute or squared) slope values, so a predictor's scale directly affects how much its slope is shrunk (see the glmnet sketch below)
  • nearest-neighbor models use distance values, and kernel methods (e.g. SVMs) use dot products, both of which depend on the predictors' scale
  • neural networks usually initialize with random numbers and assume the predictors are on the same scale
  • PLS models chase covariance and assume that the variances are the same

and so on.
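
To make the first bullet concrete, here is a minimal sketch with simulated data (all of the names are made up) showing that the lasso penalty is scale-dependent. Note that glmnet standardizes predictors internally by default, so that is turned off here to expose the effect:

```r
# Minimal sketch with simulated data (all names here are made up) showing
# that the lasso penalty is scale-dependent. glmnet standardizes predictors
# internally by default, so standardize = FALSE is set to expose the effect.
library(glmnet)

set.seed(123)
n <- 200
x_num   <- rnorm(n)                  # numeric predictor, sd ~ 1
x_dummy <- rbinom(n, 1, 0.5)         # 0/1 dummy variable, sd ~ 0.5
y <- 1 + 2 * x_num + 2 * x_dummy + rnorm(n)

X_raw    <- cbind(x_num, x_dummy)
X_scaled <- scale(X_raw)             # center and scale every column

fit_raw    <- glmnet(X_raw,    y, alpha = 1, standardize = FALSE)
fit_scaled <- glmnet(X_scaled, y, alpha = 1, standardize = FALSE)

# At the same lambda, the penalty treats the raw and scaled dummy very
# differently, so the two fits can disagree about keeping it in the model.
coef(fit_raw,    s = 0.5)
coef(fit_scaled, s = 0.5)
```

The same idea applies to ridge and elastic net penalties, since both sum over the slope magnitudes.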

There is a decent argument for scaling them all to a variance of two but, regardless, for some models you will hurt the fit if you do not normalize the predictors as needed.
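
In practice, one way to handle this (a sketch assuming a tidymodels-style workflow; the data frame and column names are invented) is to create the dummy variables first and then center/scale everything in one step:

```r
# Sketch assuming a tidymodels-style workflow; the data frame and column
# names are invented. Dummy variables are created first, then every
# predictor (dummies included) is centered and scaled in one step.
library(recipes)

set.seed(123)
df <- data.frame(
  y     = rnorm(50),
  size  = rnorm(50),
  color = factor(sample(c("red", "blue", "green"), 50, replace = TRUE))
)

rec <- recipe(y ~ ., data = df) |>
  step_dummy(all_nominal_predictors()) |>  # factor -> 0/1 dummy columns
  step_normalize(all_predictors())         # center/scale dummies too

prep(rec) |> bake(new_data = NULL) |> head()
```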

Agreed! Low maintenance is the way to go initially.

