Need advice on dimensionality reduction for categoricals: one-hot encoding + PCA?

These two links seem to contradict each other.

you might want to combine PCA with OHE


PCA does not make sense after one hot encoding

I thought the second link made sense, but mostly because its data is synthetic and random, with an unrealistic distribution.

What are some methods for reducing dimensions for categorical covariates?
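
For concreteness, the one-hot + PCA combination being asked about can be sketched as below. This is only a minimal illustration in Python with NumPy, not code from either link; the toy data and the helper names `one_hot` and `pca_scores` are made up for the example.

```python
import numpy as np

def one_hot(columns):
    """One-hot encode a list of categorical columns (each a list of labels)."""
    blocks = []
    for col in columns:
        levels = sorted(set(col))
        index = {lev: j for j, lev in enumerate(levels)}
        block = np.zeros((len(col), len(levels)))
        # Set a single 1 per row, in the column of that row's level.
        block[np.arange(len(col)), [index[v] for v in col]] = 1.0
        blocks.append(block)
    return np.hstack(blocks)

def pca_scores(X, k):
    """Project the rows of X onto the first k principal components (via SVD)."""
    Xc = X - X.mean(axis=0)  # center each indicator column
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = s**2 / np.sum(s**2)  # share of variance per component
    return U[:, :k] * s[:k], explained[:k]

# Hypothetical toy data: two categorical predictors with three levels each.
color = ["red", "green", "blue", "red", "green", "blue", "red", "red"]
size = ["S", "M", "L", "S", "S", "M", "L", "M"]

X = one_hot([color, size])             # 8 rows, 3 + 3 indicator columns
scores, explained = pca_scores(X, k=2)
print(X.shape, scores.shape)
print(explained)
```

The indicator matrix has one column per level, so its width grows with the total number of levels across all predictors; PCA then compresses those columns without looking at y.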

My group supports a package that helps build low-dimensional representations of categorical variables for supervised machine learning: CRAN - Package vtreat (intro here: README).

Thanks, I'm getting around to looking into it and it seems interesting.

So let's say there are 100 categorical predictors and vtreat reduces them to 10. How does that affect performance when predicting y? Looking at the README here, I couldn't find benchmarks on, for example, computational efficiency or consistency. A UMAP paper, for instance, compares consistency between a 10% subsample and the full data set (on page 35).

I see that there is a video lecture on it, but I could not find a screencast with a coding example for categorical dimension reduction. Do you happen to know of one?

vtreat is designed to treat each categorical variable individually, coding each to a small number of columns instead of exploding it into a huge number of indicators. Joint dimension reduction of the resulting columns is outside the scope of vtreat and is left to other tools. A longer discussion of vtreat, with measured performance on some examples, can be found here: [1611.09477] vtreat: a data.frame Processor for Predictive Modeling.
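
To illustrate the general idea of coding one categorical variable to a single numeric column, here is a rough sketch in plain Python. It is a simplified smoothed target-mean encoding, not vtreat's actual algorithm or API; the function names, the `smoothing` parameter, and the toy data are all assumptions made for the example.

```python
from collections import defaultdict

def fit_target_encoding(levels, y, smoothing=1.0):
    """Learn one number per level: a smoothed mean of y for that level.

    Simplified stand-in for illustration; not vtreat's implementation.
    """
    grand_mean = sum(y) / len(y)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for lev, target in zip(levels, y):
        sums[lev] += target
        counts[lev] += 1
    # Shrink each level mean toward the grand mean; rare levels shrink more.
    table = {
        lev: (sums[lev] + smoothing * grand_mean) / (counts[lev] + smoothing)
        for lev in counts
    }
    return table, grand_mean

def apply_target_encoding(levels, table, grand_mean):
    """Replace each level by its learned value; unseen levels get the grand mean."""
    return [table.get(lev, grand_mean) for lev in levels]

# Hypothetical toy data: one categorical predictor and a numeric outcome.
city = ["A", "A", "B", "B", "B", "C"]
y = [1.0, 3.0, 10.0, 12.0, 14.0, 5.0]

table, grand_mean = fit_target_encoding(city, y)
encoded = apply_target_encoding(city + ["D"], table, grand_mean)
print(table)
print(encoded)
```

However many levels `city` has, the result is one numeric column rather than one indicator per level. Because the encoding uses y, it should be learned on rows separate from those used to fit the downstream model; otherwise the encoded column leaks the outcome.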

You can have a look at the methods implemented in FactoMineR. They have a book, a bunch of YouTube videos and an online MOOC. I'm currently attending the MOOC and find it very useful.