In need of advice on dimensionality reduction on categoricals: one-hot coding + PCA?

Looking at these two links, there is a bit of contradiction.

you might want to combine PCA with OHE

vs.

PCA does not make sense after one hot encoding

I thought the second link made more sense, but that may be largely because its example data is synthetically random, with a distribution unlikely to occur in practice.

What are some methods for reducing dimensions for categorical covariates?

My group supports a package that builds low-dimensional representations of categorical variables for supervised machine learning: CRAN - Package vtreat (intro here: README).


Thanks, I'm getting around to looking into it and it seems interesting.

So let's say there are 100 categorical predictors and vtreat reduces them to 10. How would predictive performance on y change? Looking at the README, I couldn't find benchmarks on, for example, computational efficiency or consistency. For comparison, the UMAP paper compares consistency on a 10% subsample vs. the full data set (page 35).

I see that there is a video lecture on it, but I could not find a screencast with example code for categorical dimension reduction. Do you happen to know of one?

vtreat is designed to treat each categorical variable individually, coding each to a small number of columns instead of exploding it into a huge number of indicators. Joint dimension reduction of the produced columns is not part of vtreat's scope and is left to other tools. A longer discussion of vtreat, with measured performance on some examples, can be found here: [1611.09477] vtreat: a data.frame Processor for Predictive Modeling.
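For intuition, the per-variable coding vtreat applies to high-cardinality categoricals is in the family of impact (target) coding: replace each level with a smoothed estimate of how that level's mean outcome deviates from the grand mean. The sketch below is a simplified illustration in Python/pandas, not vtreat's actual estimator or API (vtreat also uses cross-frames to avoid nested-model bias, which is omitted here):

```python
import pandas as pd

def impact_code(x: pd.Series, y: pd.Series, smoothing: float = 10.0) -> pd.Series:
    """Simplified impact coding: map each level of x to a smoothed
    deviation of the level's mean outcome from the grand mean."""
    grand = y.mean()
    stats = y.groupby(x).agg(["mean", "count"])
    # Shrink small levels toward the grand mean (Laplace-style smoothing)
    coded = (stats["count"] * stats["mean"] + smoothing * grand) / (
        stats["count"] + smoothing
    )
    return x.map(coded - grand)

df = pd.DataFrame({"zip": ["a", "a", "b", "b", "c"], "y": [1, 0, 1, 1, 0]})
df["zip_impact"] = impact_code(df["zip"], df["y"])
```

One categorical column becomes one numeric column, regardless of how many levels it has, which is why no joint dimension reduction step is needed afterward.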


You can have a look at the methods implemented in FactoMineR. They have a book, a bunch of YouTube videos and an online MOOC. I'm currently attending the MOOC and find it very useful.
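FactoMineR's main tool for categorical data is multiple correspondence analysis (MCA), which is essentially correspondence analysis applied to the 0/1 indicator matrix. As a rough sketch of that computation in Python/NumPy (a simplified illustration, not FactoMineR's actual code, which also handles supplementary variables, eigenvalue corrections, etc.):

```python
import numpy as np

def mca_row_coords(Z: np.ndarray, n_components: int = 2) -> np.ndarray:
    """Principal row coordinates from correspondence analysis of a
    0/1 indicator matrix Z (the core of MCA)."""
    P = Z / Z.sum()                      # correspondence matrix
    r = P.sum(axis=1)                    # row masses
    c = P.sum(axis=0)                    # column masses
    # Standardized residuals from the independence model
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, s, _ = np.linalg.svd(S, full_matrices=False)
    # Row coordinates: D_r^{-1/2} U Sigma
    return (U[:, :n_components] * s[:n_components]) / np.sqrt(r)[:, None]

# Indicator matrix for 4 observations, 2 variables with 2 levels each
Z = np.array([[1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]], dtype=float)
coords = mca_row_coords(Z, n_components=2)
```

Unlike plain PCA on one-hot columns, the chi-square weighting accounts for level frequencies, which is one reason MCA is the standard recommendation for categorical data.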