Among three factor variables, what are some approaches to finding which one best predicts Y?

Let's say there are:
a) three factor covariates
b) a Y variable, which may be continuous or binary

If I had to choose one covariate to keep as-is and run PCA on the other two (just for the sake of ideation, not real modelling) to predict the Y variable, it would make sense to pick the covariate that shows the least error against Y, correct (assuming there is no overfitting)?

So what would be the best approach to find this one covariate? Correlation? Regression/logit after one-hot encoding? Something else?

I would first check the VIFs in a model with all variables (with the factors dummy-coded). If the VIFs are all under 5, I would include an OLS model with standard errors and p-values in my analysis.
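
Here is a minimal sketch of that first step in Python (statsmodels), using a made-up data frame `df` with hypothetical factor columns `A`, `B`, `C` and a continuous outcome `y`; with factors, the VIFs are computed on the dummy-coded design matrix:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical data: three factors and a continuous outcome
rng = np.random.default_rng(0)
n = 120
df = pd.DataFrame({
    "A": rng.choice(["a1", "a2", "a3"], n),
    "B": rng.choice(["b1", "b2"], n),
    "C": rng.choice(["c1", "c2", "c3"], n),
})
df["y"] = (df["A"] == "a2") * 2.0 + rng.normal(size=n)

# Dummy-code the factors (drop_first avoids the dummy-variable trap)
X = pd.get_dummies(df[["A", "B", "C"]], drop_first=True).astype(float)
X = sm.add_constant(X)

# VIF for each dummy column (skip the intercept)
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(1, X.shape[1])],
    index=X.columns[1:],
)
print(vif)

# If the VIFs look fine, fit OLS and inspect standard errors and p-values
print(sm.OLS(df["y"], X).fit().summary())
```

(For a binary Y, you would use `sm.Logit` in place of `sm.OLS`.)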

If the VIFs are too high, I would fit an elastic net model with k-fold cross-validation and report standardized coefficients, perhaps bootstrapping those coefficients as well.
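
Continuing the same hypothetical `df` and dummy coding from the previous sketch, here is roughly what that looks like with scikit-learn; standardizing the dummies first makes the penalized coefficients comparable on a common scale, and a crude bootstrap gives a rough sense of their stability:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import ElasticNetCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_dummies = pd.get_dummies(df[["A", "B", "C"]], drop_first=True).astype(float)

# Elastic net with 5-fold CV over a small grid of l1_ratio values
enet = make_pipeline(
    StandardScaler(),
    ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9], cv=5, random_state=0),
)
enet.fit(X_dummies, df["y"])

coefs = pd.Series(enet.named_steps["elasticnetcv"].coef_, index=X_dummies.columns)
print(coefs.sort_values(key=abs, ascending=False))

# Crude bootstrap of the penalized coefficients (200 resamples)
boot_rng = np.random.default_rng(1)
boot = []
for _ in range(200):
    idx = boot_rng.integers(0, len(df), size=len(df))
    enet.fit(X_dummies.iloc[idx], df["y"].iloc[idx])
    boot.append(enet.named_steps["elasticnetcv"].coef_)
print(pd.DataFrame(boot, columns=X_dummies.columns).std())
```

(For a binary Y, `LogisticRegressionCV` with an elastic-net penalty plays the analogous role.)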

As for PCA on the factor variables: you can't perform PCA on the factors themselves. You can perform it on their dummy variables, but that seems a little strange, and I'm not sure it will be fruitful for you. Having the principal components in the model will surely take away from its descriptive value. That may be OK if it has a benefit for the predictive value, but I'm not sure it will, assuming you use a model that can control overfitting (like elastic net).
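
If you did want to experiment with it anyway, this is roughly what PCA on the dummies of the two factors you are not keeping would look like (still the hypothetical `df` from above, keeping `A` and compressing `B` and `C`); whether the components add any predictive value is an empirical question:

```python
import pandas as pd
from sklearn.decomposition import PCA

# PCA on the dummy-coded versions of the two "other" factors
other_dummies = pd.get_dummies(df[["B", "C"]], drop_first=True).astype(float)
pca = PCA(n_components=2)
components = pca.fit_transform(other_dummies)
print(pca.explained_variance_ratio_)
```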
