Optimizing multiple regression with too many variables - suggestions?

Hey there!

I appreciated the help on my last post regarding normalizing data. I went ahead and did that, and also ran a factor analysis to reduce some variables in my data. Now I am trying to find a good model, but cannot seem to figure it out because I still have too many variables.

A bit of background on what I am trying to do. I took a survey and distributed it to our employees. Each question in the survey measures a behavior or perception. Now I want to see which behaviors and perceptions are correlated with our high performers. So my independent variable is "PerformanceRating" and my dependent variables are "Cognition", "Initiative", "EmotionalStability", and so on. (Even after my factor analysis and combining 4 variables into 1 factor, I still have 19 dependent variables...)

I ran a regression model with all of the variables to see if any were significant. I found a few, removed all the other variables, but my R-squared values were still not significant.

Do you perhaps have a suggestion on how to optimize my model?

Thanks in advance!

I think you have the terms independent variable and dependent variable confused above. Performance rating is your dependent variable as you describe it. You explained in your previous post that the other variables are on Likert scales. What does performance rating look like? Is it continuous or also on a scale? What kind of regression model are you using? Why are you trying to model this - are you trying to make predictions, are you trying to understand how to interpret the coefficients, or something else?

Here's a thought to consider: it may be that none of these factors predicts performance rating well.

Hey @StatSteph ,

Thanks for the help. Yes, I mixed up dependent and independent variables in my original post. My dependent variable is performance rating, and the independents are behaviors and perceptions. Performance rating is linear 1-5, and we can assume increasing from a 1 to 2 is the same as 4 to 5. As of now, I am only using a simple lm function in R. I am not trying to make predictions; I am more trying to understand the coefficients and which have the strongest impact on performance rating. And sure, it could be that none of the behaviors are strong predictors of performance rating.

How many observations do you have?

I'm not sure what you mean when you say the rsquared isn't significant. Have you done an F-test for the joint significance of all the variables? The reason I ask is that in this sort of thing it's common that the independent variables do have some explanatory power but that none of them are individually significant because their separate contributions can't be estimated.

@startz ,

I have about 130 observations. And I simply mean that my R-squared and adjusted R-squared are below .3, along with a low p-value. I haven't done an F-test yet...

Given your dependent variable is on a discrete scale of 1 to 5, I wouldn't recommend using multiple linear regression. You might want to look into ordinal logistic regression. This might be a good starting point: Ordinal Logistic Regression | R Data Analysis Examples

A low p-value suggests that the variables are jointly significant. This may be a case in which the data indicates that the independent variables do jointly have explanatory power, but there isn't a way to separate how much an individual variable matters.

While @StatSteph gives good advice, if you are happy with your specification

> we can assume increasing from a 1 to 2 is the same as 4 to 5

then a standard regression is okay.

I'm sorry, but I disagree, because the assumption of normally distributed residuals simply won't be met.

Normally distributed errors are of almost no consequence. They are not required for the regression to be unbiased. They are not required for the formulas for the coefficient variances to be correct.

Normally distributed errors are required for the coefficient estimates to be normally distributed in finite samples. But even without normally distributed errors the central limit theorem applies. Since there are 130 observations, the CLT is likely to be applicable.

To interrupt the normality argument for a moment, I am not too confident of the stability of any estimates with an n of 130. The more variables, the worse the situation is likely to get, and those instruments usually have low reliability.

I am not sure of a solution, but the OP might want to go heavy on descriptives and plots?
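Even something as simple as the marginal correlations of each factor with the rating would be a start. A rough sketch (Python/pandas, with invented data and column names; summary() and cor() in R do the same job):

```python
# Quick-and-dirty descriptives: summary stats plus each predictor's
# marginal correlation with the outcome. Data and names are invented.
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
df = pd.DataFrame({
    "PerformanceRating": rng.integers(1, 6, size=130),  # fake 1-5 ratings
    "Cognition": rng.normal(size=130),
    "Initiative": rng.normal(size=130),
})

print(df.describe())
# Spearman is a reasonable default for Likert-type data
corrs = df.corr(method="spearman")["PerformanceRating"].drop("PerformanceRating")
print(corrs.sort_values(key=abs, ascending=False))
```

If nothing shows even a modest marginal correlation, that's useful information before any more modelling.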

@jrkrideau @startz @StatSteph

I really appreciate all of your input! However, I am a bit more confused than when I began :laughing:

My main issue is that I simply have too many variables and need to remove some before running any sort of regression. I did a PCA and factor analysis, and was able to reduce a few of the variables into one factor. But like I mentioned in the OP, I still have 19 variables. Are there any simple analyses in R that you recommend? @StatSteph recommended ordinal regression... can I do that with 19 independent variables?

You can certainly follow @StatSteph's advice and do an ordinal logit with 19 variables. Whether the results will be any "better" would have to be seen.

It's entirely possible that the data is insufficient to identify which of the explanatory factors really matter. If you have a reason to believe that one or more variables are really measuring (more-or-less) the same thing, you can put in just one of those variables and omit the others. Or you could make an index combining the variables. If you do this, you need to remember that what you are identifying is not the effect of the one variable you include, but a proxy for that variable combined with the ones you omit.
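For concreteness, the index idea is just something like this (invented column names; in R it's the same rowMeans(scale(...)) trick):

```python
# Combine several variables that plausibly measure the same thing into
# one index by averaging their standardized values. Names are invented.
import pandas as pd

df = pd.DataFrame({
    "Cognition":  [3, 4, 2, 5],
    "Initiative": [2, 5, 3, 4],
    "Drive":      [3, 5, 2, 4],
})
z = (df - df.mean()) / df.std()         # put items on a common scale
df["MotivationIndex"] = z.mean(axis=1)  # one column replaces three
print(df["MotivationIndex"])
```

You would then regress on the index instead of the three separate columns, keeping in mind that its coefficient reflects the bundle, not any single item.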

If you have 130 rows and 20 columns, perhaps it could be shared here, so long as the data is anonymised?