Multiple regression: Dependent variable not normally distributed

I want to fit a multiple regression with "posn_neg_5" as the dependent variable and a mixture of continuous and binary independent variables. However, posn_neg_5 is not normally distributed (see below). I would like to avoid a transformation so that the output remains in units that readers can understand and interpret.

table(posn_neg_5)
posn_neg_5
  1   2   3   4   5
 54  29  28 263 468

hist(posn_neg_5)

shapiro.test(posn_neg_5)
        Shapiro-Wilk normality test
data:  posn_neg_5
W = 0.66667, p-value < 2.2e-16

Should I use 'quantreg' (quantile regression) instead? Is that approach generally accepted?

Thanks,
Stephen

The statistics that are typically calculated along with linear regression (standard errors, t-tests, p-values) assume that the residuals are normally distributed. It's not necessarily a problem that your response variable is non-normal; what matters are the residuals. I would go ahead and fit the model you have in mind and then test the residuals.
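As a rough sketch of that workflow: the response below is rebuilt from the counts in the question's table, but the predictors `x_cont` and `x_bin` are simulated stand-ins (assumptions, not the real variables). The point is only that `shapiro.test()` and the Q-Q plot are applied to `residuals(fit)` rather than to the raw response.

```r
# Hypothetical data: the response reuses the counts from the table in the
# question; x_cont and x_bin are simulated placeholders for the real predictors.
set.seed(1)
posn_neg_5 <- sample(rep(1:5, times = c(54, 29, 28, 263, 468)))
n <- length(posn_neg_5)
dat <- data.frame(
  posn_neg_5 = posn_neg_5,
  x_cont = rnorm(n),          # stand-in continuous predictor
  x_bin  = rbinom(n, 1, 0.5)  # stand-in binary predictor
)

# Fit the linear model first, then inspect its residuals.
fit <- lm(posn_neg_5 ~ x_cont + x_bin, data = dat)
summary(fit)

# Shapiro-Wilk test on the residuals, not on the response itself.
res <- residuals(fit)
sw <- shapiro.test(res)
print(sw)

# Visual check to go with the formal test.
qqnorm(res)
qqline(res)
```

With a sample this large, the Shapiro-Wilk test tends to flag even small departures from normality, so it is probably best read together with the Q-Q plot rather than on its own.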

If you still think you require a transformation, then I would do it, and try to use effects plots plotted on the original scale to help your readers understand the relationship between the predictors and the response.

One thing I notice is that your response variable appears to be on a discrete 1-5 scale. This might push you to treat it as an ordinal categorical variable and do your regression with polr (from the MASS package).
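A minimal sketch of the ordinal approach, assuming the MASS package is installed. The response is again rebuilt from the table counts, and `x_cont` and `x_bin` are hypothetical simulated predictors, so the estimates themselves mean nothing here.

```r
library(MASS)  # provides polr()

# Hypothetical data: response from the table counts, simulated predictors.
set.seed(1)
posn_neg_5 <- sample(rep(1:5, times = c(54, 29, 28, 263, 468)))
n <- length(posn_neg_5)
dat <- data.frame(
  # polr() needs the response as an (ordered) factor
  posn_neg_5 = factor(posn_neg_5, levels = 1:5, ordered = TRUE),
  x_cont = rnorm(n),
  x_bin  = rbinom(n, 1, 0.5)
)

# Proportional-odds logistic regression; Hess = TRUE stores the Hessian
# so that summary() can report standard errors.
fit_ord <- polr(posn_neg_5 ~ x_cont + x_bin, data = dat, Hess = TRUE)
summary(fit_ord)

# Coefficients are on the log-odds scale; exponentiate for odds ratios.
exp(coef(fit_ord))
```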

Let me add a bit to @arthur.t's helpful response. If you have a large sample, then it's not very important that the residuals be normally distributed. As @arthur.t points out, though, it does matter for the auxiliary statistics if the sample is small.

However, when the dependent variable is 0/1 and you fit it by least squares (sometimes called a linear probability model), you are likely to have heteroskedasticity, which does affect the validity of the auxiliary statistics.
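One common way to deal with this is to report heteroskedasticity-robust standard errors. The sketch below computes the basic White (HC0) estimator by hand in base R for a simulated 0/1 response; the data and the names `x`, `y`, and `lpm` are assumptions made for illustration only.

```r
# Hypothetical 0/1 response generated from a logistic relationship.
set.seed(1)
n <- 500
x <- rnorm(n)
y <- rbinom(n, 1, plogis(0.5 * x))

# Linear probability model: ordinary least squares on a binary response.
lpm <- lm(y ~ x)

# HC0 sandwich estimator: (X'X)^-1 X' diag(e^2) X (X'X)^-1
X <- model.matrix(lpm)
e <- residuals(lpm)
bread <- solve(crossprod(X))   # (X'X)^-1
meat  <- crossprod(X * e)      # X' diag(e^2) X (row i of X scaled by e_i)
vcov_hc0 <- bread %*% meat %*% bread

# Compare classical and robust standard errors side by side.
se_classical <- sqrt(diag(vcov(lpm)))
se_robust    <- sqrt(diag(vcov_hc0))
cbind(estimate = coef(lpm), se_classical, se_robust)
```

In practice the same matrix can be obtained from `vcovHC(lpm, type = "HC0")` in the sandwich package, if that package is available; the hand-rolled version is shown only to keep the example dependency-free.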

Thanks both,
And yes to both points. I was showing my ignorance: it's the residuals I need to check.

  1. Normality: Deciding whether a Q-Q plot is adequately normal has always seemed a bit arbitrary. How would I apply a Shapiro-Wilk test directly after the 'Summary' stats?

  2. "polr": I had been concerned about the dependent var not being a true continuous variable. I take it 'polr' would be the safer option?

Using polr (assuming a large sample) has the advantages that

  1. Auxiliary statistics will be right.
  2. Predicted probabilities for each response category will be between zero and one.
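To illustrate the second point, here is a small sketch (assuming the MASS package, with the response rebuilt from the table counts and hypothetical simulated predictors `x_cont` and `x_bin`). `predict()` with `type = "probs"` returns one probability per response category for each new observation.

```r
library(MASS)  # provides polr()

# Hypothetical data: response from the table counts, simulated predictors.
set.seed(1)
posn_neg_5 <- sample(rep(1:5, times = c(54, 29, 28, 263, 468)))
n <- length(posn_neg_5)
dat <- data.frame(
  posn_neg_5 = factor(posn_neg_5, levels = 1:5, ordered = TRUE),
  x_cont = rnorm(n),
  x_bin  = rbinom(n, 1, 0.5)
)
fit_ord <- polr(posn_neg_5 ~ x_cont + x_bin, data = dat, Hess = TRUE)

# Three made-up predictor profiles to predict for.
new_obs <- data.frame(x_cont = c(-1, 0, 1), x_bin = c(0, 1, 1))

# One row per profile, one column per response category (1 to 5).
probs <- predict(fit_ord, newdata = new_obs, type = "probs")
round(probs, 3)

# Each row is a probability distribution over the five categories.
rowSums(probs)
```

A linear model fitted to the same data gives a single predicted value per profile, which is easier to read but is not guaranteed to stay within the 1-5 range.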

But what you're doing isn't unreasonable and the results are somewhat easier to interpret.
