Interpreting OLS graphs for the quality of a Linear Regression Model

I am working with a wine quality dataset (Wine Quality Dataset | Kaggle),

using the alcohol content as the dependent variable.

I am creating a manual regression model, i.e. removing the predictors I think have less impact, and then comparing that to a backwards stepwise regression to determine whether my model is better than the automatically generated one.

Looking at pure F-stat values, as well as R^2 and degrees of freedom, I want to say that my model is better, but I am having trouble interpreting the OLS graphs.

My manual model has an F-stat of 374 on 1135 DF and an R^2 of 0.6968, whereas the stepwise regression has an F-stat of 274.4 on 1132 DF and an R^2 of 0.7086.

I removed chlorides and both free and total sulfur dioxide, while the stepwise model removed only free sulfur dioxide.
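For reference, here is roughly how I set the two models up in R. This is just a sketch; I am assuming the Kaggle red wine CSV (winequality-red.csv) and the column names R produces for it, so adjust the file name, separator, and variable names to match your data:

```r
# Read the data (the original UCI/Kaggle file is semicolon-delimited;
# change sep if your copy differs)
wine <- read.csv("winequality-red.csv", sep = ";")

# Full model: alcohol as the dependent variable, everything else as predictors
full_model <- lm(alcohol ~ ., data = wine)

# Manual model: drop chlorides and the two sulfur dioxide variables
manual_model <- lm(
  alcohol ~ . - chlorides - free.sulfur.dioxide - total.sulfur.dioxide,
  data = wine
)

# Backwards stepwise regression starting from the full model
stepwise_model <- step(full_model, direction = "backward", trace = 0)

# F-stat, DF, and R^2 are reported at the bottom of each summary
summary(manual_model)
summary(stepwise_model)
```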

From what I understand (which is pretty basic, as I am just learning the language):

Normality, which is checked in the top-right plot, seems to hold until the upper end, where there are some outlier values.

Linearity, which is checked in the top-left plot, seems to follow a roughly linear pattern, but there are a lot more outliers. The curve in the residuals, which seems to increase as the fitted Y value increases, suggests that heteroscedasticity exists.

Looking at homoscedasticity in the bottom-left plot, we can see that it follows the median line pretty well, with only some minor deviation. Typically, if we saw a random distribution of values and a flat red line, we could be sure there is no heteroscedasticity; as this is not the case here, we suspect that heteroscedasticity exists.
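Rather than eyeballing the plots alone, a formal check for heteroscedasticity can help. Here is a minimal sketch: it regenerates the four standard diagnostic plots and runs a Breusch-Pagan test from the lmtest package (assuming the manual_model object from above):

```r
# The four standard lm diagnostic plots in a 2x2 grid:
# residuals vs fitted (top left), normal Q-Q (top right),
# scale-location (bottom left), residuals vs leverage (bottom right)
par(mfrow = c(2, 2))
plot(manual_model)

# Breusch-Pagan test: a small p-value is evidence against
# the constant-variance (homoscedasticity) assumption
library(lmtest)
bptest(manual_model)
```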

As we see possible outliers, we can say that the OLS assumptions are not fully met, based on our testing.

But I am hoping that some of you with much more knowledge might be able to better explain what I am looking at.

A few thoughts; I'm sure others will chime in as well.

(1) Normality is of no importance in a regression, especially if there is a large number of observations, as is the case here.

(2) Checking for outliers is important, but what to do isn't obvious. Points with a lot of leverage are particularly influential in determining the estimates. If those points don't obey the model, that's bad. But if they do obey the model, that's really good. So you may want to use what you've found as a diagnostic to ask if there is something odd about those points, but the diagnostic itself doesn't answer the question of whether or not they should be kept.
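For what it's worth, base R will give you the usual leverage and influence measures directly. A quick sketch, assuming a fitted lm object called manual_model and the data frame wine from earlier in the thread:

```r
# Leverage: diagonal of the hat matrix, one value per observation
lev <- hatvalues(manual_model)

# Cook's distance combines leverage and residual size into a
# single influence measure per observation
cd <- cooks.distance(manual_model)

# A common rough rule of thumb flags points with Cook's distance > 4/n
n <- nobs(manual_model)
flagged <- which(cd > 4 / n)

# Look at the flagged rows to see whether anything is odd about them
wine[flagged, ]
```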

There are some outliers that I have checked, but nothing too important.

I think my largest issue is deciding whether or not the variables I removed made a better model, as that is the end goal: creating a model that is more accurate than the one generated by the backwards stepwise regression.

There is essentially no difference in your fit.
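(Keep in mind that plain R^2 can only go up as predictors are added to a nested model, so adjusted R^2 is the fairer in-sample comparison. A one-line check, assuming the two fitted models from above:)

```r
# Adjusted R^2 penalizes for the number of predictors, unlike plain R^2
summary(manual_model)$adj.r.squared
summary(stepwise_model)$adj.r.squared
```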

You might want to randomly take out 100 or so observations, re-estimate both ways, then use the estimated models to forecast the hold-out data and see if there is any difference.
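A minimal sketch of that hold-out comparison, assuming the wine data frame and the two model specifications from above:

```r
set.seed(42)  # for reproducibility

# Randomly hold out 100 observations
holdout <- sample(nrow(wine), 100)
train <- wine[-holdout, ]
test  <- wine[holdout, ]

# Re-estimate both models on the training data only
manual_fit <- lm(
  alcohol ~ . - chlorides - free.sulfur.dioxide - total.sulfur.dioxide,
  data = train
)
step_fit <- step(lm(alcohol ~ ., data = train),
                 direction = "backward", trace = 0)

# Forecast the hold-out data and compare root mean squared error;
# the lower RMSE generalizes better on this split
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
rmse(test$alcohol, predict(manual_fit, newdata = test))
rmse(test$alcohol, predict(step_fit,   newdata = test))
```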
