How does lm differentiate polynomial vs multiple regression?

Let me ask this one last question. If you wanted to run a multiple linear regression AND a polynomial regression using lm() on this exact data:

#> $m2b
#> list(mpg, hp, hp2, wt)

How would you write the two commands? And I don't mean just removing the hp2 column before running it, but actually fitting a multiple regression best-fit model on those 4 columns.

I understand that it is nonsensical to run a "multiple linear regression" on this data, which includes hp2, but let's say I wanted to do it anyway, just to see the output or see what happens.

How would I write that?

There's nothing "magic" about lm. It uses a linear algebra algorithm called QR decomposition to find the least squares solution for the coefficients. In statistics courses, the mathematics of linear regression is usually first introduced by deriving the normal equations, b = (X'X)^-1 X'y, which invert a matrix built from the design matrix to solve for the regression coefficients. But QR decomposition and various other methods are typically used in practice because they're faster, more numerically stable, or both. (Maybe this Q&A will be helpful.)
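For instance, here's a minimal sketch (using the built-in mtcars data purely for illustration) showing that lm()'s coefficients match a direct least-squares solution, whether you compute it with the textbook normal-equations formula or with a QR decomposition:

X <- model.matrix(~ hp + wt, data = mtcars)          # design matrix, with intercept column
y <- mtcars$mpg

beta_normal <- drop(solve(t(X) %*% X, t(X) %*% y))   # textbook formula: (X'X)^-1 X'y
beta_qr     <- qr.solve(X, y)                        # least squares via a QR decomposition
beta_lm     <- coef(lm(mpg ~ hp + wt, data = mtcars))

cbind(beta_normal, beta_qr, beta_lm)                 # all three agree up to rounding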

I think you're getting hung up on terminology. All of the formulas below are "multiple linear regression" (which just means regression with two or more independent variables) and lm fits them exactly the same way. The first one happens to be a flat-plane function, while the other three (which are exactly the same regression function mathematically) include a second-order polynomial term, and are therefore curved:

mpg ~ hp + wt

mpg ~ hp + hp2 + wt

mpg ~ hp + I(hp^2) + wt

mpg ~ poly(hp, 2, raw=TRUE) + wt
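To make that concrete, here's a quick sketch (recreating the hp2 column from mtcars, to mirror your data) that fits all four formulas. The first one is the flat-plane fit you asked about; the other three are the same curved model and produce identical fitted values:

mtcars$hp2 <- mtcars$hp^2                                              # recreate the squared column

fit_flat <- lm(mpg ~ hp + wt, data = mtcars)                           # flat plane
fit_col  <- lm(mpg ~ hp + hp2 + wt, data = mtcars)                     # via the pre-made column
fit_I    <- lm(mpg ~ hp + I(hp^2) + wt, data = mtcars)                 # via an inline transformation
fit_poly <- lm(mpg ~ poly(hp, 2, raw = TRUE) + wt, data = mtcars)      # via poly()

all.equal(fitted(fit_col), fitted(fit_I))     # TRUE
all.equal(fitted(fit_col), fitted(fit_poly))  # TRUE

So the two commands you asked about are just fit_flat plus any one of the three curved versions; lm() treats hp2 as just another column.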

Damn!... OK, that actually makes lots of sense. Yikes, that's pretty mind-blowing for me to have gotten it so wrong.

So it sounds like the data is really "all that matters"... the data is the main driver of how the curve/hyperplane looks and what the coefficients are, NOT the "type of regression". It sounds like simple regression, multiple regression, and polynomial regression are literally all the same thing...

And the reason that's all that matters is that they're all solved the same way, so really your data columns are the true determiners of whether the plane is curved or flat, etc.

I was getting stuck because I've read so many tutorials and lessons about this over the past few weeks, and they all make it a huge point to treat them quite differently, as if they were all solved differently. They make it worse by repeatedly presenting DIFFERENT formulas over, and over... and over... and over.

... it won't let me reply anymore for 1 hour

Does this mean that if you have mpg ~ hp + xyz + wt, where xyz is not SPECIFICALLY hp^2 but some other IV that resembles a series of quickly increasing numbers, the plane will ALWAYS be curved to some extent whenever there are 3+ terms? Is it just that unless the term is a ^2 or ^3 the curvature won't be very noticeable (i.e., ALMOST flat, but not perfectly flat)?

Not quite. The mathematical form of the regression function is the main driver of the general shape of the regression function. If you fit the model mpg = b_0 + b_1hp + b_2wt, you're going to get a flat plane. If you fit the model mpg = b_0 + b_1hp + b_2hp^2 + b_3wt, you're going to get a parabolic surface (in the hp direction). This is so regardless of whether either model makes logical or scientific sense for your specific problem or whether the data distribution has anything close to the shape of a flat plane or of a parabola. The data is what determines the specific values of the coefficients, given the mathematical function you've chosen to fit. One key aspect of regression modeling is choosing a mathematical model that makes sense (theoretically, empirically, etc.) for the problem you're trying to investigate.
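As a sketch of that point, here's a made-up xyz column of quickly increasing numbers (hypothetical, purely for illustration). Entered as an ordinary linear term it still gives a flat hyperplane, whereas the squared term gives curvature in the hp direction:

mtcars$xyz <- (1:nrow(mtcars))^3                          # made-up, fast-growing predictor

fit_linear <- lm(mpg ~ hp + xyz + wt, data = mtcars)      # linear in all three predictors
fit_quad   <- lm(mpg ~ hp + I(hp^2) + wt, data = mtcars)  # quadratic in hp

grid <- data.frame(hp  = seq(50, 300, by = 50),           # walk along hp, holding the others fixed
                   xyz = mean(mtcars$xyz),
                   wt  = mean(mtcars$wt))

diff(predict(fit_linear, grid))   # constant steps: the surface is flat in hp
diff(predict(fit_quad, grid))     # changing steps: the surface curves in hp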

One other misconception I suspect you're having is that when you create a new column like mtcars$hp2 = mtcars$hp^2 or dataset$Level2 = dataset$Level^2 you believe that your data now requires a curved surface as the regression model (due to the squared values we just created). This is not the case! These new squared-value columns are not "data" and they contain no new information that wasn't in your existing data frame. Rather, they are transformations of your existing data. You should include them in your regression only if you, the analyst, think it makes sense to model your data with a function that includes a second-order polynomial term for the Level variable. (And if you do want to include a second order polynomial term, you can do it in the other ways I showed above, rather than by adding a new column to your data.)

It is incorrect to think of these squared (or cubed, etc.) columns as some intrinsic aspect of your data that "requires" a curved regression function. You, the analyst, choose the specific mathematical form of the regression function based on your subject knowledge and your research goals.
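A quick check of this (assuming the hp2 column created earlier is still sitting in the data frame): its mere presence changes nothing unless you put it in the formula.

with_extra_col <- lm(mpg ~ hp + wt, data = mtcars)                         # hp2 exists but isn't used
without_col    <- lm(mpg ~ hp + wt, data = mtcars[, c("mpg", "hp", "wt")]) # hp2 dropped entirely

all.equal(coef(with_extra_col), coef(without_col))   # TRUE: the extra column is not "data" the model must honor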

I have edited this so many times.

I think what I might have been stuck on is thinking there could be multiple output linear regressions (curved plane, flat plane) for a single input dataset, just by changing the formula it uses to compute the coefficients, without changing the dataset and without adding or removing transformation columns.

It appears this assumption is false.

The bottom line, it seems, is that I should make sure my dataset columns and transformation columns match the formula of the model I want to fit, and beyond that I don't need to think too hard about it.
