linear model regression formula

Hi, in the following code, how did the lm() call know to include Verb and Math as variables in the formula? Also, what does the I do in I(Verb^2)? Thank you.

colnames(collgpa)
[1] "ID" "Verb" "Math" "Gpa"
model <- lm(Gpa ~ Verb*Math +I(Verb^2) + I(Math^2), data = collgpa)
summary(model)

Call:
lm(formula = Gpa ~ Verb * Math + I(Verb^2) + I(Math^2), data = collgpa)

Residuals:
     Min       1Q   Median       3Q      Max
-0.50180 -0.05485  0.02719  0.10687  0.35148

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -7.2229715  1.4720009  -4.907 2.27e-05 ***
Verb         0.1262617  0.0230892   5.468 4.23e-06 ***
Math         0.1170340  0.0290549   4.028 0.000299 ***
I(Verb^2)   -0.0011301  0.0001275  -8.866 2.32e-10 ***
I(Math^2)   -0.0010630  0.0001733  -6.135 5.76e-07 ***
Verb:Math    0.0008780  0.0001565   5.611 2.76e-06 ***

The following line tells R to fit a linear model (hence lm()) where Gpa is modeled as a function of the verbal score, the math score, the interaction between the two, and the squares of both scores.

model <- lm(Gpa ~ Verb*Math +I(Verb^2) + I(Math^2), data = collgpa)

Specifically, your model is GPA = -7.22 + 0.126 x Verb + 0.117 x Math - 0.00113 x Verb^2 - 0.00106 x Math^2 + 0.000878 x Math x Verb. Although it only uses Math and Verb as input variables, because of the ^2 terms and the interaction, your resulting linear model has 5 coefficients plus the intercept.
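
To see how the six coefficients combine into a prediction, here is a minimal sketch. It assumes a simulated data frame (toy, a hypothetical stand-in because collgpa is not available here), so the numbers will not match the output above; only the structure of the calculation carries over.

```r
# Simulated stand-in for collgpa (hypothetical data, for illustration only)
set.seed(1)
toy <- data.frame(Verb = runif(40, 40, 90), Math = runif(40, 40, 90))
toy$Gpa <- 1 + 0.02 * toy$Verb + 0.01 * toy$Math + rnorm(40, sd = 0.2)

# Same formula as in the question
fit <- lm(Gpa ~ Verb * Math + I(Verb^2) + I(Math^2), data = toy)
b <- coef(fit)

# Prediction for one new student, built term by term from the coefficients
new <- data.frame(Verb = 70, Math = 65)
by_hand <- b[["(Intercept)"]] +
  b[["Verb"]] * 70 + b[["Math"]] * 65 +
  b[["I(Verb^2)"]] * 70^2 + b[["I(Math^2)"]] * 65^2 +
  b[["Verb:Math"]] * 70 * 65

# predict() should agree with the hand calculation up to rounding error
print(all.equal(by_hand, unname(predict(fit, new))))
```

The names used to index b are the same labels that summary() prints in the Coefficients table.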

The I() around Verb^2 and Math^2 tells R to evaluate those expressions arithmetically, so that the squared scores enter the model as separate predictors.

Is that what you're asking?

Thank you Bloosmore, but I don't understand.

Shouldn't I explicitly include Verb and Math in the formula?

And what does I(Verb^2) do that Verb^2 would not? Why can't I just include Verb^2?

When you include an interaction term written with *, lm() automatically includes the separate level (main-effect) terms. ^ has a special meaning in a formula; that's why it needs to be inside I().

Just to be more explicit on that first point: lm() expands Verb*Math so as to include three terms: Verb, Math and Verb x Math (shown as Verb:Math in the output).
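
You can check this expansion without fitting anything by asking R which terms it derives from a formula. This is a small sketch using terms() from base R; the variable names come from the question, and no data are needed because only the formula is inspected.

```r
# Terms R derives from the shorthand Verb * Math
expanded <- attr(terms(Gpa ~ Verb * Math), "term.labels")
print(expanded)
# [1] "Verb"      "Math"      "Verb:Math"

# Writing the three terms out explicitly gives the same set
explicit <- attr(terms(Gpa ~ Verb + Math + Verb:Math), "term.labels")
print(identical(expanded, explicit))
# [1] TRUE
```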

And as @startz said, the ^ character has a different meaning than "to the power of" when used within a model formula such as the one passed to lm(): there it denotes crossing of terms. So, if you want to include the squared term, it needs to be within I(), the so-called "AsIs" function. That preserves the intended arithmetic meaning. See ?I for more details.
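
The difference is easy to see by inspecting the terms each formula produces. This is a sketch; labels_for is a hypothetical helper defined here only for illustration, wrapping terms() from base R.

```r
# Hypothetical helper: list the terms R derives from a formula
labels_for <- function(f) attr(terms(f), "term.labels")

# Bare ^ means crossing, so Verb^2 collapses back to plain Verb
caret_bare <- labels_for(Gpa ~ Verb^2)
print(caret_bare)
# [1] "Verb"

# On a sum, ^2 gives all main effects plus two-way interactions
caret_sum <- labels_for(Gpa ~ (Verb + Math)^2)
print(caret_sum)
# [1] "Verb"      "Math"      "Verb:Math"

# Inside I(), ^ is ordinary arithmetic and the square is kept as its own term
caret_asis <- labels_for(Gpa ~ I(Verb^2))
print(caret_asis)
# [1] "I(Verb^2)"
```

So Gpa ~ Verb^2 without I() would silently fit a model with Verb alone, which is why the squared term never shows up unless it is wrapped.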

You might find this information about formula notation in R useful: https://faculty.chicagobooth.edu/richard.hahn/teaching/formulanotation.pdf

This also has some information on the matter: meetup-presentations_rtp/Slides.pdf at master · rladies/meetup-presentations_rtp · GitHub

Ah, I definitely did not know ^ has a unique meaning inside a formula. Thanks to all.
