Linear Regression With 3 Levels of IV only Outputting Result for 2 Levels

StatsWiz · December 2, 2019, 12:40am

I'm struggling to get the summary() function to show me the output of my regression for all 3 levels of my one IV in RStudio. I've tried everything. How do I get results for all levels of the IV using the lm() and summary() functions?

FJCC · December 2, 2019, 12:45am

I think what you are seeing are regression results with the "oldest" category being used as the baseline. It is alphabetically the first category and therefore the first factor level. The estimated coefficient for singleton is the offset between that level and oldest. Does that make sense?

FJCC · December 2, 2019, 12:48am

An example where the levels are A, B and C. A is not in the lm output.

DF <- data.frame(Name = sample(LETTERS[1:3], 99, replace = TRUE), Value = runif(99, 0, 100))
lm(Value ~ Name, data = DF)
#> 
#> Call:
#> lm(formula = Value ~ Name, data = DF)
#> 
#> Coefficients:
#> (Intercept)        NameB        NameC  
#>      53.004        1.605      -11.670

Created on 2019-12-01 by the reprex package (v0.3.0.9000)

StatSteph · December 2, 2019, 1:14am

There is nothing to fix. You can think of the β estimate for the oldest group as being 0. Compared to oldest siblings, singletons have 2.98 lower HEI on average, and youngest siblings have 2.67 lower HEI on average.
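To make that interpretation concrete, here is a minimal sketch on simulated data. The data frame `dat`, the column names `sib` and `hei`, and all the numbers are made up for illustration and are not the original poster's data. It only shows that, with the default treatment coding, the intercept is the baseline group's mean and each other coefficient is that group's mean minus the baseline mean.

```r
# Simulated data: HEI values and group sizes are invented for illustration
set.seed(42)
sib <- factor(rep(c("oldest", "singleton", "youngest"), each = 30))
hei <- rnorm(90, mean = c(60, 57, 57.5)[as.integer(sib)], sd = 5)
dat <- data.frame(sib = sib, hei = hei)

fit <- lm(hei ~ sib, data = dat)

# Plain named vector of per-group means
group_means <- sapply(split(dat$hei, dat$sib), mean)

# Intercept = mean of the baseline level ("oldest", first alphabetically)
coef(fit)[["(Intercept)"]]
group_means[["oldest"]]

# Each remaining coefficient = that group's mean minus the baseline mean
coef(fit)[c("sibsingleton", "sibyoungest")]
group_means[c("singleton", "youngest")] - group_means[["oldest"]]
```

Each pair of printed values should agree, which is why the baseline level has no row of its own in the summary() table.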

StatsWiz · December 2, 2019, 1:20am

Hey Steph, thank you for taking the time to explain, that makes sense! A better way to frame my question would be: is there a way to set the baseline to one of the other levels rather than "oldest"? I've been looking into the grepl function?

FJCC · December 2, 2019, 2:34am

You can use the factor function to set the order of the levels in your data. Here is an example where at first the levels are in the order A, B, C and then are reset to C, B, A. The second lm() uses C as the baseline.

set.seed(1)
DF <- data.frame(Name = sample(LETTERS[1:3], 99, replace = TRUE), Value = runif(99, 0, 100))
lm(Value ~ Name, data = DF)
#> 
#> Call:
#> lm(formula = Value ~ Name, data = DF)
#> 
#> Coefficients:
#> (Intercept)        NameB        NameC  
#>      59.662      -12.505       -9.107
# Reset the levels of Name to have the order C, B, A
DF$Name <- factor(DF$Name, levels = c("C", "B", "A"))
lm(Value ~ Name, data = DF)
#> 
#> Call:
#> lm(formula = Value ~ Name, data = DF)
#> 
#> Coefficients:
#> (Intercept)        NameB        NameA  
#>      50.555       -3.398        9.107

Created on 2019-12-01 by the reprex package (v0.3.0.9000)
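If only the baseline needs to change, base R's relevel() may be a shorter route than listing every level. The sketch below reuses the toy DF from this thread; the object name `fit_c` is just for illustration. Note that relevel() moves the chosen level to the front and leaves the others in their existing order, so the result is C, A, B rather than C, B, A.

```r
set.seed(1)
DF <- data.frame(Name = sample(LETTERS[1:3], 99, replace = TRUE),
                 Value = runif(99, 0, 100))

# relevel() needs a factor; since R 4.0 data.frame() keeps strings as character
DF$Name <- factor(DF$Name)

# Move "C" to the front; the remaining levels keep their order
DF$Name <- relevel(DF$Name, ref = "C")
levels(DF$Name)     # "C" "A" "B"

fit_c <- lm(Value ~ Name, data = DF)
names(coef(fit_c))  # "(Intercept)" "NameA" "NameB"
```

The coefficients for NameA and NameB are now offsets from level C.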

valeri · December 2, 2019, 12:04pm

Alternatively, you could turn off the intercept (which by default absorbs the baseline, i.e. the first level of the factor) and get a coefficient for each of the three levels like this:

set.seed(1)
DF <- data.frame(Name = sample(LETTERS[1:3], 99, replace = TRUE), Value = runif(99, 0, 100))
lm(Value ~ Name, data = DF)
#> 
#> Call:
#> lm(formula = Value ~ Name, data = DF)
#> 
#> Coefficients:
#> (Intercept)        NameB        NameC  
#>      46.619        7.257        7.877

#DF$Name <- factor(DF$Name,levels = c("C", "B", "A"))
lm(Value ~ Name -1, data = DF)
#> 
#> Call:
#> lm(formula = Value ~ Name - 1, data = DF)
#> 
#> Coefficients:
#> NameA  NameB  NameC  
#> 46.62  53.88  54.50

Created on 2019-12-02 by the reprex package (v0.3.0)

You can now see the relationship between the two ways of doing it. With the intercept, the value of the intercept (46.62) corresponds to the first factor level (A). The intercept plus the coefficient on NameB, 46.62 + 7.26 = 53.88, equals the NameB coefficient in the regression without an intercept, and analogously for NameC (46.62 + 7.88 = 54.50).
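The same relationship can be checked programmatically rather than by eye. This is a small sketch using the toy DF; the names `fit_int`, `fit_noint` and `rebuilt` are illustrative only. The printed coefficients depend on the R version, because sample() changed in R 3.6, but the identities hold regardless.

```r
set.seed(1)
DF <- data.frame(Name = sample(LETTERS[1:3], 99, replace = TRUE),
                 Value = runif(99, 0, 100))

fit_int   <- lm(Value ~ Name, data = DF)      # baseline (treatment) coding
fit_noint <- lm(Value ~ Name - 1, data = DF)  # one coefficient per level

# Rebuild the per-level coefficients from the intercept model
b <- coef(fit_int)
rebuilt <- c(NameA = b[["(Intercept)"]],
             NameB = b[["(Intercept)"]] + b[["NameB"]],
             NameC = b[["(Intercept)"]] + b[["NameC"]])

all.equal(rebuilt, coef(fit_noint))            # TRUE
all.equal(fitted(fit_int), fitted(fit_noint))  # TRUE
```

Both models give identical fitted values; they are the same fit written in two parameterizations. One caveat: summary() reports a different R-squared for the no-intercept fit, because it is then computed about zero rather than about the mean of Value.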

The reason you can't have both an intercept and all three factor levels is that the intercept is represented as a column of ones in the regressor matrix. Each observation falls into exactly one of the three categories (A, B or C), which are represented as 3 dummy-variable (0/1) columns, so those three columns sum to the column of ones. A model with an intercept and all three dummy variables therefore has a linear dependency among its columns. That is not allowed because the X'X matrix then has reduced rank and is not invertible (and invertibility is a requirement for a unique least squares solution).
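The linear dependency can be seen directly by building that regressor matrix by hand. This is a sketch under the same toy setup; `dummies`, `X` and `fit_all` are illustrative names. It relies on lm() dropping an aliased column and reporting its coefficient as NA.

```r
set.seed(1)
DF <- data.frame(Name = sample(LETTERS[1:3], 99, replace = TRUE),
                 Value = runif(99, 0, 100))
DF$Name <- factor(DF$Name)

# One 0/1 column per level, then a column of ones in front
dummies <- model.matrix(~ Name - 1, data = DF)  # columns NameA, NameB, NameC
X <- cbind(Intercept = 1, dummies)              # 99 x 4

all(rowSums(dummies) == 1)  # TRUE: the dummies add up to the intercept column
qr(X)$rank                  # 3, although X has 4 columns

# Forcing all four columns into lm(): one coefficient cannot be estimated
fit_all <- lm(DF$Value ~ X - 1)
coef(fit_all)               # one entry is NA
```

The NA is lm()'s way of signalling that one column is redundant, which is the same constraint the default baseline coding resolves by omitting the first level.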
