Linear Regression With 3 Levels of IV only Outputting Result for 2 Levels

I'm struggling to get the summary() function to show me the output of my regression for all 3 levels of my one IV in RStudio. I've tried everything. How do I get results for all levels of the IV using the lm() and summary() functions?

I think what you are seeing are regression results with the "oldest" category being used as the baseline. It is alphabetically the first category and therefore the first factor level, which lm() uses as the reference. The estimate reported for singleton is the offset between that level and oldest. Does that make sense?

Here is an example where the levels are A, B and C. A is the baseline, so it does not appear in the lm output.

DF <- data.frame(Name = sample(LETTERS[1:3], 99, replace = TRUE), Value = runif(99, 0, 100))
lm(Value ~ Name, data = DF)
#> 
#> Call:
#> lm(formula = Value ~ Name, data = DF)
#> 
#> Coefficients:
#> (Intercept)        NameB        NameC  
#>      53.004        1.605      -11.670

Created on 2019-12-01 by the reprex package (v0.3.0.9000)

There is nothing to fix. You can think of the \beta estimate for the oldest group as being 0. Compared to oldest siblings, singletons have an HEI that is 2.98 lower, on average, and youngest siblings have an HEI that is 2.67 lower.
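
If you want an estimate for every level, baseline included, one option is to ask the fitted model for a prediction at each level. This is only a sketch on made-up data: the data frame DF and the levels A, B and C are stand-ins for your HEI data, which I don't have.

```r
# Hypothetical data standing in for the original data set
set.seed(42)
DF <- data.frame(
  Name  = factor(sample(c("A", "B", "C"), 99, replace = TRUE)),
  Value = runif(99, 0, 100)
)

fit <- lm(Value ~ Name, data = DF)

# One row per factor level, including the baseline level "A"
new_levels <- data.frame(Name = factor(levels(DF$Name), levels = levels(DF$Name)))
est <- predict(fit, newdata = new_levels)
names(est) <- levels(DF$Name)
est

# With a single categorical predictor, these estimates are the group means
group_means <- tapply(DF$Value, DF$Name, mean)
all.equal(unname(est), as.vector(group_means))
```

The estimate for the baseline level is the intercept itself, and each other level is the intercept plus its reported coefficient.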

Hey Steph, thank you for taking the time to explain, that makes sense! A better way to frame my question: is there a way to set the baseline to one of the other levels rather than "oldest"? I've been looking into the grepl function.

You can use the factor function to set the order of the levels in your data. Here is an example where at first the levels are in the order A, B, C and then are reset to C, B, A. The second lm() uses C as the baseline.

set.seed(1)
DF <- data.frame(Name = sample(LETTERS[1:3], 99, replace = TRUE), Value = runif(99, 0, 100))
lm(Value ~ Name, data = DF)
#> 
#> Call:
#> lm(formula = Value ~ Name, data = DF)
#> 
#> Coefficients:
#> (Intercept)        NameB        NameC  
#>      59.662      -12.505       -9.107
#Reset the levels of Name to have the order C, B, A
DF$Name <- factor(DF$Name,levels = c("C", "B", "A"))
lm(Value ~ Name, data = DF)
#> 
#> Call:
#> lm(formula = Value ~ Name, data = DF)
#> 
#> Coefficients:
#> (Intercept)        NameB        NameA  
#>      50.555       -3.398        9.107

Created on 2019-12-01 by the reprex package (v0.3.0.9000)
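
If you only need to change the baseline and don't care about the order of the other levels, relevel() from the stats package is a shorter route. A minimal sketch on the same simulated data (the name fit_c is just for illustration):

```r
set.seed(1)
DF <- data.frame(Name = sample(LETTERS[1:3], 99, replace = TRUE),
                 Value = runif(99, 0, 100))

# relevel() needs a factor, so convert the column first
DF$Name <- relevel(factor(DF$Name), ref = "C")
levels(DF$Name)   # "C" moves to the front; the other levels keep their order

fit_c <- lm(Value ~ Name, data = DF)
names(coef(fit_c))   # the intercept now represents level C
```

Unlike factor(..., levels = ...), this leaves the remaining levels in their existing order, so the output lists NameA and then NameB.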


Alternatively, you could turn off the intercept (which by default subsumes the baseline first level in the factor) and get all three effects like this:

set.seed(1)
DF <- data.frame(Name = sample(LETTERS[1:3], 99, replace = TRUE), Value = runif(99, 0, 100))
lm(Value ~ Name, data = DF)
#> 
#> Call:
#> lm(formula = Value ~ Name, data = DF)
#> 
#> Coefficients:
#> (Intercept)        NameB        NameC  
#>      46.619        7.257        7.877

#DF$Name <- factor(DF$Name,levels = c("C", "B", "A"))
lm(Value ~ Name -1, data = DF)
#> 
#> Call:
#> lm(formula = Value ~ Name - 1, data = DF)
#> 
#> Coefficients:
#> NameA  NameB  NameC  
#> 46.62  53.88  54.50

Created on 2019-12-02 by the reprex package (v0.3.0)

You can now see the relationship between the two ways of doing it. With the intercept, the value of the intercept (46.62) is the estimate for the first factor level (A). Adding the NameB coefficient to the intercept gives 46.62 + 7.26 = 53.88, which is the NameB coefficient in the regression without an intercept, and likewise for NameC: 46.62 + 7.88 = 54.50.
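
The same arithmetic can be checked in code rather than by hand. This is a sketch with illustrative object names (fit_int, fit_noint, rebuilt); the exact numbers you get from set.seed(1) can depend on your R version, but the identity should hold either way.

```r
set.seed(1)
DF <- data.frame(Name = sample(LETTERS[1:3], 99, replace = TRUE),
                 Value = runif(99, 0, 100))

fit_int   <- lm(Value ~ Name, data = DF)      # baseline coding, with intercept
fit_noint <- lm(Value ~ Name - 1, data = DF)  # one coefficient per level

b <- coef(fit_int)

# The intercept alone is level A; the other levels add their offset to it
rebuilt <- c(NameA = b[["(Intercept)"]],
             NameB = b[["(Intercept)"]] + b[["NameB"]],
             NameC = b[["(Intercept)"]] + b[["NameC"]])

all.equal(rebuilt, coef(fit_noint))
```

One thing to keep in mind: the two fits give the same fitted values, but summary() tests different hypotheses. With the intercept, each t-test compares a level to the baseline; without it, each t-test compares a level's mean to zero.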

The reason you can't have both an intercept and all three factor levels is that the intercept is represented as a column of ones in the regressor matrix. Each observation falls into exactly one of the three categories (A, B or C), which are represented as three dummy variable (0/1) columns, so those three columns add up to the column of ones. A model with an intercept and all three dummy variables therefore contains a linear dependency, which is not allowed: the X'X matrix has reduced rank and is not invertible, and invertibility is a requirement for a non-degenerate linear least squares model.
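
To see the dependency directly, you can build that regressor matrix by hand and check its rank. This is a sketch, with the names dummies and X chosen only for illustration:

```r
set.seed(1)
DF <- data.frame(Name = sample(LETTERS[1:3], 99, replace = TRUE),
                 Value = runif(99, 0, 100))
DF$Name <- factor(DF$Name)

# Three 0/1 dummy columns, one per level (no intercept column)
dummies <- model.matrix(~ Name - 1, data = DF)

# Full design: a column of ones plus all three dummies
X <- cbind(Intercept = 1, dummies)

# Every row has exactly one dummy switched on, so the dummies sum to the ones column
all(rowSums(dummies) == 1)

# Four columns but only three linearly independent ones, so X'X is singular
ncol(X)
qr(X)$rank
```

Dropping any one of the four columns restores full rank, which is what both approaches above do: the default coding drops the first dummy, and the "- 1" formula drops the intercept.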