How to add the reference variable in the linear model?

I am new to this topic. I only learnt about it today, and I am having trouble understanding the concept below.

I have a variable with 3 levels or categories (Rural, Urban, Semi-Urban).

To convert the categorical variable into numeric form, I use dummy variables.

I understand the concept of creating k-1 dummy variables, where k is the number of levels or categories.

So I create 2 dummy variables (Rural and Urban). The observations that have 0 in both dummy variables belong to the third level or category (Semi-Urban).

Now, how do I create a linear model (glm) using the third level or category?

I created a model with the first and second dummy variables (Rural and Urban) and found that the variable is significant in the model.

If I want to know the significance of Semi-Urban, what should I do? How do I include it?

The problem with your request is that you want to know if a category is significant by itself. That isn't a meaningful concept. A level of a categorical variable is only significant relative to another level. This is similar to the idea that a continuous variable is not significant at a single value; it is the change in the variable that is significant.
In the example below, I invent some data where Value has one distribution when Name is A and another distribution when Name is either B or C. When I do a regression with A as the baseline, both B and C are significant. When I change the baseline Name to B, A is significant but C is not. This does not mean that C has suddenly lost its significance. It just means that switching from B to C does not produce a detectable change in the mean of Value. This was already apparent in the original fit, where you can see that the Estimates of B and C are very close. Both fits show, in different ways, that B and C are different from A but not from each other.

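# Invented data: A is drawn from N(0, 0.2); B and C share N(1, 0.5).
# No seed is set, so the exact numbers will differ from run to run.
DF <- data.frame(Name = rep(c("A", "B", "C"), each = 50),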
                 Value = c(rnorm(50, 0, 0.2), rnorm(100, 1, 0.5)))
summary(lm(Value ~ Name, data = DF))
#> 
#> Call:
#> lm(formula = Value ~ Name, data = DF)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -1.52319 -0.23338 -0.00452  0.24193  1.51990 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  0.01283    0.06701   0.191    0.848    
#> NameB        0.97715    0.09476  10.312   <2e-16 ***
#> NameC        1.07792    0.09476  11.375   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.4738 on 147 degrees of freedom
#> Multiple R-squared:  0.5179, Adjusted R-squared:  0.5113 
#> F-statistic: 78.95 on 2 and 147 DF,  p-value: < 2.2e-16

# Reorder the levels so that B comes first and becomes the baseline
DF$Name <- factor(DF$Name, levels = c("B", "C", "A"))
summary(lm(Value ~ Name, data = DF))
#> 
#> Call:
#> lm(formula = Value ~ Name, data = DF)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -1.52319 -0.23338 -0.00452  0.24193  1.51990 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  0.98998    0.06701  14.775   <2e-16 ***
#> NameC        0.10077    0.09476   1.063    0.289    
#> NameA       -0.97715    0.09476 -10.312   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.4738 on 147 degrees of freedom
#> Multiple R-squared:  0.5179, Adjusted R-squared:  0.5113 
#> F-statistic: 78.95 on 2 and 147 DF,  p-value: < 2.2e-16

Created on 2022-06-11 by the reprex package (v2.0.1)
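To connect this back to your Rural / Semi-Urban / Urban variable, here is a minimal sketch of the same idea. The data frame `dat`, the column names `Area` and `Income`, and the simulated numbers are all made up for illustration; substitute your own data. It shows that R builds the k-1 dummy columns for you when the predictor is a factor, that `relevel()` lets you pick Semi-Urban as the baseline so the other two levels are compared against it, and that an F test on the whole factor answers the question "does this variable matter at all?" without depending on which baseline you chose.

```r
# Hypothetical data: 'dat', 'Area' and 'Income' are invented for illustration
set.seed(42)
dat <- data.frame(
  Area   = factor(rep(c("Rural", "Semi-Urban", "Urban"), each = 40)),
  Income = c(rnorm(40, 10, 2), rnorm(40, 12, 2), rnorm(40, 15, 2))
)

# The first factor level is the baseline; here that is "Rural"
levels(dat$Area)

# R creates the k - 1 dummy columns itself; no manual coding is needed
head(model.matrix(~ Area, data = dat), 3)

# glm() uses the gaussian family by default, i.e. an ordinary linear model
fit_rural <- glm(Income ~ Area, data = dat)

# Make Semi-Urban the baseline: the coefficients now compare
# Rural and Urban against Semi-Urban
dat$Area2 <- relevel(dat$Area, ref = "Semi-Urban")
fit_semi  <- glm(Income ~ Area2, data = dat)
summary(fit_semi)$coefficients

# Test the factor as a whole; this does not depend on the baseline
anova(fit_rural, test = "F")
```

Both fits are the same model written with different contrasts: the fitted values and residual deviance are identical, and only the meaning of each coefficient changes.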
