How to do regression on categorical variables

Hi, can someone please guide me, or provide code, on how to do regression with categorical variables?
Some say we have to introduce dummy variables into the dataset. Could someone kindly show a newbie like me how to approach this problem?

A rule of thumb is that an ordered variable with 12 or more categories can often be treated as continuous, in which case it does not require dummy variables. An unordered (nominal) variable still needs them, however many categories it has. It is also important to be aware that continuous, binary and categorical response variables each call for a different kind of model.
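As a rough illustration of the point about response types, here is a minimal sketch using the built-in mtcars data. The choice of dataset and of `wt` as the predictor is an assumption made only for the example, and `nnet::multinom()` is one option among several for a multi-category response.

```r
# Continuous response (mpg): ordinary least squares
fit_cont <- lm(mpg ~ wt, data = mtcars)

# Binary response (am is coded 0/1): logistic regression
fit_bin <- glm(am ~ wt, data = mtcars, family = binomial)

# Categorical response with more than two levels (gear: 3, 4, 5):
# a multinomial model, here via the nnet package if it is installed
if (requireNamespace("nnet", quietly = TRUE)) {
  fit_multi <- nnet::multinom(factor(gear) ~ wt, data = mtcars, trace = FALSE)
}

class(fit_cont)
family(fit_bin)$family
```

In all three cases a categorical predictor goes on the right-hand side of the formula in the same way, wrapped in `factor()` if it is stored as a number.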


Thanks for the reply, but could you please provide an example of how to use dummy encoding on my dataset, or a link where I can find more information? I want to see the impact of a categorical variable on my regression model. So far I have only done simple linear regression with two numeric variables.

str(mtcars)
#> 'data.frame':    32 obs. of  11 variables:
#>  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
#>  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
#>  $ disp: num  160 160 108 258 360 ...
#>  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
#>  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
#>  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
#>  $ qsec: num  16.5 17 18.6 19.4 17 ...
#>  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
#>  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
#>  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
#>  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
# /begin comment block =========================================
# cyl, vs, am, gear and carb are numerically encoded categorical
# variables
# 
# let mpg be the response variable and drat the independent variable
# /end
summary(lm(mpg ~ drat, data = mtcars))
#> 
#> Call:
#> lm(formula = mpg ~ drat, data = mtcars)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -9.0775 -2.6803 -0.2095  2.2976  9.0225 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)   -7.525      5.477  -1.374     0.18    
#> drat           7.678      1.507   5.096 1.78e-05 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 4.485 on 30 degrees of freedom
#> Multiple R-squared:  0.464,  Adjusted R-squared:  0.4461 
#> F-statistic: 25.97 on 1 and 30 DF,  p-value: 1.776e-05
# /begin comment block =========================================
# note the F-statistic above
# now add one of the categorical variables as a factor
# /end
summary(lm(mpg ~ drat + factor(am), data = mtcars))
#> 
#> Call:
#> lm(formula = mpg ~ drat + factor(am), data = mtcars)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -9.5802 -2.5206 -0.5153  2.4419  8.5198 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)  
#> (Intercept)   -1.950      7.073  -0.276   0.7848  
#> drat           5.811      2.130   2.728   0.0107 *
#> factor(am)1    2.807      2.282   1.230   0.2286  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 4.448 on 29 degrees of freedom
#> Multiple R-squared:  0.4906, Adjusted R-squared:  0.4554 
#> F-statistic: 13.96 on 2 and 29 DF,  p-value: 5.659e-05
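# note that the F-statistic dropped (25.97 to 13.96); that is not an
# improvement in itself, since the overall F-test now covers two predictors.
# factor(am)1 is not significant (p = 0.2286) and adjusted R-squared
# rises only slightly (0.4461 to 0.4554)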

Created on 2020-08-29 by the reprex package (v0.3.0)
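To see the dummy variables themselves, `model.matrix()` shows the design matrix that `lm()` builds from a formula. The sketch below uses `cyl` as a three-level factor; the choice of `cyl` and the object names (`mt`, `X`, `fit_base`, `fit_cyl`) are assumptions made for illustration, not part of the reprex above. It also compares the two nested models with `anova()`, which tests the added factor directly instead of relying on the overall F-statistics.

```r
# Copy mtcars and store cyl (4, 6, 8) as a factor
mt <- transform(mtcars, cyl = factor(cyl))

# Design matrix: lm() creates one 0/1 dummy column per non-reference level.
# With the default treatment contrasts, cyl = 4 is the reference level.
X <- model.matrix(mpg ~ drat + cyl, data = mt)
head(X)
colnames(X)
#> [1] "(Intercept)" "drat"        "cyl6"        "cyl8"

# Nested models: without and with the categorical predictor
fit_base <- lm(mpg ~ drat, data = mt)
fit_cyl  <- lm(mpg ~ drat + cyl, data = mt)

# Partial F-test: does adding cyl improve the fit beyond drat alone?
anova(fit_base, fit_cyl)
```

The coefficients for `cyl6` and `cyl8` are then read as differences in mean mpg relative to 4-cylinder cars at the same `drat`. There is no need to build the dummy columns by hand; passing a factor in the formula is enough.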


Thanks for the guidance.
