Hi, can someone please guide me, or provide code, on how to do regression with categorical variables?
Some say we have to introduce dummy variables in the dataset. Could someone kindly show a newbie like me how to approach this problem?
A rule of thumb is that an ordered variable with 12 or more categories can be treated as continuous and does not require dummy variables. Also be aware that continuous, binary and categorical response variables each require a different modelling approach.
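To illustrate the point about response types, here is a minimal sketch using the built-in mtcars data. The choice of wt as the predictor and the object names are only for illustration. A response with more than two categories would need a different tool again (for example nnet::multinom()), which is not shown here.

```r
# Continuous response (mpg): ordinary least squares via lm()
fit_continuous <- lm(mpg ~ wt, data = mtcars)

# Binary response (am is coded 0/1): logistic regression via glm()
fit_binary <- glm(am ~ wt, family = binomial, data = mtcars)

# Same predictor, but the coefficients are on different scales:
# mpg units for lm(), log-odds for the binomial glm()
coef(fit_continuous)
coef(fit_binary)
```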
Thanks for the reply, but could you please provide an example of how to use dummy encoding on my dataset, or a link where I can find more information? I want to see the impact of a categorical variable on my regression model. So far I have only done simple linear regression with two numeric variables.
str(mtcars)
#> 'data.frame': 32 obs. of 11 variables:
#> $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
#> $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
#> $ disp: num 160 160 108 258 360 ...
#> $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
#> $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
#> $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
#> $ qsec: num 16.5 17 18.6 19.4 17 ...
#> $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
#> $ am : num 1 1 1 0 0 0 0 0 0 0 ...
#> $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
#> $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
# /begin comment block =========================================
# cyl, vs, am, gear and carb are numerically encoded categorical
# variables
#
# let mpg be the response variable and drat the independent variable
# /end
summary(lm(mpg ~ drat, data = mtcars))
#>
#> Call:
#> lm(formula = mpg ~ drat, data = mtcars)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -9.0775 -2.6803 -0.2095 2.2976 9.0225
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -7.525 5.477 -1.374 0.18
#> drat 7.678 1.507 5.096 1.78e-05 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 4.485 on 30 degrees of freedom
#> Multiple R-squared: 0.464, Adjusted R-squared: 0.4461
#> F-statistic: 25.97 on 1 and 30 DF, p-value: 1.776e-05
# /begin comment block =========================================
# note the F-statistic above
# now add one of the categorical variables as a factor
# /end
summary(lm(mpg ~ drat + factor(am), data = mtcars))
#>
#> Call:
#> lm(formula = mpg ~ drat + factor(am), data = mtcars)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -9.5802 -2.5206 -0.5153 2.4419 8.5198
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -1.950 7.073 -0.276 0.7848
#> drat 5.811 2.130 2.728 0.0107 *
#> factor(am)1 2.807 2.282 1.230 0.2286
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 4.448 on 29 degrees of freedom
#> Multiple R-squared: 0.4906, Adjusted R-squared: 0.4554
#> F-statistic: 13.96 on 2 and 29 DF, p-value: 5.659e-05
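# note that the F-statistic fell (25.97 to 13.96), which is not an improvement;
# adjusted R-squared rose only slightly (0.4461 to 0.4554) and the
# factor(am)1 coefficient is not significant (p = 0.2286)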
Created on 2020-08-29 by the reprex package (v0.3.0)
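If it helps to see the dummy variables themselves, the sketch below shows what lm() builds internally when it is given a factor. It uses cyl rather than am so that there are more than two levels; the object names (dat, X, fit_base, fit_cyl) are just placeholders, and comparing the two models with anova() is one possible way to judge the added variable, not the only one.

```r
# Work on a copy and treat cyl (4, 6 or 8 cylinders) as categorical
dat <- mtcars
dat$cyl <- factor(dat$cyl)

# model.matrix() returns the design matrix lm() would use.
# A factor with k levels becomes k - 1 indicator (0/1) columns;
# the first level (here 4) is the reference level and is
# absorbed into the intercept.
X <- model.matrix(mpg ~ drat + cyl, data = dat)
head(X)
colnames(X)  # (Intercept), drat, cyl6, cyl8

# Fit the model without and with the categorical predictor
fit_base <- lm(mpg ~ drat, data = dat)
fit_cyl  <- lm(mpg ~ drat + cyl, data = dat)

# Partial F-test on the nested models: does adding cyl reduce the
# residual sum of squares by more than chance would explain?
anova(fit_base, fit_cyl)
```

Each cyl coefficient in summary(fit_cyl) is then read as the difference in expected mpg between that level and the reference level (4 cylinders), holding drat fixed.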
Thanks for the guidance.