Using sparse.model.matrix from the Matrix package, you can get dummy variables (now more trendily called one-hot encoding) for factor or factor-like columns of a data frame.
I found some useful commentary on Stack Exchange:
When you have "K" dummy variables then your resulting model will have a.) the intercept term (which is a column of ones) and b.) "K-1" additional columns. The reason is because otherwise the columns of the resulting matrix would not be linearly independent (and, as a result, you wouldn't be able to do OLS). – Steve S Oct 1 '15 at 5:34
Skip a few, then:
@SteveS: In fact R's so friendly that if you try remove the intercept (-1) when you have a single categorical predictor represented as a factor (as in this question), it'll assume you don't really mean that & switch to using sum-to-zero coding; which is of course just a different parametrization. Too friendly, if you ask me. – Scortchi♦ Oct 1 '15 at 8:56
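To convince myself of the linear-independence point in the first comment, here is a minimal sketch in base R. The toy factor f and the matrix names are made up for illustration, and it uses model.matrix rather than sparse.model.matrix to keep the rank check simple. It builds all K indicator columns, adds an all-ones intercept column by hand, and compares the column count with the rank.

```r
# Hypothetical toy factor with K = 3 levels (not the data from my reprex)
f <- factor(c("a", "b", "c", "a", "b", "c"))

# With a single factor and no intercept, model.matrix() returns
# one indicator column per level: fa, fb, fc
full <- model.matrix(~ f - 1)

# Add an explicit all-ones intercept column next to the K indicators
X <- cbind(Intercept = 1, full)

ncol(X)     # 4 columns
qr(X)$rank  # rank 3, because Intercept = fa + fb + fc
```

The rank is one less than the number of columns, which is the redundancy the comment describes. It only matters when fitting something like OLS, which is not what I am doing.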
My purpose is not regression, and I want the full set of dummy variables, without the inserted intercept column. Can anyone tell me how to do that with sparse.model.matrix?
Here's a reprex illustrating the two options. Da (the indicator column for level "a" of D) is missing from the final example.
library(Matrix)
library(magrittr)
# Two numeric variables
# Two factor-like (character) variables, three levels each (k = 6 levels in total)
df <- data.frame(
stringsAsFactors = FALSE,
A = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
B = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
C = c("a", "a", "a", "b", "b", "b", "c", "c", "c"),
D = c("a", "b", "c", "a", "b", "c", "a", "b", "c")
)
str(df)
#> 'data.frame': 9 obs. of 4 variables:
#> $ A: num 1 1 1 2 2 2 3 3 3
#> $ B: num 1 2 3 1 2 3 1 2 3
#> $ C: chr "a" "a" "a" "b" ...
#> $ D: chr "a" "b" "c" "a" ...
# All-ones intercept column inserted
# k-2 = 4 dummy variables (Ca and Da are dropped)
df %>% sparse.model.matrix(~., .)
#> 9 x 7 sparse Matrix of class "dgCMatrix"
#>   (Intercept) A B Cb Cc Db Dc
#> 1           1 1 1  .  .  .  .
#> 2           1 1 2  .  .  1  .
#> 3           1 1 3  .  .  .  1
#> 4           1 2 1  1  .  .  .
#> 5           1 2 2  1  .  1  .
#> 6           1 2 3  1  .  .  1
#> 7           1 3 1  .  1  .  .
#> 8           1 3 2  .  1  1  .
#> 9           1 3 3  .  1  .  1
# No intercept column inserted
# k-1 = 5 dummy variables (Da is still dropped)
df %>% sparse.model.matrix(~.-1, .)
#> 9 x 7 sparse Matrix of class "dgCMatrix"
#>   A B Ca Cb Cc Db Dc
#> 1 1 1  1  .  .  .  .
#> 2 1 2  1  .  .  1  .
#> 3 1 3  1  .  .  .  1
#> 4 2 1  .  1  .  .  .
#> 5 2 2  .  1  .  1  .
#> 6 2 3  .  1  .  .  1
#> 7 3 1  .  .  1  .  .
#> 8 3 2  .  .  1  1  .
#> 9 3 3  .  .  1  .  1
Created on 2019-01-15 by the reprex package (v0.2.1)