How to get a full set of dummy-variables

Using sparse.model.matrix from the Matrix package you can get dummy-variables (now more trendily called one-hot encoding) for factor or factor-like columns of a data frame.

I found some useful commentary on Stack Exchange:

When you have "K" dummy variables then your resulting model will have a.) the intercept term (which is a column of ones) and b.) "K-1" additional columns. The reason is because otherwise the columns of the resulting matrix would not be linearly independent (and, as a result, you wouldn't be able to do OLS). – Steve S Oct 1 '15 at 5:34
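A minimal base R sketch of the linear-dependence point, under my own assumptions: the factor `f` is a made-up toy example, and the indicator columns are built by hand with `outer()` rather than by any modelling function. It compares the rank of "intercept plus all K indicators" with the default coding.

```r
# Hypothetical toy factor with K = 3 levels
f <- factor(c("a", "b", "c", "a", "b", "c"))

# Hand-built indicator columns, one per level (0/1 values)
ind <- outer(as.character(f), levels(f), "==") + 0
colnames(ind) <- levels(f)

# Intercept plus all K indicators: 4 columns
X_full <- cbind(Intercept = 1, ind)

# Default treatment coding: intercept plus K - 1 columns
X_trt <- model.matrix(~ f)

# The indicator columns sum to the intercept column,
# so the 4-column matrix cannot have rank 4
rank_full <- qr(X_full)$rank
rank_trt <- qr(X_trt)$rank
print(c(ncol_full = ncol(X_full), rank_full = rank_full, rank_trt = rank_trt))
```

Since a rank-deficient matrix only matters for fitting a model, this should not be an obstacle when the matrix is wanted for something other than regression.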

Skip a few, then:

@SteveS: In fact R's so friendly that if you try remove the intercept -1 when you have a single categorical predictor represented as a factor (as in this question), it'll assume you don't really mean that & switch to using sum-to-zero coding; which is of course just a different parametrization. Too friendly, if you ask me. – Scortchi♦ Oct 1 '15 at 8:56

My purpose is not regression and I want to get the full set of dummy-variables, without the inserted intercept variable. Can anyone tell me how to do that with sparse.model.matrix?

Here's a reprex illustrating the two options. With the intercept, both Ca and Da are dropped; without it, Ca appears but Da is still missing.

library(Matrix)
library(magrittr)

# Two numeric variables
# Two factor-like variables, three levels each (k=6 levels in total)
df <- data.frame(
    stringsAsFactors = FALSE,
    A = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
    B = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
    C = c("a", "a", "a", "b", "b", "b", "c", "c", "c"),
    D = c("a", "b", "c", "a", "b", "c", "a", "b", "c")
)
str(df)
#> 'data.frame':    9 obs. of  4 variables:
#>  $ A: num  1 1 1 2 2 2 3 3 3
#>  $ B: num  1 2 3 1 2 3 1 2 3
#>  $ C: chr  "a" "a" "a" "b" ...
#>  $ D: chr  "a" "b" "c" "a" ...

# All-ones intercept variable inserted
# k-2 dummy-variables (Ca and Da are dropped)
df %>% sparse.model.matrix(~., .)
#> 9 x 7 sparse Matrix of class "dgCMatrix"
#>   (Intercept) A B Cb Cc Db Dc
#> 1           1 1 1  .  .  .  .
#> 2           1 1 2  .  .  1  .
#> 3           1 1 3  .  .  .  1
#> 4           1 2 1  1  .  .  .
#> 5           1 2 2  1  .  1  .
#> 6           1 2 3  1  .  .  1
#> 7           1 3 1  .  1  .  .
#> 8           1 3 2  .  1  1  .
#> 9           1 3 3  .  1  .  1

# No intercept variable inserted
# k-1 dummy-variables (Da is still dropped)
df %>% sparse.model.matrix(~.-1, .)
#> 9 x 7 sparse Matrix of class "dgCMatrix"
#>   A B Ca Cb Cc Db Dc
#> 1 1 1  1  .  .  .  .
#> 2 1 2  1  .  .  1  .
#> 3 1 3  1  .  .  .  1
#> 4 2 1  .  1  .  .  .
#> 5 2 2  .  1  .  1  .
#> 6 2 3  .  1  .  .  1
#> 7 3 1  .  .  1  .  .
#> 8 3 2  .  .  1  1  .
#> 9 3 3  .  .  1  .  1

Created on 2019-01-15 by the reprex package (v0.2.1)
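As far as I can tell, when the intercept is removed only the first factor in the formula keeps all of its levels, which would explain why Ca shows up but Da does not. A small sketch to check this, using base R's `model.matrix` as a stand-in for `sparse.model.matrix` and a made-up data frame `toy` that holds the same C and D columns as factors:

```r
# Same C and D values as in the reprex, stored as factors
toy <- data.frame(
    C = factor(c("a", "a", "a", "b", "b", "b", "c", "c", "c")),
    D = factor(c("a", "b", "c", "a", "b", "c", "a", "b", "c"))
)

# Swap the order of the factors in the formula and compare column names
cols_c_first <- colnames(model.matrix(~ C + D - 1, toy))
cols_d_first <- colnames(model.matrix(~ D + C - 1, toy))

print(cols_c_first)
print(cols_d_first)
```

Whichever factor comes first gets its full set of levels; the other one loses its first level. So reordering the formula alone cannot give the full set for both.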


I found the answer in a Stack Overflow post (NB: mind the date of that post):

You need to reset the contrasts for the factor variables.

X.factors =
  model.matrix(~ ., data = X,
    contrasts.arg = lapply(
      data.frame(X[, sapply(data.frame(X), is.factor)]),
      contrasts, contrasts = FALSE))
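To see what "resetting the contrasts" means here, a small base R sketch may help. The factor `f` is a made-up example; `contrasts()` and its `contrasts = FALSE` argument are the same ones used in the quoted snippet.

```r
# Hypothetical three-level factor
f <- factor(c("a", "b", "c"))

# Default (treatment) coding: one row per level, one column per
# non-reference level, so the first level has no column of its own
default_coding <- contrasts(f)

# With contrasts = FALSE: an identity matrix, one column per level
full_coding <- contrasts(f, contrasts = FALSE)

print(default_coding)
print(full_coding)
```

Passing the second kind of matrix through `contrasts.arg` is what keeps every level as its own column.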

Here's my updated reprex:

library(Matrix)
library(magrittr)

df <- data.frame(
    stringsAsFactors = FALSE,
    A = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
    B = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
    C = c("a", "a", "a", "b", "b", "b", "c", "c", "c"),
    D = c("a", "b", "c", "a", "b", "c", "a", "b", "c")) 
df
#>   A B C D
#> 1 1 1 a a
#> 2 1 2 a b
#> 3 1 3 a c
#> 4 2 1 b a
#> 5 2 2 b b
#> 6 2 3 b c
#> 7 3 1 c a
#> 8 3 2 c b
#> 9 3 3 c c

# We want all character as factor
# so let's do a generic conversion
df %<>%
    type.convert(as.is = FALSE)

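# One-hot with reset contrasts
# contrasts = FALSE gives one indicator column per level;
# drop = FALSE keeps a data frame even if there is only one factor column
df %<>% {
    sparse.model.matrix(~ . - 1, .,
        drop.unused.levels = TRUE,
        contrasts.arg = lapply(.[, sapply(., is.factor), drop = FALSE],
                               contrasts, contrasts = FALSE))}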
df
#> 9 x 8 sparse Matrix of class "dgCMatrix"
#>   A B Ca Cb Cc Da Db Dc
#> 1 1 1  1  .  .  1  .  .
#> 2 1 2  1  .  .  .  1  .
#> 3 1 3  1  .  .  .  .  1
#> 4 2 1  .  1  .  1  .  .
#> 5 2 2  .  1  .  .  1  .
#> 6 2 3  .  1  .  .  .  1
#> 7 3 1  .  .  1  1  .  .
#> 8 3 2  .  .  1  .  1  .
#> 9 3 3  .  .  1  .  .  1

Created on 2019-01-16 by the reprex package (v0.2.1)
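For reuse, the steps above could be wrapped in a small function. This is only a sketch: the name `one_hot_sparse` and the data frame `df_demo` are my own inventions, and it assumes every character column should be treated as a factor.

```r
library(Matrix)

# Hypothetical helper: full one-hot encoding of every factor column,
# with no intercept and no dropped reference level
one_hot_sparse <- function(data) {
    # Treat character columns as factors
    data[] <- lapply(data, function(x) if (is.character(x)) factor(x) else x)
    is_fac <- vapply(data, is.factor, logical(1))
    # One identity "contrast" matrix per factor, so no level is dropped
    full <- lapply(data[is_fac], contrasts, contrasts = FALSE)
    sparse.model.matrix(~ . - 1, data,
        contrasts.arg = if (length(full) > 0) full else NULL)
}

# Same data as in the reprex above
df_demo <- data.frame(
    stringsAsFactors = FALSE,
    A = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
    B = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
    C = c("a", "a", "a", "b", "b", "b", "c", "c", "c"),
    D = c("a", "b", "c", "a", "b", "c", "a", "b", "c")
)
m <- one_hot_sparse(df_demo)
print(dim(m))
print(colnames(m))
```

Indexing with `data[is_fac]` keeps a list of columns even when there is a single factor, which avoids the vector-dropping that `data[, is_fac]` would cause.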
