Factor to one hot encoding (aka dummy variables) using logicals

I had to implement one hot encoding for a factor column today.

I'm not sending this to lm (which would directly accept the factor column) but rather creating a "truth table" for QCA. Anyway, it was much harder than I expected, so I wonder if people could check my approach and recommend a better approach? I looked at the tidymodels and recipes approach discussed on r-bloggers but that seemed too heavyweight for a relatively simple need.

I was inspired by this stackexchange thread (see last answer)

library(tidyverse)
tribble(~student_id, ~subject,
                  1, "Maths",
                  2, "Science",
                  3, "English",
                  4, NA_character_) %>% 
  pivot_wider(names_from = subject, 
              values_from = subject, 
              values_fill = list(subject = F),
              values_fn = list(subject = is.character)) %>% 
  select(-`NA`)
#> # A tibble: 4 x 4
#>   student_id Maths Science English
#>        <dbl> <lgl> <lgl>   <lgl>  
#> 1          1 TRUE  FALSE   FALSE  
#> 2          2 FALSE TRUE    FALSE  
#> 3          3 FALSE FALSE   TRUE   
#> 4          4 FALSE FALSE   FALSE

Created on 2020-06-09 by the reprex package (v0.3.0)

It actually does work with repeated ids as well (below I change the second row of data from 2 to 1 and the output correctly has only three rows.

library(tidyverse)
tribble(~student_id, ~subject,
                  1, "Maths",
                  1, "Science",
                  3, "English",
                  4, NA_character_) %>% 
  pivot_wider(names_from = subject, 
              values_from = subject, 
              values_fill = list(subject = F),
              values_fn = list(subject = is.character)) %>% 
  select(-`NA`)
#> # A tibble: 3 x 4
#>   student_id Maths Science English
#>        <dbl> <lgl> <lgl>   <lgl>  
#> 1          1 TRUE  TRUE    FALSE  
#> 2          3 FALSE FALSE   TRUE   
#> 3          4 FALSE FALSE   FALSE

Created on 2020-06-09 by the reprex package (v0.3.0)

1 Like

Hi @jameshowison,

How would this work for you?

# Load libraries ----------------------------------------------------------
library("tidyverse")

# Create example data -----------------------------------------------------
set.seed(416781)
n <- 10
d <- tibble(student_id = sample(x = seq(from = 1, to = 10),
                                size = n,
                                replace = TRUE),
            subject = sample(x = c("Math", "Science", "English"),
                             size = n,
                             replace = TRUE))

# Create one-hot encoding -------------------------------------------------
d_enc <- d %>% 
  mutate(n = 1) %>% 
  distinct %>% 
  pivot_wider(id_cols = student_id,
              names_from = subject,
              values_from = n) %>% 
  mutate_at(vars(matches("Math|Science|English")), replace_na, 0)

Yielding

> d
# A tibble: 10 x 2
   student_id subject
        <int> <chr>  
 1          3 Math   
 2          1 Math   
 3         10 Math   
 4          7 Science
 5          3 Science
 6          4 English
 7          5 English
 8          3 Science
 9          7 Science
10         10 English
> d_enc
# A tibble: 6 x 4
  student_id  Math Science English
       <int> <dbl>   <dbl>   <dbl>
1          3     1       1       0
2          1     1       0       0
3         10     1       0       1
4          7     0       1       0
5          4     0       0       1
6          5     0       0       1

Hope it helps :slightly_smiling_face:

2 Likes

Thanks. I was happy with the T/F encoding, just surprised it was hard :slight_smile:

Playing with what you've come up with suggested another approach using group_by and tally, creating the values_from column :slight_smile:

library("tidyverse")

# Create example data -----------------------------------------------------
set.seed(416781)
n <- 10
d <- tibble(student_id = sample(x = seq(from = 1, to = 10),
                                size = n,
                                replace = TRUE),
            subject = sample(x = c("Math", "Science", "English"),
                             size = n,
                             replace = TRUE))

# Create one-hot encoding -------------------------------------------------

d %>% 
  distinct() %>% # removes extras created by randomness above.
  mutate(subject = as_factor(subject)) %>% 
  group_by(student_id, subject, .drop = F) %>% 
  tally() %>% # creates column n
  pivot_wider(names_from = subject,
              values_from = n)
#> # A tibble: 6 x 4
#> # Groups:   student_id [6]
#>   student_id  Math English Science
#>        <int> <int>   <int>   <int>
#> 1          1     1       1       0
#> 2          3     1       0       0
#> 3          4     0       1       0
#> 4          8     1       0       1
#> 5          9     0       1       0
#> 6         10     1       0       0

Created on 2020-06-09 by the reprex package (v0.3.0)

And in doing that I realized that this can be thought of as a table (aka frequency/contingency table) operation:

library(tidyverse)
tribble(~student_id, ~subject,
                  1, "Maths",
                  1, "Science",
                  3, "English",
                  4, NA_character_) %>% 
  janitor::tabyl(student_id, subject, show_na = F)
#>  student_id English Maths Science
#>           1       0     1       1
#>           3       1     0       0

Created on 2020-06-09 by the reprex package (v0.3.0)

And the results are pretty straightforward to convert to T/F:

library(tidyverse)
tribble(~student_id, ~subject,
                  1, "Maths",
                  1, "Science",
                  3, "English",
                  4, NA_character_) %>% 
  janitor::tabyl(student_id, subject, show_na = F) %>% 
  mutate_at(vars(-student_id), `>`, 0 )
#>   student_id English Maths Science
#> 1          1   FALSE  TRUE    TRUE
#> 2          3    TRUE FALSE   FALSE

Created on 2020-06-09 by the reprex package (v0.3.0)

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.