Creating "dummy" columns based on multiple columns

GreyMerchant · March 26, 2020, 6:51am

Hi,

I was trying to find a quick way to do this yesterday via tidyverse but I wasn't able to do so. I want to essentially create dummy columns but on the input of three columns.

You will see that we have a data frame below. As an example with row 12 we have the values 1,12,11 respective to a,b,c. Thus, the new column "1" should have a 1 there and all other columns should contain 0 until we get to 11 and 12 for that row which also contain a 1 for their respective columns. Just to add a second example, row 1 which is 14,6,10 should have 0's for all values in the new columns on their first row expect for 6,10,14 which should have a 1.

I sort of expect that a lot of mutates will have to happen to create the range but not sure how to write the logical expression concisely to be as little as possible code.

set.seed(333)

df <- data.frame(a = sample(1:17,15),
                 b = sample(1:17,15),
                 c = sample(1:17,15))

nirgrahamuk · March 26, 2020, 9:10am


df <- data.frame(a = sample(1:17,15),
                 b = sample(1:17,15),
                 c = sample(1:17,15))

library(tidyverse)

cols_to_make <-rlang::parse_exprs(paste0('ifelse(any(c(a,b,c)==',1:max(df),'),1,0)'))
df2 <- rowwise(df) %>% mutate(!!!cols_to_make)  %>% ungroup
names(df2) <- c(names(df),1:max(df))

# > df2
# # A tibble: 15 x 20
#        a     b     c    `1`   `2`   `3`   `4`   `5`   `6`   `7`   `8`   `9`  `10`   `11`  `12`  `13`  `14`  `15`  `16`  `17`
# <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1      4     6    13     0     0     0     1     0     1     0     0     0     0     0     0     1     0     0     0     0
# 2     10     5    10     0     0     0     0     1     0     0     0     0     1     0     0     0     0     0     0     0
# 3      9     9     3     0     0     1     0     0     0     0     0     1     0     0     0     0     0     0     0     0
# 4      7     8    12     0     0     0     0     0     0     1     1     0     0     0     1     0     0     0     0     0
# 5     15    12     5     0     0     0     0     1     0     0     0     0     0     0     1     0     0     1     0     0
# 6     13     1    15     1     0     0     0     0     0     0     0     0     0     0     0     1     0     1     0     0
# 7      5    11     9     0     0     0     0     1     0     0     0     1     0     1     0     0     0     0     0     0
# 8      3     2     8     0     1     1     0     0     0     0     1     0     0     0     0     0     0     0     0     0
# 9     16     4    11     0     0     0     1     0     0     0     0     0     0     1     0     0     0     0     1     0
# 10     8    16    17     0     0     0     0     0     0     0     1     0     0     0     0     0     0     0     1     1
# 11    14     3    14     0     0     1     0     0     0     0     0     0     0     0     0     0     1     0     0     0
# 12     6    14     4     0     0     0     1     0     1     0     0     0     0     0     0     0     1     0     0     0
# 13    11    13     7     0     0     0     0     0     0     1     0     0     0     1     0     1     0     0     0     0
# 14    12     7     1     1     0     0     0     0     0     1     0     0     0     0     1     0     0     0     0     0
# 15    17    15     2     0     1     0     0     0     0     0     0     0     0     0     0     0     0     1     0     1

siddharthprabhu · March 26, 2020, 9:47am

An alternate approach using the pivot_*() family of functions.

library(tidyverse)

set.seed(333)

df <- data.frame(a = sample(1:17,15),
                 b = sample(1:17,15),
                 c = sample(1:17,15))

df %>% 
  rownames_to_column() %>% 
  pivot_longer(cols = -rowname) %>% 
  add_column(count = 1) %>% 
  arrange(value) %>% 
  pivot_wider(id_cols = rowname, 
              names_from = value, 
              values_from = count, 
              values_fill = list(count = 0), 
              values_fn = list(count = mean)) %>% 
  arrange(as.integer(rowname)) %>% 
  select(-rowname)
#> # A tibble: 15 x 17
#>      `1`   `2`   `3`   `4`   `5`   `6`   `7`   `8`   `9`  `10`  `11`  `12`  `13`
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1     0     0     0     0     0     1     0     0     0     1     0     0     0
#>  2     0     0     0     0     0     0     0     0     1     0     0     0     1
#>  3     0     0     0     1     0     0     1     0     0     0     0     0     0
#>  4     1     1     0     1     0     0     0     0     0     0     0     0     0
#>  5     0     0     0     0     0     0     0     0     0     0     0     0     1
#>  6     0     0     0     0     1     0     0     0     0     0     1     1     0
#>  7     0     0     1     0     0     0     0     0     1     0     0     0     0
#>  8     0     0     0     0     0     1     0     0     1     0     0     0     0
#>  9     0     0     1     0     0     1     0     0     0     0     0     0     0
#> 10     0     0     0     1     0     0     0     0     0     1     0     1     0
#> 11     0     0     0     0     0     0     0     1     0     0     0     0     1
#> 12     1     0     0     0     0     0     0     0     0     0     1     1     0
#> 13     1     0     0     0     0     0     0     0     0     0     0     0     0
#> 14     0     1     0     0     0     0     0     0     0     0     1     0     0
#> 15     0     0     0     0     1     0     1     0     0     0     0     0     0
#> # ... with 4 more variables: `14` <dbl>, `15` <dbl>, `16` <dbl>, `17` <dbl>

^{Created on 2020-03-26 by the reprex package (v0.3.0)}

I presume you want the 'binary' form of one-hot encoding i.e. the column should contain 1 even if the same number is repeated more than once. If instead you want to represent how many times it occurred, simply change values_fn to list(count = sum).

GreyMerchant · March 27, 2020, 7:06am

This is exactly what I was looking for! Thanks so much

GreyMerchant · March 27, 2020, 7:07am

Thanks for the alternative. It is always great to see multiple ways of doing the same thing!

system · April 3, 2020, 7:07am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.