How to one-hot encode variables with two outputs

eugenio.alladio · March 24, 2021, 11:28am

I have the following data frame, in which columns sharing a base name (e.g. AA and AA.1) hold the two outputs of a single feature. I'd like to one-hot encode it so that each feature gets one indicator column per observed value, as follows:

Original dataframe:

 data.table::data.table(
         AA = c("12", "11", "13"),
        AA.1 = c("11", "7", "13"),
         BB = c("3", "4", "7"),
        BB.1 = c("8", "9", "3")
 )

Final dataframe:

data.table::data.table(
        AA.7 = c(0, 1, 0),
       AA.11 = c(1, 1, 0),
       AA.12 = c(1, 0, 0),
       AA.13 = c(0, 0, 1),
        BB.3 = c(1, 0, 1),
        BB.4 = c(0, 1, 0),
        BB.7 = c(0, 0, 1),
        BB.8 = c(1, 0, 0),
        BB.9 = c(0, 1, 0)
)

I tried dplyr and tidyr, but I don't know how to handle features whose values are split across two columns like this.

gtmbini · March 24, 2021, 7:30pm

Here is a workaround. It may not be the perfect solution, but it should give you a starting point. If you have more columns, you will need a way to select each group of columns together. Here I use "AA" and "BB" as prefixes to loop over, selecting the columns that start with each one. Hope it helps.
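One possible way to build that list of prefixes automatically, rather than typing it by hand, is to strip the numeric suffix from the column names. This is only a sketch: it assumes that paired columns differ solely by a trailing ".<number>" (as in AA and AA.1), and `cols` is a stand-in for `names(data)`.

```r
# column names of the example data (stand-in for names(data))
cols <- c("AA", "AA.1", "BB", "BB.1")

# assumption: the second column of each pair only adds a ".<digits>" suffix
index <- unique(sub("\\.[0-9]+$", "", cols))
index
#> [1] "AA" "BB"
```

The workaround below hard-codes the same vector instead.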

library(tidyverse)
library(data.table)
#> 
#> Attaching package: 'data.table'
#> The following objects are masked from 'package:dplyr':
#> 
#>     between, first, last
#> The following object is masked from 'package:purrr':
#> 
#>     transpose
data<- data.table::data.table(
  AA = c("12", "11", "13"),
  AA.1 = c("11", "7", "13"),
  BB = c("3", "4", "7"),
  BB.1 = c("8", "9", "3")
) 
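# give every row an integer ID so dcast keeps the rows in their original order
# (character row names would sort as "1", "10", "2", ... once there are 10+ rows)
data$ID <- seq_len(nrow(data))

# create an empty data frame with one row per observation
data_enc <- data.frame()[1:nrow(data), ]
rownames(data_enc) <- c(1:nrow(data))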

index<- c("AA", "BB")

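for (i in index) {
  # stack the columns of group i into long format and flag every value with ind = 1
  long <- melt(data %>% select(starts_with(i), ID), id.vars = "ID")
  long <- setDT(long)[, ind := 1] %>% mutate(value = paste0(i, value))

  # spread back out: one column per prefixed value, summing the flags per ID
  loop <- dcast(long, ID ~ value, value.var = "ind", fill = 0, fun.aggregate = sum)

  data_enc <- cbind(data_enc, loop)
}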

# a value can occur in both columns of the same row (e.g. 13 in row 3),
# which sums to 2, so cap every count at 1
data_enc <- data_enc %>% select(-ID)
data_enc[data_enc >= 2] <- 1
data_enc
#>   AA11 AA12 AA13 AA7 BB3 BB4 BB7 BB8 BB9
#> 1    1    1    0   0   1   0   0   1   0
#> 2    1    0    0   1   0   1   0   0   1
#> 3    0    0    1   0   1   0   1   0   0
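
Since the question mentioned dplyr and tidyr, the same idea can probably be expressed with a long-then-wide reshape in tidyr alone. The following is a minimal sketch rather than a tested general solution: it assumes that paired columns differ only by a trailing ".<number>" suffix, and the helper columns `ID`, `column`, `feature` and `present` are names made up for illustration.

```r
library(dplyr)
library(tidyr)

data <- data.frame(
  AA   = c("12", "11", "13"),
  AA.1 = c("11", "7", "13"),
  BB   = c("3", "4", "7"),
  BB.1 = c("8", "9", "3")
)

encoded <- data %>%
  mutate(ID = row_number()) %>%
  # long format: one row per (ID, original column, value)
  pivot_longer(-ID, names_to = "column", values_to = "value") %>%
  # assumption: paired columns differ only by a trailing ".<digits>" suffix
  mutate(feature = sub("\\.[0-9]+$", "", column)) %>%
  # a value repeated within one row (13 in row 3) should count only once
  distinct(ID, feature, value) %>%
  mutate(present = 1) %>%
  # wide format: one indicator column per feature/value pair, 0 where absent
  pivot_wider(
    id_cols = ID,
    names_from = c(feature, value),
    names_sep = ".",
    values_from = present,
    values_fill = 0
  ) %>%
  arrange(ID) %>%
  select(-ID)

encoded
```

This names the columns AA.11, AA.12 and so on, as in the requested output. The columns come out in order of first appearance rather than sorted, so reorder them afterwards if the order matters.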

eugenio.alladio · March 25, 2021, 10:29am

Dear @gtmbini, thank you very much for this, it solved my problem.
Thanks for your kind help!

Eugenio
