How to one-hot encode variables with two outputs

I have the following data frame in which pairs of columns share a name (e.g. AA and AA.1), indicating a feature with two outputs. I'd like to one-hot encode this data frame as follows:

Original dataframe:

data.table::data.table(
  AA   = c("12", "11", "13"),
  AA.1 = c("11", "7", "13"),
  BB   = c("3", "4", "7"),
  BB.1 = c("8", "9", "3")
)

Final dataframe:

data.table::data.table(
        AA.7 = c(0, 1, 0),
       AA.11 = c(1, 1, 0),
       AA.12 = c(1, 0, 0),
       AA.13 = c(0, 0, 1),
        BB.3 = c(1, 0, 1),
        BB.4 = c(0, 1, 0),
        BB.7 = c(0, 0, 1),
        BB.8 = c(1, 0, 0),
        BB.9 = c(0, 1, 0)
)

I tried to use dplyr and tidyr, but I don't know how to deal with this kind of paired output.

Here is a workaround. It might not be a perfect solution, but it should give you something to build on. If you have more columns to work with, you will need a way to select each group of columns together. Here I use "AA" and "BB" as prefixes to loop over, selecting the columns that start with each one. I hope it helps.
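If there are many column groups, the prefixes do not have to be typed by hand. The following is a minimal sketch, assuming that every second output is marked by a ".<number>" suffix as in the question; the variable name `cols` is only illustrative.

```r
# Column names as in the question; the ".1" suffix marks the second output.
cols <- c("AA", "AA.1", "BB", "BB.1")

# Strip a trailing ".<digits>" suffix and keep each prefix once.
index <- unique(sub("\\.[0-9]+$", "", cols))
index
```

Note that starts_with("AA") would also match a column such as "AAB", so this only works cleanly when no prefix is the beginning of another.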

library(tidyverse)
library(data.table)
#> 
#> Attaching package: 'data.table'
#> The following objects are masked from 'package:dplyr':
#> 
#>     between, first, last
#> The following object is masked from 'package:purrr':
#> 
#>     transpose
data<- data.table::data.table(
  AA = c("12", "11", "13"),
  AA.1 = c("11", "7", "13"),
  BB = c("3", "4", "7"),
  BB.1 = c("8", "9", "3")
) 
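# Give each row a numeric ID so that dcast() returns rows in their original order
# (character IDs would sort as "1", "10", "2", ... once there are 10 or more rows)
data$ID <- seq_len(nrow(data))

# Create an empty data frame with one row per observation to collect the encoded columns
data_enc <- data.frame()[1:nrow(data), ]
rownames(data_enc) <- 1:nrow(data)

# Prefixes identifying each group of columns
index <- c("AA", "BB")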

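for (i in index) {
  # Reshape the columns of group i to long format, flag each observed value with 1,
  # then cast back to wide so that each distinct value becomes its own column
  loop <- data %>%
    select(starts_with(i), ID) %>%
    melt(id.vars = "ID") %>%
    mutate(ind = 1, value = paste0(i, value)) %>%
    dcast(ID ~ value, value.var = "ind", fill = 0, fun.aggregate = sum)

  data_enc <- cbind(data_enc, loop)
}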

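# A value can occur in both columns of a group (e.g. 13 in AA and AA.1), giving a count of 2;
# drop the ID columns and cap every count at 1
data_enc <- data_enc %>% select(-ID)
data_enc[data_enc >= 2] <- 1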
data_enc
#>   AA11 AA12 AA13 AA7 BB3 BB4 BB7 BB8 BB9
#> 1    1    1    0   0   1   0   0   1   0
#> 2    1    0    0   1   0   1   0   0   1
#> 3    0    0    1   0   1   0   1   0   0
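Since the question mentions dplyr and tidyr, the same idea can also be sketched without a loop. This is an alternative under a few assumptions, not the method used above: the values are integer-like strings, second outputs carry a ".<number>" suffix, and the helper column names `feature`, `value`, and `present` are made up for illustration. It should give the dotted column names (AA.7, AA.11, ...) asked for in the question, ordered numerically within each group.

```r
library(dplyr)
library(tidyr)

data <- data.frame(
  AA   = c("12", "11", "13"),
  AA.1 = c("11", "7", "13"),
  BB   = c("3", "4", "7"),
  BB.1 = c("8", "9", "3")
)

encoded <- data %>%
  mutate(ID = row_number()) %>%
  # One row per (ID, original column, value)
  pivot_longer(-ID, names_to = "feature", values_to = "value") %>%
  # Collapse "AA.1" into "AA" and convert values so they sort numerically
  mutate(feature = sub("\\.[0-9]+$", "", feature),
         value = as.integer(value)) %>%
  # A value found in both columns of a group (e.g. 13) counts once
  distinct(ID, feature, value) %>%
  # pivot_wider() creates columns in order of first appearance
  arrange(feature, value) %>%
  mutate(present = 1) %>%
  pivot_wider(id_cols = ID, names_from = c(feature, value),
              values_from = present, names_sep = ".", values_fill = 0) %>%
  arrange(ID) %>%
  select(-ID)

encoded
```

Because duplicates are removed with distinct() before widening, no separate step is needed to cap counts at 1.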

Dear @gtmbini, thank you very much for this; you solved my task.
Thanks for your kind help!

Eugenio
