How to apply one_hot manually

myusername · October 15, 2019, 2:31am

Hello, I am a postgrad student and I have an assignment where I am asked to do one-hot encoding on the categorical columns of a dataframe. The process must be completed manually, without applying the function one_hot. I would like to understand what steps the function applies so that I can apply them manually to my dataframe.

#homework

andresrcs · October 15, 2019, 2:41am

one_hot() is not a base R function, so I think you should specify which package you are referring to. Ideally, you should present your issue as a REPRoducible EXample (reprex).
Please take a look at our homework policy to learn how to properly ask homework-inspired questions here.

myusername · October 15, 2019, 2:44am

I honestly don't know. That is why I am asking.

FJCC · October 15, 2019, 2:45am

You can see the code of a function by typing its name, without parentheses, in the console.
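A minimal illustration, using base R's sd() only because it needs no extra package (the choice of function is just an example, and the same pattern should apply to any function whose source is written in R):

```r
# Typing a function's name without parentheses prints its source.
sd

# For a function that lives in a package, the pkg::name form works too.
# stats::sd is used here purely as an example of that form.
stats::sd

# body() and formals() return the code and the arguments as R objects.
body(sd)
names(formals(sd))
```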

Leon · October 15, 2019, 11:54am

What @FJCC is suggesting here, @myusername, is that if you type the following at the console:

> one_hot

i.e. without the parentheses, then you can see how the function works. The likely reason people are reluctant to give you the answer is that you will learn little from simply being handed the code. The whole process of thinking the solution through, implementing it, and checking that it works is where you learn something.

Give it a shot. If you get stuck, come back here with what you did and where you are stuck, and I'm certain that people will gladly chip in.

myusername · October 15, 2019, 6:24am

I am trying to apply the function one_hot manually in R for an assignment.

sample of my dataset

a <- c('red','red','green')
b <- c('large', 'medium', 'small')
c <- c('wide','narrow','narrow')

df <- data.frame(a, b, c)

Using the one_hot function from the scorecard package returns this output:

one_hot(df)

output

   a_green a_red b_large b_medium b_small c_narrow c_wide
1:       0     1       1        0       0        0      1
2:       0     1       0        1       0        1      0
3:       1     0       0        0       1        1      0

I would like to create the same output without using the function. So far I have done these steps:

converted the categorical columns to factors

# Convert every column to a factor; df[] keeps the data.frame structure
df[] <- lapply(df, as.factor)

found the number of levels (k) of each column. I wrote this function:

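to.encode <- c('a', 'b', 'c')

one.hot <- function(df, to.encode) {
  # lapply always returns a list, even when all columns have the same number of levels
  k <- lapply(df[to.encode], levels)
  for (i in k) {
    if (!is.null(i)) {
      # number of levels minus one (k - 1)
      len <- length(i) - 1
      print(len)
    }
  }
}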

The output is the number of levels minus 1 (k-1) for each column:

> one.hot(df, to.encode)
[1] 1
[1] 2
[1] 1

Now I want to create (k-1) new columns for each categorical column. I want to set the value to 1 if the original variable's value corresponds to that column, and 0 otherwise.

Any advice on how to take this to the next step? Thank you

Leon · October 15, 2019, 7:43am

Please refrain from re-posting your question, as it clutters the discussion board. I recommend continuing in the original thread.

Leon · October 16, 2019, 11:40am

Hi @myusername,

Here is a bit of generic code for one-hot encoding to get you started:

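set.seed(859315)  # fix the seed so the sampled categories are reproducible
n = 10
# draw n category labels from 1, 2, 3
categories = sample(x = seq(from = 1, to = 3), size = n, replace = TRUE)
# build one indicator vector per element, then transpose so rows are observations
one_hot = t(sapply(X = categories, FUN = function(x_i){
  v = c(0, 0, 0)  # one slot per category
  v[x_i] = 1      # flag the slot matching this element's category
  return(v)
}))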

Yielding:

> categories
 [1] 1 3 1 2 1 2 1 1 1 1
> one_hot
      [,1] [,2] [,3]
 [1,]    1    0    0
 [2,]    0    0    1
 [3,]    1    0    0
 [4,]    0    1    0
 [5,]    1    0    0
 [6,]    0    1    0
 [7,]    1    0    0
 [8,]    1    0    0
 [9,]    1    0    0
[10,]    1    0    0
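The same matrix can probably be built without an explicit function per element. The sketch below is one possible vectorised variant, assuming the categories are integer codes from 1 to the number of levels; the input vector and the names n_levels, one_hot_diag and one_hot_outer are made up for illustration:

```r
categories <- c(1, 3, 1, 2)   # small hypothetical input, not the sampled data above
n_levels <- 3

# Row i of the identity matrix is the one-hot vector for category i,
# so indexing the rows by `categories` encodes every element at once.
one_hot_diag <- diag(n_levels)[categories, ]

# Equivalent comparison-based form: 1 where the value equals the column index.
one_hot_outer <- outer(categories, seq_len(n_levels), "==") * 1

one_hot_diag
```

Both forms rely on the category codes being valid row or column indices, so character labels would first need converting, for example with as.integer() on a factor.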

See if you can formalise it and apply it to your data
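One way this might be formalised for the sample data frame posted earlier in the thread is sketched below. It is only a base R sketch, not the scorecard implementation: manual_one_hot and its to_encode argument are hypothetical names, and it returns a plain data.frame with one 0/1 column per level, named column_level.

```r
# Sample data from earlier in the thread.
df <- data.frame(
  a = c("red", "red", "green"),
  b = c("large", "medium", "small"),
  c = c("wide", "narrow", "narrow"),
  stringsAsFactors = TRUE
)

# manual_one_hot is a hypothetical helper name, not part of any package.
manual_one_hot <- function(data, to_encode = names(data)) {
  encoded <- lapply(to_encode, function(col) {
    f <- as.factor(data[[col]])
    # One 0/1 column per level: 1 where the row's value equals that level.
    m <- sapply(levels(f), function(lev) as.integer(f == lev))
    # sapply can simplify to a plain vector, so force an n-row matrix
    # and attach the column_level names.
    m <- matrix(
      m,
      nrow = length(f),
      dimnames = list(NULL, paste(col, levels(f), sep = "_"))
    )
    as.data.frame(m)
  })
  # Bind the per-column blocks side by side.
  do.call(cbind, encoded)
}

manual_one_hot(df)
```

If (k-1) columns per variable are wanted instead, as described earlier in the thread, iterating over levels(f)[-1] in both the sapply() call and the column names would drop the first level of each factor.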
