H2O target encoding on Regression

I'm working with H2O on a Regression problem.

I have about 10 continuous variables and 20 discrete variables. One of the discrete variables has high cardinality, so I want to use target encoding for it.

The target variable I need to predict is continuous.

I was reading the following document:

http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-munging/target-encoding.html

In that example they use a Gradient Boosting Machine on a classification problem. However, I tried the same steps on my regression problem.

At some point they say to run the following lines:

# Create a fold column in the train dataset
train$fold <- h2o.kfold_column(train, nfolds = 5, seed = 1234)

# Fit the target encoding map
te_map <- h2o.target_encode_fit(
  train,
  x = list("addr_state"),
  y = response,
  fold_column = "fold"
)

but when I run the second one I get the following error:

ERROR: Unexpected HTTP Status code: 500 Server Error (url = http://localhost:54321/99/Rapids)

java.lang.IllegalStateException
 [1] "java.lang.IllegalStateException: `target` must be a binary categorical vector. We do not support multi-class and continuos target case for now"
 [2] "    ai.h2o.automl.targetencoding.TargetEncoder.ensureTargetColumnIsBinaryCategorical(TargetEncoder.java:156)"                                  
 [3] "    ai.h2o.automl.targetencoding.TargetEncoder.prepareEncodingMap(TargetEncoder.java:105)"                                                     
 [4] "    water.rapids.ast.prims.mungers.AstTargetEncoderFit.apply(AstTargetEncoderFit.java:53)"                                                     
 [5] "    water.rapids.ast.prims.mungers.AstTargetEncoderFit.apply(AstTargetEncoderFit.java:23)"                                                     
 [6] "    water.rapids.ast.AstExec.exec(AstExec.java:63)"                                                                                            
 [7] "    water.rapids.Session.exec(Session.java:85)"                                                                                                
 [8] "    water.rapids.Rapids.exec(Rapids.java:94)"                                                                                                  
 [9] "    water.api.RapidsHandler.exec(RapidsHandler.java:38)"                                                                                       
[10] "    sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)"                                                                               
[11] "    sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)"                                                     
[12] "    java.lang.reflect.Method.invoke(Method.java:498)"                                                                                          
[13] "    water.api.Handler.handle(Handler.java:60)"                                                                                                 
[14] "    water.api.RequestServer.serve(RequestServer.java:462)"                                                                                     
[15] "    water.api.RequestServer.doGeneric(RequestServer.java:295)"                                                                                 
[16] "    water.api.RequestServer.doPost(RequestServer.java:221)"                                                                                    
[17] "    javax.servlet.http.HttpServlet.service(HttpServlet.java:755)"                                                                              
[18] "    javax.servlet.http.HttpServlet.service(HttpServlet.java:848)"                                                                              
[19] "    org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:684)"                                                                    
[20] "    org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:501)"                                                                
[21] "    org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1086)"                                                        
[22] "    org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:427)"                                                                 
[23] "    org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1020)"                                                         
[24] "    org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)"                                                             
[25] "    org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)"                                                     
[26] "    org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)"                                                           
[27] "    water.webserver.jetty8.Jetty8ServerAdapter$LoginHandler.handle(Jetty8ServerAdapter.java:119)"                                              
[28] "    org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)"                                                     
[29] "    org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)"                                                           
[30] "    org.eclipse.jetty.server.Server.handle(Server.java:370)"                                                                                   
[31] "    org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)"                                            
[32] "    org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)"                                             
[33] "    org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:984)"                                                  
[34] "    org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1045)"                                  
[35] "    org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:861)"                                                                          
[36] "    org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:236)"                                                                     
[37] "    org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)"                                                    
[38] "    org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)"                                              
[39] "    org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)"                                                          
[40] "    org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)"                                                           
[41] "    java.lang.Thread.run(Thread.java:748)"                                                                                                     

Error in .h2o.doSafeREST(h2oRestApiVersion = h2oRestApiVersion, urlSuffix = page,  : 

ERROR MESSAGE:

`target` must be a binary categorical vector. We do not support multi-class and continuos target case for now

where at the end we can read the following:

`target` must be a binary categorical vector.
We do not support multi-class and continuos target case for now

My questions are:

  1. Does H2O not support Target Encoding when the target variable is continuous?
  2. If it does not, do you know whether they plan to support it in the future?
  3. Do you know about any R package that supports Target Encoding for discrete variables when the target variable is continuous (Regression)?

On that link, at the very beginning they say:

Target encoding is the process of replacing a categorical value with the mean of the target variable.
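
Taken literally, that definition can be sketched in a few lines of base R for a continuous target. This is only an illustration of the idea, not H2O's implementation; the toy data and the column name `addr_state_te` are made up for the example.

```r
# Toy data: one discrete variable and a continuous target (made-up values)
d <- data.frame(
  addr_state = c("CA", "CA", "NY", "NY", "NY", "TX"),
  y = c(10, 20, 1, 2, 3, 7)
)

# Mean of the target for each level of the discrete variable
level_means <- tapply(d$y, d$addr_state, mean)

# Replace each categorical value with the mean of the target for its level
d$addr_state_te <- as.numeric(level_means[as.character(d$addr_state)])

print(d)
```

Note that this naive version uses each row's own target value to build its encoding, which is why the documentation adds a fold column.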

"Mean of the target variable"? Normally the mean is only applicable to continuous variables. So, based on that, their algorithm should support continuous target variables.

Thanks!

You could give our vtreat package a try for the pre-processing step. For classification (edit: and regression) problems it supports a cross-validated logistic encoding: https://github.com/WinVector/vtreat

Hi @JohnMount, thank you for your suggestion.

A couple of questions:

  1. Does the vtreat package support Target Encoding for discrete variables when the target variable is continuous?

  2. Can I use the vtreat package only for the Target Encoding step, by specifying the following:

  • the dataset I want to apply Target Encoding to
  • the target variable (which will be used to calculate the mean, etc)
  • the discrete variable I want to encode
  • then as return value, I get the previous dataset with a new column which corresponds to the discrete variable encoded

Is that possible?

Thanks!


vtreat supports impact coding of discrete variables when the dependent variable is continuous (we call this the regression case). You can specify which columns to process and which data set to use via the parameters. Here is an example from help(mkCrossFrameNExperiment):

library(vtreat)

# build example
# notice y is related to zip2, but not zip
set.seed(23525)
zip <- paste('z',1:100)
N <- 200
d <- data.frame(zip=sample(zip,N,replace=TRUE),
                zip2=sample(zip,N,replace=TRUE),
                y=runif(N))
del <- runif(length(zip))
names(del) <- zip
d$y <- d$y + del[d$zip2]
d$yc <- d$y>=mean(d$y)

# treat the variables zip and zip2
cN <- mkCrossFrameNExperiment(d,c('zip','zip2'),'y',
   rareCount=2,rareSig=0.9)

# cN$crossFrame is the treated training data, use prepare() with treatments to
# prepare new data later

# notice the zip2 variable is useful, but the zip one is not 
# (this is the cross-frame fighting over-fit for us)
cor(cN$crossFrame$y,cN$crossFrame$zip_catN)  # poor
cor(cN$crossFrame$y,cN$crossFrame$zip2_catN) # better

treatments <- cN$treatments
dTrainV <- cN$crossFrame
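
The cross-frame idea mentioned in the comments above (each row is encoded using only data from other folds) can be illustrated without vtreat. This is a rough base-R sketch, not vtreat's actual implementation; the fold assignment, the column name `zip_oof`, and the fallback for unseen levels are assumptions made for illustration.

```r
set.seed(1)
N <- 200
d <- data.frame(
  zip = sample(paste("z", 1:20), N, replace = TRUE),
  y = runif(N)
)

# Assign each row to one of nfolds folds at random
nfolds <- 5
d$fold <- sample(rep(seq_len(nfolds), length.out = N))

d$zip_oof <- NA_real_
for (k in seq_len(nfolds)) {
  held_out <- d$fold == k
  # Level means computed only from rows outside fold k
  m <- vapply(split(d$y[!held_out], d$zip[!held_out]), mean, numeric(1))
  enc <- unname(m[as.character(d$zip[held_out])])
  # Levels not seen outside fold k fall back to the mean of the other folds
  enc[is.na(enc)] <- mean(d$y[!held_out])
  d$zip_oof[held_out] <- enc
}

# y is unrelated to zip here, so the out-of-fold code should not look predictive
cor(d$y, d$zip_oof)
```

Because no row contributes to its own encoding, a high-cardinality variable with no real signal cannot memorize the target the way a naive per-level mean would.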

Thank you @JohnMount for your suggestion.

One last question: what is Impact Coding? Is it similar to Target Encoding?

Is it a widely used term, or one specific to the vtreat package?

Thanks!


"impact coding" is the term Nina Zumel used in 2012 when she first documented variations of the method ( http://www.win-vector.com/blog/2012/07/modeling-trick-impact-coding-of-categorical-variables-with-many-levels/ ). The method goes much further back and we have some notes and references on this in the formal paper https://arxiv.org/abs/1611.09477 .

Notice that the documentation for "target coding" describes it as a mean, whereas impact coding has many variations. Impact coding computes a conditional prediction, so it should be very similar to target coding.
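
To make the relationship concrete, here is a minimal base-R sketch. It assumes the simplest unsmoothed variant of an impact code, namely the per-level conditional mean of the target minus the grand mean; actual vtreat codes involve further adjustments, so treat the column names and the formula as illustrative only.

```r
# Toy data (made-up values)
d <- data.frame(
  zip = c("z1", "z1", "z2", "z2", "z3"),
  y = c(1, 3, 10, 12, 4)
)

grand_mean <- mean(d$y)
level_means <- tapply(d$y, d$zip, mean)

# Plain mean ("target") encoding: conditional mean of y given the level
d$zip_target <- as.numeric(level_means[as.character(d$zip)])

# Assumed simple impact code: conditional mean minus the grand mean
d$zip_impact <- d$zip_target - grand_mean

print(d)
```

Under this assumption the two encodings differ only by a constant shift, so they carry the same information for a downstream model.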


Thank you @JohnMount for the clarification.

