embed package target encoding issue

amin0511 · March 15, 2023, 7:03am

I am trying to target encode my categorical variable with an lmer-based mixed model, using `step_lencode_mixed()` from embed. The target is the binary variable "Disposition", and the categorical variable "RuleCombination" has more than 700 levels. The problem is that some levels end up with two entries in the encoding. For example, the level "HRG" has the following counts per target class:

| Categorical variable level | NFA | RFA | Grand total |
|---|---|---|---|
| HRG | 8243 | 85 | 8328 |

"NFA" and "RFA" are the levels of the binary outcome. After encoding I get the following embedded values:

| Level | Encoded value |
|---|---|
| HRG | 16.76811985 |
| HRGDisposition | -9.285740011 |

So for the original "HRG" level I now get both "HRG" and "HRGDisposition".
I don't know why this happens for only some of the levels.

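    library(recipes)
    library(embed)

    # Encode RuleCombination with a mixed model fitted against Disposition
    data_mixed <-
      recipe(Disposition ~ ., data = Data202101_train) %>%
      step_lencode_mixed(
        RuleCombination,
        outcome = vars(Disposition)
      ) %>%
      prep(training = Data202101_train)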
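
As a rough sanity check on the numbers above: the embed documentation describes these encodings as being on the linear-predictor scale, which for a factor outcome means log-odds. Assuming that holds here, and ignoring the shrinkage a mixed model applies (which should be small for a level with more than 8,000 rows), the value for "HRG" would be expected to sit near the empirical log-odds computed from the table. This is only a sketch using the counts quoted in the post; which class is treated as the event is an assumption, so both signs are shown.

```r
# Counts for level "HRG" taken from the table in this post.
nfa <- 8243
rfa <- 85

# Empirical log-odds of RFA versus NFA for this level.
log_odds_rfa <- log(rfa / nfa)

# The sign flips if NFA is treated as the event instead.
log_odds_nfa <- -log_odds_rfa

round(c(RFA = log_odds_rfa, NFA = log_odds_nfa), 2)
#>   RFA   NFA
#> -4.57  4.57
```

Neither 16.77 nor -9.29 is close to either value, which suggests that the two entries are not simply the same estimate reported twice.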

technocrat · March 15, 2023, 7:46am

The question can't be fairly evaluated without a reprex (reproducible example); see the FAQ. Encoding high-cardinality data such as yours is an active area of research. This recent paper provides guidance on using the lmer (regression) and glmer (classification) functions from the lme4 package in R as an efficient way to fit GLMMs, with R code linked.
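
A reprex for this kind of problem could start from simulated data rather than the real table. The sketch below is one possible starting point, not the original poster's data: the level names (`R1`, `R2`, ...), the sizes, and the effect distribution are all assumptions chosen to give a rare "RFA" class and a moderately high-cardinality predictor. The recipe part only runs if recipes, embed, and lme4 are installed, and it reads the fitted encoding through `tidy()`, assuming its `level` column holds the level names.

```r
# Simulated stand-in for the real data (all names and sizes are assumptions).
set.seed(2023)
n_levels <- 50
n_rows <- 5000
level_names <- paste0("R", seq_len(n_levels))

sim <- data.frame(
  RuleCombination = factor(sample(level_names, n_rows, replace = TRUE))
)

# Level-specific log-odds centred on a rare "RFA" outcome (about 1.3%).
level_effect <- rnorm(n_levels, mean = qlogis(0.013), sd = 0.5)
names(level_effect) <- level_names
p_rfa <- plogis(level_effect[as.character(sim$RuleCombination)])

sim$Disposition <- factor(
  ifelse(runif(n_rows) < p_rfa, "RFA", "NFA"),
  levels = c("NFA", "RFA")
)

# Fit the same kind of recipe as in the question, if the packages are available.
pkgs <- c("recipes", "embed", "lme4")
if (all(vapply(pkgs, requireNamespace, logical(1), quietly = TRUE))) {
  library(recipes)
  library(embed)

  rec <- recipe(Disposition ~ RuleCombination, data = sim) %>%
    step_lencode_mixed(RuleCombination, outcome = vars(Disposition)) %>%
    prep(training = sim)

  enc <- tidy(rec, number = 1)

  # Encoded levels that are neither original levels nor the "..new" placeholder.
  print(setdiff(enc$level, c(levels(sim$RuleCombination), "..new")))
}
```

If the simulated data does not reproduce the extra levels, the level count and class imbalance can be pushed closer to the real data (700+ levels, 98.7/1.3) to narrow down what triggers them.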

amin0511 · March 15, 2023, 3:26pm

Thank you for the article. My question, though, had more to do with the peculiar result than with the approach. The example in the embed documentation ("Using Generalized Linear Models • embed") encodes the levels of the categorical variable and adds only one new level ("..new") to accommodate future unseen levels. Mine adds many new levels, with the target variable's name as a suffix. For example, "HRG" is a level of the categorical variable, but now there is also "HRGDisposition". By the way, the response is very imbalanced (98.7/1.3), and the cardinality of the categorical variable is over 700. Could the combination of these two have something to do with this result?
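
One way to pin down which levels are affected is to compare the level names in the fitted encoding against the levels in the training data. The sketch below uses toy vectors so that it runs on its own; "ABC" and "XYZ" are made-up levels. In practice `original_levels` would presumably come from `levels(factor(Data202101_train$RuleCombination))` and `encoded_levels` from the `level` column of `tidy(data_mixed, number = 1)`, assuming that is where the encoding is stored.

```r
# Toy stand-ins for the real level names (hypothetical except "HRG").
original_levels <- c("HRG", "ABC", "XYZ")
encoded_levels <- c("HRG", "HRGDisposition", "ABC", "XYZ", "..new")

# Anything that is neither an original level nor the "..new" placeholder
# is unexpected.
unexpected <- setdiff(encoded_levels, c(original_levels, "..new"))
unexpected

# Strip the outcome name to see which original levels the extras map back to.
affected <- sub("Disposition$", "", unexpected)
affected <- affected[affected %in% original_levels]
affected
```

Checking whether the affected levels share something in the raw data (very few "RFA" rows, unusual characters in the name, and so on) would help narrow down whether imbalance or cardinality is involved.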
