recipe step_medianimpute within group

ajing · January 15, 2021, 12:55am

I am just wondering whether there is any functionality to do the median imputation within a group. For example, if I have an income column and a zip code column, I want to impute the income within one zip code.

Max · January 15, 2021, 5:19pm

We don't have that. You would probably get something similar to what you want using step_impute_bag() or step_impute_linear().
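For readers who want the literal group-wise median anyway, it can be computed outside the recipe. The following is only a minimal sketch with dplyr on made-up data; the column names `zip` and `income` and all values are illustrative assumptions, not taken from the thread.

```r
library(dplyr)

# Hypothetical toy data: `zip` and `income` are illustrative names only
df <- tibble(
  zip    = c("10001", "10001", "10001", "94105", "94105", "94105"),
  income = c(50, NA, 70, 100, 120, NA)
)

# Replace each missing income with the median of the observed incomes
# in the same zip code
df_imputed <- df %>%
  group_by(zip) %>%
  mutate(income = if_else(is.na(income), median(income, na.rm = TRUE), income)) %>%
  ungroup()
```

Because this runs outside the recipe, the medians would need to be computed on the training set only and then joined onto new data to avoid leakage during resampling. A group with no observed values at all stays NA.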

ajing · January 19, 2021, 2:42am

Thanks, Max! When using step_impute_linear(), if I have a categorical variable and I want to use step_integer() for the model but step_dummy() for step_impute_linear(), is there a way to do that?

Max · January 19, 2021, 7:44pm

I'm not sure. Can you tell us what you want to do by showing the recipe (without the imputation parts)?

ajing · January 20, 2021, 2:37am

Here is the recipe without imputation. division and super_department are categorical variables.

library(tidymodels)

# Leave the recipe untrained (no prep() here): the workflow preps it
# when it is fit, and add_recipe() rejects an already-trained recipe.
preproc <- recipe(
    cpc ~ .,
    data = lag_df) %>%
  step_integer(division, super_department) %>%
  step_normalize(recipes::all_predictors(), -division, -super_department) %>%
  step_zv(recipes::all_predictors())

# Random forest regression fit with ranger
rf <- rand_forest() %>%
  set_mode("regression") %>%
  set_engine("ranger", num.threads = 30, importance = "impurity")

wflow <-
  workflow() %>%
  add_recipe(preproc) %>%
  add_model(rf)

Max · January 20, 2021, 3:44pm

I would put the imputation parts first. However, based on your first message, I'm not sure what you are imputing.

Also, regarding step_integer(division, super_department):

If you don't mind me asking, what is the purpose of this? These seem like unordered categorical data, and splitting on them as if they were numeric scales would be bad.

ajing · January 20, 2021, 10:35pm

I have a few more features of type double to impute, which are not listed in the recipe.

step_integer(division, super_department)

is something I don't like, but I'm not sure how to handle it another way. The number of distinct values for those variables is high, and I will easily run out of memory if I use step_dummy. I can probably use step_other.
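As a rough illustration of the step_other idea, the sketch below pools rare levels into an "other" level before creating dummy variables, which keeps the number of indicator columns small. The data frame, the level counts, and the 5% threshold are assumptions made up for the example.

```r
library(recipes)

# Hypothetical toy data: two common levels and two rare ones
toy <- data.frame(
  y        = seq_len(100) / 10,
  division = factor(rep(c("a", "b", "c", "d"), times = c(50, 45, 3, 2)))
)

rec <- recipe(y ~ ., data = toy) %>%
  # Pool levels that occur in fewer than 5% of training rows into "other"
  step_other(division, threshold = 0.05) %>%
  # Dummy-code the reduced set of levels
  step_dummy(division)

baked <- bake(prep(rec), new_data = NULL)
names(baked)
```

With a high-cardinality column, the threshold controls how many indicator columns survive, so it is a natural candidate for tuning.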

Max · January 21, 2021, 7:45pm

I suggest using step_lencode_mixed() (from the embed package) to convert each of them to a single numeric feature.

Overall, I'd put the imputation steps first.
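Putting the suggestions together, a recipe along these lines might look like the sketch below. It runs on simulated data: `toy`, `feat_a`, and all values are hypothetical stand-ins for the real `lag_df` and its numeric columns, and placing step_zv() before step_normalize() is a choice made here rather than something stated in the thread. Since step_impute_linear() fits an ordinary linear model, factor columns can be used as imputation predictors while they are still factors, so no separate dummy step should be needed for the imputation. step_lencode_mixed() needs the embed and lme4 packages.

```r
library(recipes)
library(embed)  # provides step_lencode_mixed(); requires lme4

set.seed(1)
n <- 200

# Simulated stand-in for lag_df; `feat_a` is a hypothetical numeric feature
division         <- factor(sample(paste0("div", 1:5), n, replace = TRUE))
super_department <- factor(sample(paste0("sd", 1:8), n, replace = TRUE))
toy <- data.frame(
  division         = division,
  super_department = super_department,
  feat_a           = rnorm(n),
  cpc              = as.integer(division) + 0.5 * as.integer(super_department) + rnorm(n)
)
toy$feat_a[sample(n, 20)] <- NA  # introduce missing values to impute

rec <- recipe(cpc ~ ., data = toy) %>%
  # 1. Impute first, while the categorical columns are still factors
  step_impute_linear(feat_a, impute_with = imp_vars(division, super_department)) %>%
  # 2. Replace each factor with a single numeric effect encoding
  step_lencode_mixed(division, super_department, outcome = vars(cpc)) %>%
  # 3. Drop constant columns, then center and scale the numeric predictors
  step_zv(all_predictors()) %>%
  step_normalize(all_numeric_predictors())

baked <- bake(prep(rec), new_data = NULL)
```

In a workflow, the untrained `rec` would be passed to add_recipe() so that imputation and encoding are re-estimated within each resample.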
