recipe step_medianimpute within group

I am just wondering whether there is any functionality to do the median imputation within a group. For example, if I have an income column and a zip code column, I want to impute the income within one zip code.

We don't have that. You would probably get something similar to what you want using step_impute_bag() or step_impute_linear().
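For reference, a within-group median can also be computed outside of recipes. The following is a minimal sketch with dplyr, not a recipes step; the toy data and the column names `zip_code` and `income` are assumptions for illustration. It computes the medians on whatever data it is given, so in a resampling setup the medians would still need to be learned from the training set only.

```r
library(dplyr)

# Toy data (hypothetical): income has a missing value in each zip code
df <- tibble(
  zip_code = c("10001", "10001", "10001", "94105", "94105", "94105"),
  income   = c(50, NA, 70, 100, 120, NA)
)

imputed <- df %>%
  group_by(zip_code) %>%
  # Replace each missing income with the median of the observed incomes
  # in the same zip code
  mutate(income = if_else(is.na(income), median(income, na.rm = TRUE), income)) %>%
  ungroup()
```

If every income in a zip code is missing, the group median is itself `NA` and those rows stay missing, so a fallback (for example the overall median) may be needed.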

Thanks, Max! When using step_impute_linear(), if I have a categorical variable and I want to do step_integer for the model and step_dummy for step_impute_linear, is there any way to do that?

I'm not sure. Can you tell us what you want to do by showing the recipe (without the imputation parts)?

Here is the recipe without imputation. division and super_department are categorical variables.

library(tidymodels)

preproc = recipe(
    cpc ~ .,
    data = lag_df) %>%
  step_integer(division, super_department) %>%
  step_normalize(recipes::all_predictors(), -division, -super_department) %>%
  step_zv(recipes::all_predictors())
# No prep() here: a workflow expects an untrained recipe and preps it when fit.

rf = rand_forest() %>%
  set_mode("regression") %>%
  set_engine("ranger", num.threads = 30, importance = "impurity")

wflow <- 
  workflow() %>% 
  add_recipe(preproc) %>%
  add_model(rf)

I would put the imputation parts first. However, based on your first message, I'm not sure what you are imputing.

Also:

If you don't mind me asking, what is the purpose of step_integer(division, super_department)? These look like unordered categorical data, and splitting on them as if they were numeric scales would be bad.

I have a few more features of type double to impute, which are not listed in the recipe.

step_integer(division, super_department)

is something I don't like, but I'm not sure how else to handle it. Those variables have a large number of distinct values, and I would easily run out of memory if I used step_dummy. I could probably use step_other.
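As a rough illustration of the step_other idea, the sketch below pools infrequent levels before creating dummy variables. The toy data and the 5% threshold are assumptions for illustration, not values from the real lag_df.

```r
library(recipes)

# Toy data (hypothetical): one frequent level and ten rare ones
set.seed(1)
toy <- data.frame(
  cpc = rnorm(100),
  division = factor(c(rep("a", 90), letters[2:11]))
)

rec <- recipe(cpc ~ ., data = toy) %>%
  # Pool levels seen in fewer than 5% of the training rows into "other"
  step_other(division, threshold = 0.05) %>%
  # Only the pooled levels are expanded into indicator columns
  step_dummy(division)

baked <- bake(prep(rec), new_data = NULL)
```

Here eleven levels collapse to two, so step_dummy produces a single indicator column instead of ten.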

I suggest using step_lencode_mixed() (from the embed package) to convert each of them to a single numeric feature.

Overall, I'd put the imputation steps first.
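Putting these suggestions together, one possible ordering is sketched below: imputation first, then the effect encoding, then the filtering and scaling steps. This is only a sketch under assumptions: the toy data and the column `x1` are hypothetical stand-ins for lag_df and its numeric features, and the embed and lme4 packages are assumed to be installed.

```r
library(recipes)
library(embed)  # provides step_lencode_mixed(); the fit relies on lme4

# Toy data (hypothetical stand-in for lag_df)
set.seed(123)
n <- 200
toy <- data.frame(
  division = factor(sample(letters[1:8], n, replace = TRUE)),
  x1 = rnorm(n)
)
toy$cpc <- as.integer(toy$division) + rnorm(n)  # outcome depends on division
toy$x1[sample(n, 20)] <- NA                     # add some missing values

rec <- recipe(cpc ~ ., data = toy) %>%
  # Imputation first: bagged trees that use division as the only predictor
  step_impute_bag(x1, impute_with = imp_vars(division)) %>%
  # Replace each factor with one numeric column of per-level effect estimates
  step_lencode_mixed(division, outcome = vars(cpc)) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_numeric_predictors())

baked <- bake(prep(rec), new_data = NULL)
```

Because each factor becomes a single numeric column, the memory cost no longer grows with the number of levels as it does with step_dummy.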
