step_other vs. step_novel vs. step_unknown

Can somebody explain to me what the difference between step_other, step_novel and step_unknown is? Are there specific situations where I should use one over the other?

In general, I love the tidymodels package and its tidy workflow for building ML models, but I feel a bit overwhelmed by the almost 100 different step functions in the recipes package, and I struggle to understand when to use which.

It depends on the model but, basically, the new level has no effect on the model fit, because no rows in the training data belong to it. It mostly helps avoid errors when predicting.

If you fit a linear regression via lm() on dummy-encoded predictors, the indicator column for the new level will be all zeros in the training data, so its coefficient will be NA. For trees, the level will get bundled into splits with other levels. For example, a factor with 5 known levels and one new level might be split as {a,c,d} vs {b,e,new}. In this situation, there's no way to tell in advance which side of the split the new level will end up on.
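To make the lm() case concrete, here is a minimal sketch in base R. The data are made up, and it assumes the factor has already been dummy-encoded (for example by step_dummy() after step_novel()), so the column names x_b, x_c and x_new are only illustrative. If you pass the factor itself to lm(), unused levels are dropped before fitting and no coefficient for the new level shows up at all.

```r
# Hypothetical dummy-encoded training data for a factor with levels
# a (baseline), b, c and the placeholder level "new".
# No training row belongs to "new", so x_new is all zeros.
train <- data.frame(
  y     = c(1.0, 1.2, 2.1, 1.9, 3.0, 3.2),
  x_b   = c(0, 0, 1, 1, 0, 0),
  x_c   = c(0, 0, 0, 0, 1, 1),
  x_new = c(0, 0, 0, 0, 0, 0)
)

fit <- lm(y ~ x_b + x_c + x_new, data = train)

# The all-zero column carries no information, so its coefficient is NA.
coef(fit)

# A test row that falls into the "new" level.
new_row <- data.frame(x_b = 0, x_c = 0, x_new = 1)

# predict() skips the NA coefficient, so this row is predicted like the
# baseline level. It may warn about a rank-deficient fit, which is
# silenced here only to keep the example quiet.
pred <- suppressWarnings(predict(fit, newdata = new_row))
pred
```

So the prediction goes through without an error, but the new level contributes nothing to it.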

There are some feature preprocessors that can estimate new levels (in the embed package) and those might be able to assign a value to the new level.

tl;dr

There is no effect on the model fit and perhaps not even on the predictions. step_novel() helps avoid errors.

Hi @noveld,

All three deal with handling factor levels in some way, so they definitely have a similar feel, but there are differences between them:

  • step_other is useful when you have some factor levels with very few observations and it makes sense to collapse some of these levels into a single new level called "other".

  • step_novel is useful when you may have factor levels that have not yet been seen in the data (i.e. not present in levels()). This step will take any new (previously unseen) factor levels and group them into a new factor level called "new". This can happen when there are factor levels in the testing data that were (for whatever reason) not present in the training data.

  • step_unknown is useful when you have missing data (NA) in your factor variable and rather than dropping this missing data (or some other approach), you simply set all missing factor data to a new factor level called "unknown".

So to summarise, step_other is for collapsing levels with few observations, step_unknown is for missing data, and step_novel is for previously unseen factor levels.
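A small sketch of the three steps side by side may help. The toy data frames and the 0.2 threshold below are made up for illustration, and each step gets its own recipe so that the effects do not interact.

```r
library(recipes)

# Hypothetical training data: "a" is common, "b" and "c" are rare,
# and one value is missing.
train <- data.frame(
  y = 1:10,
  x = factor(c("a", "a", "a", "a", "a", "a", "a", "b", "c", NA))
)

# Hypothetical test data: "d" never appears in the training data.
test <- data.frame(
  y = 1:3,
  x = factor(c("a", "d", NA))
)

# step_other: levels below the threshold ("b" and "c") collapse into "other".
other_rec <- prep(step_other(recipe(y ~ x, data = train), x, threshold = 0.2))
baked_other <- bake(other_rec, new_data = train)
levels(baked_other$x)

# step_novel: adds a "new" level; the unseen "d" is relabelled "new" at bake time.
novel_rec <- prep(step_novel(recipe(y ~ x, data = train), x))
baked_novel <- bake(novel_rec, new_data = test)
baked_novel$x

# step_unknown: missing values are relabelled "unknown".
unknown_rec <- prep(step_unknown(recipe(y ~ x, data = train), x))
baked_unknown <- bake(unknown_rec, new_data = train)
baked_unknown$x
```

Note that step_novel leaves the NA in the test data as NA; that case is what step_unknown is for.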

Hope this is helpful.

This is a great explanation. Thanks a lot!

@mattwarkentin, thanks for the great answer. I just have a quick follow-up question on step_novel(). Since it is designed specifically for new factor levels that appear in the testing data, how would a model fitted on the training data predict the data points that are assigned the new level? Thanks!