Caret - recursive feature elimination (with upsampling?)


#1

I'm not able to find the right answer in Applied Predictive Modeling or the caret documentation, but maybe you guys could help.

What would be the right way of doing RFE on a highly imbalanced classification problem so that the procedure can learn the patterns describing the minority class well? The problem I'm facing right now is that with a 5:95 target class ratio, the outcome of RFE is not really representative of the patterns I'm trying to discover. I couldn't find any way to use upsampling in the rfe function itself. Is there another way of doing that?


#2

Have you considered doing the upsampling before the RFE? Upsample the data, and then pass it to RFE?


#3

That wouldn't really make sense in the resampling context, since model performance would then be estimated on at least partly the same observations the model was built on, right? So the upsampling would need to happen within the resampling itself, if I'm not mistaken.


#4

Yes. This is shown pretty well on the caret page for subsampling.

All of the RFE methods in caret are based on function modules such as lmFuncs. You would have to make a copy of that and edit the fit part to run one of the sampling functions (like caret:::downSample) on the data just prior to the fit.
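As a rough sketch of that edit (assuming the stock rfFuncs random-forest module and caret's exported downSample helper; the object names here are illustrative, not from the thread):

```r
library(caret)
library(randomForest)

ds_funcs <- rfFuncs  # copy the stock random-forest RFE module

# Down-sample the majority class inside each resampled fit, so the
# class balance is restored just before the model is trained.
ds_funcs$fit <- function(x, y, first, last, ...) {
  down <- caret::downSample(x, y)  # returns the predictors plus a 'Class' column
  randomForest::randomForest(
    down[, names(down) != "Class", drop = FALSE],
    down$Class,
    importance = first | last,
    ...)
}

ctrl <- rfeControl(functions = ds_funcs, method = "cv", number = 5)
```

The key point is that the sampling happens inside `fit`, so the held-out resamples used for performance estimation keep their original class distribution.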


#5

@Max

I tried something like this:

rf_fit <- function(x, y, first, last, ...){
  # randomForest must be installed; select() comes from dplyr
  loadNamespace("randomForest")

  df_up <- caret::upSample(x, y)

  randomForest::randomForest(
    dplyr::select(df_up, -Class),
    df_up$Class,
    importance = (first | last),
    ...)
}

new_rf <- rfFuncs

new_rf$summary <- rf_stats
new_rf$fit <- rf_fit

but from the model performance point of view (sensitivity vs. specificity) I don't see any difference. Am I doing something wrong here?


#6

No, that looks fine to me. I've had more luck with down-sampling than up-sampling.

Also, since it is random forest, you can have the model internally down-sample the data for each tree.
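A minimal sketch of that internal down-sampling, using randomForest's strata/sampsize arguments (here x and y are placeholder predictor/outcome objects, with y a factor):

```r
library(randomForest)

# Size of the smallest class drives how many cases each tree sees per class.
n_min <- min(table(y))

fit <- randomForest::randomForest(
  x, y,
  strata = y,                          # stratify each bootstrap sample by class
  sampsize = rep(n_min, nlevels(y)))   # draw n_min cases from every class per tree
```

Each tree is then grown on a balanced bootstrap sample, without touching the data that the RFE resampling holds out for performance estimation.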


#7

I can confirm - it worked pretty well with down-sampling. Thanks again for your great help!