Choosing a filter method before modelling



This is a relatively beginner question. I am currently looking at filtering methods to apply to a data set before modelling, to find out which predictors are statistically significant. I wish to test a variable (variable A) against a two-class outcome, so basically a factor variable of "Yes" or "No". I know that if it were just variable A and my outcome, I could use a t-test to check whether the means differ, with a Bonferroni correction to guard against false positives. However, variable A is measured across 4 groups. My question is really about what to research for filter methods specifically:

  • Can I just use four separate t-tests, one per group, to see if the means differ with respect to my target/outcome?
  • Is there a more appropriate test? I have 17 more variables in the same boat, so I would like to do this correctly.

Thank you for your time


Four different tests is fine. You can either do the correction (although I would suggest an FDR correction instead) or just have a threshold for significance on the raw values (there's only four).
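To make the FDR suggestion concrete, here is a minimal sketch of the Benjamini-Hochberg adjustment in plain Python. The four raw p-values are made up for illustration, and the function name `benjamini_hochberg` is hypothetical; in R, `p.adjust(p, method = "BH")` performs the same adjustment.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values, one t-test per group (not from the thread).
raw_p = [0.003, 0.020, 0.041, 0.300]
adj_p = benjamini_hochberg(raw_p)

# Flag the groups that pass a 0.05 threshold after adjustment.
keep = [p <= 0.05 for p in adj_p]
print(adj_p, keep)
```

With these made-up numbers, the first two groups pass at 0.05 after the FDR adjustment, whereas a Bonferroni correction (multiply each p-value by 4) would keep only the first.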

Our new book discusses this in the context of interactions (see this section).

Another option is to do an ROC curve and calculate the area under the curve.
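As a rough sketch of the ROC idea, the area under the curve for a single predictor can be computed as the fraction of (Yes, No) pairs in which the "Yes" observation has the larger value, counting ties as half. The data and the function name `roc_auc` below are assumptions for illustration only.

```python
def roc_auc(values, labels, positive="Yes"):
    """AUC for one numeric predictor against a two-class outcome.

    Computed as the proportion of (positive, negative) pairs where the
    positive observation has the larger value; ties count as one half.
    """
    pos = [v for v, lab in zip(values, labels) if lab == positive]
    neg = [v for v, lab in zip(values, labels) if lab != positive]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical values of variable A and the matching outcome.
values = [2.1, 3.4, 3.0, 4.2, 2.8, 0.9]
labels = ["No", "Yes", "No", "Yes", "Yes", "No"]

auc = roc_auc(values, labels)
# Direction does not matter for filtering, so score by distance from 0.5.
score = max(auc, 1.0 - auc)
print(auc, score)
```

An AUC near 0.5 suggests the predictor does not separate the classes on its own; values near 0 or 1 suggest it does.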

The most important thing is to make sure that you do this inside of resampling, so that you know you are not overfitting to the features by reusing the same data for feature selection and model fitting. This is discussed in a few places in the book and will come up again in the as-yet-unwritten chapters. Slightly altered money quote:

Resampling was used to [compare feature sets] for different versions of the training set. For each resample, the assessment set is used to evaluate the objective function. In the traditional approach, the same data are used to [evaluate the features] and to evaluate the model (which is understandably problematic for some models). If the model overfits, computing the objective function from a separate data set allows a potential dissenting opinion that will reflect the overfitting.

I'll start writing a new package to do this (hopefully by the end of the year) but, for now, caret::sbf (for Selection By Filter) can do this for you.

Yes. More variables is not a deal-breaker.


Hi @Max

Thanks once again for taking the time to answer my questions. Just to clarify for myself:
Assuming I have a training set and a test set, I would do the following within the 10-fold resamples of the training set:

  • I would apply the t-test in each resample for each variable I'm interested in, in relation to the target variable (the "Yes"/"No" factor variable above).

  • I would then get the test statistic and look at the p-value for that variable within that resample. Let's say I have 10 resamples; I would get ten test statistics per attribute. Is there a way to interpret them as a whole, or would I have to use something like the volcano plot that you highlight in your book, Applied Predictive Modeling (I assume you are one of the authors)? I would do this for my 20-odd variables.

  • Assuming I use the filter techniques as a method of exploration, without using them to dictate which variables I add to the model, is your recommendation that I cannot then train my model (for example MARS or random forest) on the same set of resamples?



The other parts look good to me but I wasn't sure what you meant here.

Suppose that you have 13 predictors and you do 10-fold CV.

Within each fold, you compute the t-test on the 90% of the data used for modeling. Based on their p-values (and your significance threshold), you filter the variables down and fit the model using the predictors that survive the filter. Each fold (potentially) has different predictors that go into the model.

Within each fold, you evaluate the model on the 10% held out. Average the 10 performance estimates to get the resampling estimate.

If you like the performance, you do the same process to the entire training set (t-test -> threshold -> model) and the resampled performance from 10-fold CV is the estimate for this final model.
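The procedure above can be sketched in plain Python. Everything here is an assumption made for illustration: the simulated data (13 predictors, only the first one informative), a cutoff on the absolute Welch t statistic standing in for a p-value threshold, and a toy nearest-centroid classifier standing in for the real model. It is not how `caret::sbf` is implemented; it only shows the order of operations.

```python
import random
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic for two samples."""
    se = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
    return (mean(a) - mean(b)) / se if se > 0 else 0.0


def select_features(X, y, t_cut):
    """Filter step: keep columns whose |t| between classes exceeds t_cut."""
    keep = []
    for j in range(len(X[0])):
        yes = [row[j] for row, lab in zip(X, y) if lab == "Yes"]
        no = [row[j] for row, lab in zip(X, y) if lab == "No"]
        if abs(welch_t(yes, no)) > t_cut:
            keep.append(j)
    return keep


def fit_centroids(X, y, cols):
    """Toy model: per-class means on the selected columns."""
    return {lab: [mean(r[j] for r, l in zip(X, y) if l == lab) for j in cols]
            for lab in ("Yes", "No")}


def predict(centroids, row, cols):
    """Predict the class whose centroid is closest.

    If no column survived the filter, every distance is 0 and this
    toy model defaults to "Yes".
    """
    def dist(lab):
        return sum((row[j] - c) ** 2 for j, c in zip(cols, centroids[lab]))
    return min(centroids, key=dist)


def resampled_accuracy(X, y, k=10, t_cut=2.0, seed=1):
    """k-fold CV with the filter re-run inside every fold."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        X_tr = [X[i] for i in train]
        y_tr = [y[i] for i in train]
        # Filter and fit using only the ~90% analysis portion.
        cols = select_features(X_tr, y_tr, t_cut)
        model = fit_centroids(X_tr, y_tr, cols)
        # Evaluate on the ~10% that played no part in the filtering.
        scores.append(mean(1.0 if predict(model, X[i], cols) == y[i] else 0.0
                           for i in held_out))
    return mean(scores)


# Simulated training set: 100 rows, 13 predictors, only column 0 informative.
rng = random.Random(42)
y = ["Yes" if i % 2 == 0 else "No" for i in range(100)]
X = [[rng.gauss(2.0 if lab == "Yes" and j == 0 else 0.0, 1.0)
      for j in range(13)] for lab in y]

cv_acc = resampled_accuracy(X, y)

# Final model: same process (filter -> threshold -> model) on all training data.
final_cols = select_features(X, y, t_cut=2.0)
final_model = fit_centroids(X, y, final_cols)
print(cv_acc, final_cols)
```

The key design point is that `select_features` is called inside the loop, on the analysis portion only, so the held-out rows never influence which predictors are kept. `cv_acc` is then the performance estimate reported for `final_model`.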