Four different tests are fine. You can either do the correction (although I would suggest an FDR correction instead) or just apply a significance threshold to the raw p-values (there are only four).
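As a minimal sketch of the FDR route, base R's `p.adjust()` will do the Benjamini-Hochberg adjustment (the four p-values here are made up for illustration):

```r
# Hypothetical p-values from the four filter tests
p_vals <- c(0.001, 0.04, 0.03, 0.20)

# FDR (Benjamini-Hochberg) adjustment; compare the results to your alpha
p.adjust(p_vals, method = "fdr")
```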
Our new book discusses this in the context of interactions (see this section).
Another option is to do an ROC curve and calculate the area under the curve.
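For the ROC option, one way (among several) is the pROC package; the `obs` and `score` objects below are invented stand-ins for your observed classes and the numeric score being screened:

```r
library(pROC)

# Hypothetical data: observed class and a numeric predictor/score
obs   <- factor(c("yes", "yes", "no", "no", "yes", "no"))
score <- c(0.9, 0.8, 0.4, 0.3, 0.6, 0.7)

# Build the ROC curve and get the area under it
roc_obj <- roc(obs, score, levels = c("no", "yes"), direction = "<")
auc(roc_obj)
```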
The most important thing is to make sure that you do this inside of resampling so that you know you are not overfitting to the features by reusing the same data for feature selection and model fitting. This is discussed in a few places in the book and will come up again in the as-yet-unwritten chapters. Slightly altered money quote:
Resampling was used to [compare feature sets] for different versions of the training set. For each resample, the assessment set is used to evaluate the objective function. In the traditional approach, the same data are used to [evaluate the features] and to evaluate the model (which is understandably problematic for some models). If the model overfits, computing the objective function from a separate data set allows a potential dissenting opinion that will reflect the overfitting.
I'll start writing a new package to do this (hopefully by the end of the year) but, for now, caret::sbf (for Selection By Filter) can do this for you.
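A rough sketch of `caret::sbf`, using the BloodBrain data that ships with caret (the control settings here are just one reasonable choice, not a recommendation):

```r
library(caret)
data(BloodBrain)  # provides bbbDescr (predictors) and logBBB (outcome)

set.seed(1)
# Filter + random forest, with the filtering done inside 10-fold CV
ctrl <- sbfControl(functions = rfSBF, method = "cv", number = 10)
fit  <- sbf(bbbDescr, logBBB, sbfControl = ctrl)

fit$optVariables  # predictors that survived the filter
```

Because the filter is applied within each resample, the external performance estimates reflect the whole selection-plus-fitting process, which is exactly the point of the quote above.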
Yes. More variables is not a deal-breaker.