What to include in resamples

Hi,

I understand that when building a model you estimate the model fit based on the average of the fit statistics across all resamples.
So, for example, if I measure the accuracy of a classifier with 10-fold cross-validation, I would get the accuracy on each assessment (hold-out) set, producing 10 accuracy metrics. To determine my overall accuracy I would take the mean of these, which would give me a good assessment of how my model is doing. Once happy, I could then fit the model to the entire training data set. I have two questions I was hoping to get help with:
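In code terms, what I have in mind is something like this (caret, rpart, and the iris data are just illustrative choices on my part):

```r
# 10-fold cross-validation: one accuracy per assessment set, then the average.
library(caret)

set.seed(123)
ctrl <- trainControl(method = "cv", number = 10)
fit  <- train(Species ~ ., data = iris,
              method = "rpart", trControl = ctrl)

fit$resample                  # accuracy (and kappa) on each of the 10 assessment sets
mean(fit$resample$Accuracy)   # the averaged resampling estimate of accuracy
fit$finalModel                # caret then refits this model on the full training set
```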

Can this be extended to other areas? For example, an ROC curve, where you plot the true positive rate against the false positive rate. Is it good practice to do this for each resample, or should you just use the final model's prediction probabilities? With the resampled method there would be ten ROC curves, but with the single training model just the one.

Could this be used for variable importance? I know that tree ensembles like GBMs count how often a variable was split on and average that across the ensemble. For a simpler model like a single decision tree, which has high variance and tends to over-fit, does it make sense to compute the variable importance within each resample and then average them, using something like caret::varImp? So if I were to create ten data frames of variable importance, one per fold, and then take the overall average, is this preferable to taking the final fitted model and estimating variable importance from that?
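Something along these lines is what I mean, again with rpart and iris purely as placeholders:

```r
# Variable importance computed within each resample, then averaged across folds.
library(caret)
library(rpart)

set.seed(123)
predictors <- setdiff(names(iris), "Species")
folds <- createFolds(iris$Species, k = 10)         # indices of the 10 assessment sets

# One importance vector per resample, aligned to a fixed predictor order
imp_per_fold <- sapply(folds, function(idx) {
  fit <- rpart(Species ~ ., data = iris[-idx, ])   # fit on the analysis set only
  varImp(fit)[predictors, "Overall"]               # caret::varImp on the single tree
})
rownames(imp_per_fold) <- predictors

sort(rowMeans(imp_per_fold), decreasing = TRUE)    # importance averaged over resamples
```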

Thank you for your time


Hi, John, and welcome to the RStudio Community!

I've been trying to muster the time needed to write a better answer, but alas, without success. Following Voltaire's maxim that the best is the enemy of the good, here is a sub-optimal answer :grin: I'll leave decision trees and GBMs to someone else, since I don't know about the use of cross-validation to estimate variable importance for these models. For decision trees, you may have a look here: it's actually about the generalization error (see below), but it might contain useful references (or you may ask the author).

Re: ROC curves. What would you like to learn from the 10 different ROC curves? If you're trying to compute confidence intervals for the ROC curve, then using the 10 ROC curves is not a rigorous approach (i.e., there's no proof they'll converge to the right confidence intervals). You can at most consider them an estimate of how sensitive the classifier is to the training set. See for example here for some Python code: you could of course do the same in R (a rough R sketch follows the two caveats below). Two caveats:

  • you can only do this if the classifier outputs probabilities, rather than simply classes (but I'm sure you know this already)
  • as the validation sets get smaller, this procedure becomes more and more dubious. For example, for leave-one-out cross-validation, each of the N ROC curves would be computed based on a single test point, so the variance of the estimator would be pretty large.
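Here is a rough R version of that idea, using pROC; the simulated data and the logistic model are my own illustrative assumptions, not anything taken from the Python example:

```r
# One ROC curve per resample, to see how sensitive the classifier is to the training set.
library(caret)
library(pROC)

set.seed(123)
n   <- 500
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- factor(ifelse(plogis(dat$x1 - dat$x2) > runif(n), "yes", "no"))

folds <- createFolds(dat$y, k = 10)

roc_per_fold <- lapply(folds, function(idx) {
  fit   <- glm(y ~ x1 + x2, data = dat[-idx, ], family = binomial)
  probs <- predict(fit, newdata = dat[idx, ], type = "response")  # needs probabilities
  roc(response = dat$y[idx], predictor = probs, levels = c("no", "yes"))
})

sapply(roc_per_fold, auc)                           # spread of the 10 AUCs
plot(roc_per_fold[[1]])                             # overlay the 10 curves
for (r in roc_per_fold[-1]) lines(r, col = "grey")
```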

In general, cross-validation is introduced because the training error, i.e., the error of the learned model on the training set:

$$\overline{\mathrm{err}} = \frac{1}{N} \sum_{i=1}^{N} L\big(y_i, \hat{f}(x_i)\big)$$

is usually an optimistic estimate of the test error, also called generalization error or out-of-sample error:

$$\mathrm{Err} = \mathrm{E}_{\mathcal{T}} \, \mathrm{E}_{X^0, Y^0} \big[ L\big(Y^0, \hat{f}(X^0)\big) \mid \mathcal{T} \big]$$

i.e., the average, over all the models we could learn (in our hypothesis class), of the average error over all possible test sets¹ (here $\mathcal{T}$ is the training set, $\hat{f}$ the model learned from it, and $(X^0, Y^0)$ a new test point drawn from the same distribution). When we say that it's an optimistic estimate, we mean that the difference

$$\mathrm{Err} - \overline{\mathrm{err}},$$

called the optimism, is often larger than 0.
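To make the optimism concrete, here is a small simulation; the overgrown rpart tree and the data-generating process are just illustrative assumptions of mine:

```r
# Training error vs. error on fresh data: the gap is the optimism.
library(rpart)

set.seed(123)
make_data <- function(n) {
  x <- matrix(rnorm(n * 5), ncol = 5)
  data.frame(x, y = factor(ifelse(x[, 1] + rnorm(n) > 0, "yes", "no")))
}

train_set <- make_data(200)
test_set  <- make_data(10000)   # a stand-in for "all possible test sets"

# A deliberately overgrown tree, so the effect is easy to see
fit <- rpart(y ~ ., data = train_set,
             control = rpart.control(cp = 0, minsplit = 2))

err <- function(d) mean(predict(fit, d, type = "class") != d$y)
err(train_set)   # training error: close to zero
err(test_set)    # test error: clearly larger, so the optimism is positive
```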

A better estimate of the test error is the K-fold cross-validation error:

$$\mathrm{CV}_K = \frac{1}{K} \sum_{k=1}^{K} \frac{1}{m} \sum_{i \in F_k} L\big(y_i, \hat{f}^{-k}(x_i)\big)$$

where $F_k$ is the $k$-th fold (of size $m$) and $\hat{f}^{-k}$ is the model fit with $F_k$ removed from the data.

The notation is a bit complicated, but the concept is simple (a hand-rolled R version follows the list):

  • split our dataset T, of size N, into K parts of size m
  • fit the model K times to training sets of size N - m, each obtained by removing one split of size m from T, and each time compute the error on the hold-out set
  • average the K errors together
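Here is the hand-rolled version of those three steps; the model and the simulated data are illustrative choices:

```r
# K-fold cross-validation error, written out by hand for K = 10.
library(rpart)

set.seed(123)
x   <- matrix(rnorm(200 * 5), ncol = 5)
dat <- data.frame(x, y = factor(ifelse(x[, 1] + rnorm(200) > 0, "yes", "no")))

K <- 10
N <- nrow(dat)
fold_id <- sample(rep(1:K, length.out = N))      # split T into K parts of size m = N/K

fold_errors <- sapply(1:K, function(k) {
  fit  <- rpart(y ~ ., data = dat[fold_id != k, ])           # fit on the N - m points
  pred <- predict(fit, dat[fold_id == k, ], type = "class")  # predict the held-out part
  mean(pred != dat$y[fold_id == k])                          # error on the hold-out set
})

mean(fold_errors)   # the K-fold cross-validation error
```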

Thus, the cross-validation error is usually used as an estimate of the test error. Starting from there, you can of course also use it as a heuristic for all sorts of statistical decisions, such as model selection. But it's just that: a heuristic. You're not sure it will always work. For example, for model selection it's well known that, if the true model belongs to the class of models your learning algorithm can fit, then neither LOOCV nor leave-K-out cross-validation (a form of cross-validation where you test on all possible folds of size K) is consistent, i.e., they won't select the true model as N goes to infinity (though leave-K-out cross-validation becomes consistent if K increases sufficiently fast with N).

You may argue that consistency is not necessarily an interesting or desirable property, because in many practical cases we don't know whether the true model belongs to our hypothesis space or not. That's true. But in many practical cases we don't even know how much less optimistic the cross-validation error is with respect to the test error. For a review of what theory actually tells us about cross-validation, see


¹ We would actually be more interested in estimating the conditional test error, i.e., the average error of our learned model over all possible test sets

$$\mathrm{Err}_{\mathcal{T}} = \mathrm{E}_{X^0, Y^0} \big[ L\big(Y^0, \hat{f}(X^0)\big) \mid \mathcal{T} \big],$$

but I don't think efficient estimates of this error are currently known.


Hey, I just found a very simple explanation of why variable importance for random forests is often biased, and of how to fix it using permutations. Random forests are not GBMs, and permutation is not exactly resampling, though it can be seen as a particular type of resampling (specifically, it's sampling from the training set without replacement, with a sample size equal to that of the training set). But it's still close enough that it could solve your problem! Here it is:
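In the meantime, here is a rough hand-rolled sketch of the permutation idea on a hold-out set, just to make it concrete; randomForest and iris are my own illustrative choices, and this is not necessarily the exact procedure from the linked explanation:

```r
# Permutation importance by hand: permute one predictor at a time in a hold-out set
# and measure the drop in accuracy. A bigger drop means a more important variable.
library(randomForest)

set.seed(123)
in_train  <- sample(nrow(iris), 100)
train_set <- iris[in_train, ]
valid_set <- iris[-in_train, ]

rf <- randomForest(Species ~ ., data = train_set)
base_acc <- mean(predict(rf, valid_set) == valid_set$Species)

perm_importance <- sapply(setdiff(names(iris), "Species"), function(v) {
  permuted <- valid_set
  permuted[[v]] <- sample(permuted[[v]])    # break the link between v and the outcome
  base_acc - mean(predict(rf, permuted) == permuted$Species)
})

sort(perm_importance, decreasing = TRUE)
```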


Hi @Andrea

Thank you very much for going to all this effort. Genuinely, this is amazing. I need to go through the papers you linked to. I will definitely try out your second post on a dataset and post the R code here for comparison.

Thanks
John
