Dear fellows of the tidymodels community,
I have developed a method for tuning machine learning models when the number of predictors is large relative to the number of observations. This is typical in many areas of study where fieldwork or experiments are required to collect "ground truth" data to train the model.
The method is termed Naïve Overfitting Index Selection (NOIS). Its advantages are that it:
(1) provides an efficient and structured method to tune ML models over an unknown range of appropriate tuning parameters, by gradually increasing model complexity for non-OLS regressions;
(2) determines the maximum level of model complexity supported by a specific data structure without overfitting;
(3) quantifies the relative amount of overfitting across regression techniques consistently, highlighting the trade-off between prediction accuracy and overfitting; and
(4) quantifies overfitting based on a single error estimation, using all available observations to select the best model.
More details can be found in the paper: https://doi.org/10.1016/j.isprsjprs.2017.09.012
In the paper, the performance of models derived from this tuning method is compared with that of methods based on robust cross-validation, and tested using the caret package on different datasets and regression techniques. The idea is to generate a set of predictors with the same mean, standard deviation and covariance as the original dataset, but uncorrelated with the response variable. The model is then tuned on this mock data, which has the same structure (n and p) as the original set, with model complexity increased until the model starts to overfit. That point gives the maximum model complexity that this data structure supports for the given ML technique.
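To make the mock-data step more concrete, here is a minimal sketch in R. It is only an illustration under stated assumptions, not the implementation from the paper: the helper name `make_mock_predictors` is hypothetical, the predictors are drawn from a multivariate normal distribution with `MASS::mvrnorm()`, the mean and covariance are matched approximately (not exactly), and "uncorrelated with the response" holds only in expectation because the draws are made without looking at y.

```r
library(MASS)  # provides mvrnorm()

# Hypothetical helper (not part of caret or tidymodels): simulate mock
# predictors that mimic the column means and covariance of X but are drawn
# independently of the response.
make_mock_predictors <- function(X, seed = NULL) {
  X <- as.matrix(X)
  if (!is.null(seed)) set.seed(seed)
  # Assumption: a multivariate normal is an adequate stand-in for the
  # distribution of the original predictors.
  mock <- MASS::mvrnorm(n = nrow(X), mu = colMeans(X), Sigma = stats::cov(X))
  colnames(mock) <- colnames(X)
  mock
}

# Toy data for illustration only (simulated, not from the paper).
set.seed(1)
X <- matrix(rnorm(30 * 5), nrow = 30,
            dimnames = list(NULL, paste0("x", 1:5)))
y <- rnorm(30)

mock <- make_mock_predictors(X, seed = 42)
dim(mock)                # same n and p as X
max(abs(cor(mock, y)))   # any association with y here is due to chance
```

Because the mock predictors carry no real signal about y, any apparent fit a model achieves on them as its complexity increases can be read as overfitting, which is what the tuning step above exploits.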
If the community is interested in the method, I can explain it further and later help to develop a function for use in the tidymodels packages (tune, recipes and parsnip).