Model selection and hyperparameter optimization with a fixed 'reference' training dataset

Hello all,

I'm rather stuck trying to train ML algorithms using tidymodels. From what I gather, tune and tidymodels are fully capable of training models through cross-validation, where in essence all samples are used for both training and validation. I struggle to see how to use tune for my problem, which is slightly different.

I have a small set of high-fidelity 'reference' samples, from which I'd like to predict a small number of continuous variables for a much larger set of lower-confidence 'query' samples, a subset of which is labelled. I'd like to use the labelled subset of query samples to find the optimal regression algorithm (e.g. KNN regression) and optimize its hyperparameters. Nested cross-validation is the obvious solution for this.

Can I use the functionality in tune/tidymodels out of the box to force training to happen on the reference samples only and to assess model performance on the labelled set of query samples? The one thing I can imagine doing is running nested_cv on the labelled query data and then surgically editing the result afterwards to include the reference samples, offsetting the indices such that each (nested) fold is trained strictly on the reference samples and validated on the query samples. This seems rather hackish though. I feel like I'm missing an obvious solution. Is there one? Thanks!

You should be able to do this as a validation set (usually made via validation_split()), but build the split object manually with manual_rset(). See the example and that should get you there.
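
For illustration, here is a minimal sketch of that suggestion using rsample's make_splits() and manual_rset(). The data frames `reference` and `query_labelled`, their columns, and the id string are made-up stand-ins, not the original poster's data.

```r
library(rsample)

set.seed(1)
# Hypothetical stand-ins for the real data (assumed to share the same columns)
reference      <- data.frame(x = rnorm(20), y = rnorm(20))
query_labelled <- data.frame(x = rnorm(50), y = rnorm(50))

# One split: analysis set = reference samples, assessment set = labelled query samples
split <- make_splits(reference, assessment = query_labelled)

# Wrap the single split in an rset so it can be used like any other resampling object
resamples <- manual_rset(list(split), ids = "reference_vs_query")

nrow(analysis(resamples$splits[[1]]))    # rows used for fitting
nrow(assessment(resamples$splits[[1]]))  # rows used for performance estimates
```

The resulting object should then be usable as the `resamples` argument of tune_grid(), so every candidate model is fitted on the reference samples and scored on the labelled query samples.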

Thanks Max! I ended up running nested_cv and then editing the splits and inner_resamples columns using manual_rset.
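
A rough sketch of what such a nested structure could look like is below. It is not the exact code used in this thread: instead of editing the output of nested_cv(), it builds equivalent `splits` and `inner_resamples` columns from vfold_cv(). The data, the fold counts, and the helper `reference_split()` are hypothetical.

```r
library(rsample)
library(tibble)

set.seed(2)
# Hypothetical stand-ins for the real data
reference      <- data.frame(x = rnorm(20), y = rnorm(20))
query_labelled <- data.frame(x = rnorm(50), y = rnorm(50))

# Hypothetical helper: a split that always trains on the reference samples
# and assesses on whichever query rows it is given
reference_split <- function(assessment_rows) {
  make_splits(reference, assessment = assessment_rows)
}

# Outer folds are drawn from the labelled query samples only
outer <- vfold_cv(query_labelled, v = 5)

nested <- tibble(
  id = outer$id,
  # Outer splits: fit on reference, assess on the held-out query fold
  splits = lapply(outer$splits, function(s) reference_split(assessment(s))),
  # Inner resamples: fit on reference, assess on folds of the remaining query rows
  inner_resamples = lapply(outer$splits, function(s) {
    inner <- vfold_cv(analysis(s), v = 4)
    manual_rset(
      lapply(inner$splits, function(si) reference_split(assessment(si))),
      ids = inner$id
    )
  })
)
```

Because the analysis set is the same reference data in every split, the folds differ only in which query samples are used for assessment.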

Now onto the next stage: actually fitting the models to the resampled data. I was under the impression that I would get nested CV 'for free' once my data was in the rsample/tune format, but I now understand that's not the case. I had already seen this tutorial, but didn't understand the need for the 'manual' resample evaluation code. Perhaps include a sentence in there stating that tune currently does not support nested resampling?
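
To make the 'manual' evaluation concrete, here is a simplified sketch of the loop for this reference/query setup. Everything in it is an assumption for illustration: the simulated data, the base-R helper `knn_predict()` (a stand-in for a real parsnip model), the `k_grid` values, and the choice to use one inner validation set per outer fold rather than inner folds, which is workable here only because the training set is fixed.

```r
library(rsample)

set.seed(3)
# Hypothetical stand-in data: query labels are noisier than reference labels
reference <- data.frame(x = runif(40, 0, 10))
reference$y <- sin(reference$x) + rnorm(40, sd = 0.1)
query_labelled <- data.frame(x = runif(100, 0, 10))
query_labelled$y <- sin(query_labelled$x) + rnorm(100, sd = 0.3)

# Hypothetical helper: plain KNN regression on a single predictor
knn_predict <- function(train, new, k) {
  vapply(new$x, function(x0) {
    nearest <- order(abs(train$x - x0))[seq_len(k)]
    mean(train$y[nearest])
  }, numeric(1))
}

rmse <- function(truth, estimate) sqrt(mean((truth - estimate)^2))

k_grid <- c(1, 3, 5, 9, 15)               # candidate hyperparameter values
outer  <- vfold_cv(query_labelled, v = 5) # outer folds over labelled query data

results <- do.call(rbind, lapply(seq_along(outer$splits), function(i) {
  s <- outer$splits[[i]]
  tuning_set <- analysis(s)    # query samples used to choose k
  test_set   <- assessment(s)  # query samples held out from tuning

  # Inner step: every candidate is trained on the reference samples only
  tuning_rmse <- vapply(k_grid, function(k) {
    rmse(tuning_set$y, knn_predict(reference, tuning_set, k))
  }, numeric(1))
  best_k <- k_grid[which.min(tuning_rmse)]

  # Outer step: score the selected k on query samples not used for tuning
  data.frame(
    id     = outer$id[i],
    best_k = best_k,
    rmse   = rmse(test_set$y, knn_predict(reference, test_set, best_k))
  )
}))

results
```

The `rmse` column gives one performance estimate per outer fold, each computed on query samples that played no part in choosing `k` for that fold.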
