Hello all,
I'm rather stuck trying to train ML algorithms using tidymodels. From what I gather, tune and tidymodels are fully capable of training models through cross-validation, where in essence all samples are used for both training and validation. I struggle to see how to use tune for my problem, which is slightly different. I have a small set of high-fidelity 'reference' samples, from which I'd like to predict a small number of continuous variables for a much larger set of lower-confidence 'query' samples, a subset of which is labelled. I'd like to use the labelled subset of query samples to find the optimal regression algorithm (e.g. KNN regression) and optimize its hyperparameters. Nested cross-validation is the obvious solution for this. Can I use the functionality in tune/tidymodels
out of the box to force training to happen on the reference samples only and assess model performance on the labelled query samples? The one workaround I can imagine is running nested_cv on the labelled query data and then surgically editing the result afterwards: appending the reference samples and offsetting the indices so that each (nested) fold is trained strictly on the reference samples and validated on the query samples (rough sketch at the bottom of this post). This seems rather hackish, though. I feel like I'm missing an obvious solution. Is there one? Thanks!
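For concreteness, here is roughly the kind of manual split construction I have in mind. This is only a sketch, not something I've verified end to end: reference, query_labelled, and the outcome y are placeholder names, and rather than literally editing a nested_cv object it builds the resamples by hand with rsample's make_splits() and manual_rset() so that every analysis set is the reference rows and every assessment set is a fold of the labelled query rows.

```r
library(tidymodels)

# reference and query_labelled are placeholders for my two data sets;
# both hold the same predictors plus the continuous outcome y.
combined <- bind_rows(
  reference      %>% mutate(.source = "reference"),
  query_labelled %>% mutate(.source = "query")
)

ref_idx   <- which(combined$.source == "reference")
query_idx <- which(combined$.source == "query")
combined  <- select(combined, -.source)

# split the labelled query rows into k folds; each fold becomes an
# assessment set, while the analysis set is always the reference rows
set.seed(1)
k     <- 5
folds <- split(sample(query_idx), rep_len(seq_len(k), length(query_idx)))

splits <- purrr::map(folds, function(assess_rows) {
  rsample::make_splits(
    list(analysis = ref_idx, assessment = assess_rows),
    data = combined
  )
})

custom_rset <- rsample::manual_rset(unname(splits), ids = paste0("Fold", seq_len(k)))

# the hand-built rset can then (I hope) be passed to tune_grid() as usual
knn_spec <- nearest_neighbor(mode = "regression", neighbors = tune()) %>%
  set_engine("kknn")

knn_res <- tune_grid(
  knn_spec,
  y ~ .,
  resamples = custom_rset,
  grid      = 10
)

show_best(knn_res, metric = "rmse")
```

Is something along these lines the intended way to do it, or is there a cleaner built-in mechanism I'm overlooking?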