I am currently using tidymodels to develop a series of classifiers in a clinical setting.
I am trying to learn features from a small set of patient-derived samples (~150, only 20 of them in the positive class), and I have run into the following problem.
For a number of patients, more than one sample has been retrieved (at most two per patient). In those cases, I feel that having one sample in the training set and the other in the validation set (the same goes for v-fold CV) is formally incorrect: the model could predict the correct label through similarity to the paired sample rather than by generalizing over the class. For this reason, I would like to keep all samples from the same patient in the same fold. That way, I would avoid leaking patient-specific information that would artificially inflate the model's performance. However, I don't think this functionality is available in the package, and I'm having a hard time translating the idea into code.
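To make the idea concrete, here is a minimal base-R sketch of what I am after. The toy data and the column names (`sample_id`, `patient_id`, `label`) are made up for illustration and are not my real dataset. It also ignores stratification by class, which I would need as well given how few positives I have.

```r
# Toy data: hypothetical column names and values, for illustration only
set.seed(1)
samples <- data.frame(
  sample_id  = 1:10,
  patient_id = c("p1", "p1", "p2", "p3", "p3", "p4", "p5", "p6", "p7", "p8"),
  label      = c(1, 1, 0, 0, 0, 1, 0, 0, 0, 0)
)

v <- 4  # number of folds

# Assign folds at the patient level rather than the sample level
patients <- unique(samples$patient_id)
patient_fold <- sample(rep(seq_len(v), length.out = length(patients)))
names(patient_fold) <- patients

# Map each sample to the fold of its patient
samples$fold <- unname(patient_fold[samples$patient_id])

# Sanity check: every patient ends up in exactly one fold
folds_per_patient <- tapply(
  samples$fold, samples$patient_id,
  function(f) length(unique(f))
)
stopifnot(all(folds_per_patient == 1))
```

This gives me a fold index per sample, but I don't see how to turn it into resampling objects that the rest of the tidymodels workflow would accept.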
TLDR: is there any straightforward way to specify that some samples should not be split and should be treated as a single entity, or alternatively, to only keep folds that respect this criterion?
Do you have any suggestions?
Thank you for any input.