How to avoid splitting specific pairs of samples with rsample (tidymodels)

Dear all,

I am currently using tidymodels to develop a series of classifiers in a clinical setting.
As I am trying to learn features from a small set of patient-derived samples (~150, with only 20 of the positive class), I am facing the following problem.

I have a number of cases in which more than one sample was retrieved from the same patient (at most a pair). In those cases, I feel that having one sample in the training set and the other in the validation set (the same goes for v-fold CV) is formally incorrect: the model could predict the correct label by "similarity" to the paired sample rather than by generalizing over the class. For this reason, I would like to keep multiple samples from the same patient in the same fold, so that spurious patient-specific information does not inflate the measured performance of the model. However, I don't think this functionality is available in the package, and I'm having a hard time translating this idea into code.

TLDR: is there any straightforward way to specify that some samples should not be split and should be treated as a single entity, or alternatively, to only keep folds that respect this criterion?

Do you have any suggestions?
Thank you for any input,
Best

I guess what you are looking for is group_vfold_cv(). This is quoted from its help page:

Group V-fold cross-validation creates splits of the data based on some grouping variable (which may have more than a single row associated with it). The function can create as many splits as there are unique values of the grouping variable or it can create a smaller set of splits where more than one value is left out at a time.

This can be used for cross-validation resamples. For the initial train/test split, I don't think it is implemented yet. There is an open GitHub issue for it.
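To illustrate the grouping behaviour described in the help page, here is a minimal sketch on made-up data. The data frame `toy` and its `patient_id` column are hypothetical names chosen for the example; only `group_vfold_cv()`, `analysis()` and `assessment()` come from rsample. It builds grouped folds and then counts, for each fold, how many patients appear on both sides of the split.

```r
library(rsample)

# Hypothetical toy data: 10 patients, the first 4 contributing two samples each
set.seed(1)
toy <- data.frame(
  patient_id = c(rep(1:4, each = 2), 5:10),
  x = rnorm(14)
)

# Grouped 5-fold CV: all rows of a patient go to the same fold
folds <- group_vfold_cv(toy, group = patient_id, v = 5)

# For every fold, count the patients present in both analysis and assessment sets
shared <- vapply(folds$splits, function(s) {
  length(intersect(analysis(s)$patient_id, assessment(s)$patient_id))
}, integer(1))

shared
```

If the grouping works as documented, every entry of `shared` should be zero, which is a quick sanity check worth running on real data as well.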

I came across a similar problem to yours. I used the code below to re-engineer the initial split. SiteID is my grouping variable: all rows that share a SiteID value should end up on the same side of the split.

# Train Test Split --------------------------------------------------------------------
library(rsample)

set.seed(123)
train_test_split <- initial_split(data)

# re-engineer train_test_split so that each site falls entirely in train or test
# (assumes data$SiteID is a factor)
split_prop <- 0.7
set.seed(125)
n_sites <- nlevels(data$SiteID)
train_sites <- sample.int(n_sites, floor(split_prop * n_sites))
# in_id must hold the integer row positions of the training rows
train_test_split$in_id <- which(as.integer(data$SiteID) %in% train_sites)
data_train <- training(train_test_split)
data_test <- testing(train_test_split)

Then I used the training set for the cross-validation resamples:

# resamples
set.seed(123)
resamples <- group_vfold_cv(data_train, group = "SiteID", v = 8)
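
As a possible alternative to overwriting in_id by hand: as far as I know, newer rsample releases (1.1.0 and later) include group_initial_split(), which does the grouped train/test split directly. The sketch below assumes such a version is installed; the data frame `toy` is made up for the example.

```r
library(rsample)

# Hypothetical data: 20 sites with 5 rows each
set.seed(123)
toy <- data.frame(
  SiteID = factor(rep(paste0("site", 1:20), each = 5)),
  y = rnorm(100)
)

# Assumes rsample >= 1.1.0, where group_initial_split() is available;
# prop is the approximate proportion of data assigned to training
split <- group_initial_split(toy, group = SiteID, prop = 0.7)
toy_train <- training(split)
toy_test <- testing(split)

# Sites present in both sets; this is expected to be empty
overlap <- intersect(as.character(toy_train$SiteID),
                     as.character(toy_test$SiteID))
length(overlap)
```

The same overlap check can be applied to data_train and data_test from the manual approach above to confirm that no site ended up on both sides.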

This worked perfectly! From the manual, I didn't realize that the group argument would be used to keep samples together; it sounded similar to "strata" in vfold_cv to me. Thank you for your kind help!
