Preparing data for 10-fold cross validation (need help splitting my dataset)

cwright1 · June 16, 2022, 12:46pm

I'm working on a machine learning project that requires me to split my data into 2 groups (as is common in machine learning): a training set (90% of the original data) and a test set (10% of the data).

My data come in replicates, however: each sample is measured 3 times (appended with _R1/_R2/_R3). It is critical that the replicates remain together when the data are split: if xxx_R1 is in the test set, then xxx_R2 and xxx_R3 also need to be in the test set.

In a previous post, I made a minimal reprex for this problem, and @FJCC solved it! The data could be split into 90% and 10% sets while keeping the replicates together.

library(tidyverse)
R1 <- paste0(rownames(mtcars),"_R1")
R2 <- paste0(rownames(mtcars),"_R2")
R3 <- paste0(rownames(mtcars),"_R3")

mydf <- data.frame("samples" = Reduce(union, c(R1,R2,R3)))

#Randomly shuffle the rows to simulate my 'real' data
mydf <- data.frame("samples"=mydf[sample(1:nrow(mydf)),])
mydf <- mydf %>% separate(samples,into = c("Root","Repeat"),
                         remove = FALSE, sep = "_")

Roots <- unique(mydf$Root)
group1_number <- ceiling(0.9 * length(Roots)) 
Group1 <- sample(Roots, group1_number)

group1_df <- mydf %>% filter(Root %in% Group1)
group2_df <- mydf %>% filter(!Root %in% Group1)

My question now: how do I make 10 different iterations of this, where a different 10% is kept for each test set (and a different 90% for each training set)? This is also known as 10-fold cross validation.

I want to add 10 columns to the data: FOLD1 through FOLD10. In these columns the values will be "training" or "test".

FJCC · June 16, 2022, 1:34pm

Here is one possible method.

library(tidyverse)
R1 <- paste0(rownames(mtcars),"_R1")
R2 <- paste0(rownames(mtcars),"_R2")
R3 <- paste0(rownames(mtcars),"_R3")

mydf <- data.frame("samples" = Reduce(union, c(R1,R2,R3)))

#Randomly shuffle the rows to simulate my 'real' data
mydf <- data.frame("samples"=mydf[sample(1:nrow(mydf)),])
mydf <- mydf %>% separate(samples,into = c("Root","Repeat"),
                          remove = FALSE, sep = "_")

Roots <- unique(mydf$Root)
group1_number <- ceiling(0.9 * length(Roots)) 

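# Add one column per iteration: FOLD1 through FOLD10
for(i in 1:10) {
  # Draw a fresh random 90% of the roots for training.
  # Each draw is independent, so test sets can overlap between columns.
  Group1 <- sample(Roots, group1_number)
  ColName <- paste0("FOLD",i)
  # Label by Root so all replicates of a sample get the same label
  mydf[[ColName]] <- ifelse(mydf$Root %in% Group1, "Train","Test")
}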
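
Note that each pass of the loop above draws its own random 90% of the roots, so the ten test sets can overlap and some roots may never be held out. If strict 10-fold cross validation is wanted (every root in the test set exactly once), one option is to assign each root to a single fold first and build the FOLD columns from that assignment. Below is a minimal base R sketch under that assumption; the names `cv_df`, `fold_of_root`, and `row_fold`, and the fixed seed, are illustrative choices and not part of the code above.

```r
set.seed(42)  # assumption: fixed seed only so the sketch is reproducible

# Rebuild the toy data from the thread: 32 roots x 3 replicates, shuffled
samples <- as.vector(outer(rownames(mtcars), paste0("_R", 1:3), paste0))
cv_df <- data.frame(samples = sample(samples))
cv_df$Root <- sub("_R[0-9]+$", "", cv_df$samples)

# Assign every root to exactly one of k folds (fold sizes differ by at most 1)
k <- 10
roots <- unique(cv_df$Root)
fold_of_root <- setNames(sample(rep_len(1:k, length(roots))), roots)

# Look up the fold of each row through its root, so replicates stay together
row_fold <- unname(fold_of_root[cv_df$Root])

# One column per fold: a row is "Test" only in the fold its root belongs to
for (i in 1:k) {
  cv_df[[paste0("FOLD", i)]] <- ifelse(row_fold == i, "Test", "Train")
}
```

With 32 roots and 10 folds, two folds hold out 4 roots and the other eight hold out 3, so each split is close to 90/10 rather than exactly 90/10.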

Created on 2022-06-16 by the reprex package (v2.0.1)
