Splitting sample into test-train for gradient boosting

Hi everyone:

I am fitting a gradient boosted tree model with a 50-50 split between training and test samples. I am using panel data with 5-6 observations per year. I want to split the data so that, for any given year, all of its observations are either in the training set or in the test set. In other words, I want to split the sample by year so that no year's observations are divided between the two sets. Can anybody help? Thanks!

I assume you know the years and that the number of observations per year is roughly the same. In that case you can sample years rather than rows:

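df <- data.frame(
  year = c(2019, 2020, 2021, 2022, 2019, 2020, 2021, 2022, 2019, 2020, 2021, 2022),
  sales = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)
)
set.seed(1234)
# Sample from the distinct years; sampling df$year directly could draw the same year twice
s <- sample(unique(df$year), size = 2, replace = FALSE)
# Rows from the sampled years form the training set
train <- df[df$year %in% s, ]
train
# All remaining years form the test set
test <- df[!(df$year %in% s), ]
test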
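
If the number of years is not fixed in advance, the same idea can be wrapped in a small function. This is only a sketch: `split_by_year` and its arguments (`year_col`, `prop_train`, `seed`) are names I made up for illustration, not part of any package. It picks a share of the distinct years for training and sends every row of those years to the training set.

```r
# Hypothetical helper: assign whole years to either train or test.
split_by_year <- function(data, year_col = "year", prop_train = 0.5, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  years <- unique(data[[year_col]])
  # Number of years to put in the training set (at least one)
  n_train <- max(1, floor(length(years) * prop_train))
  # Index with sample.int() to avoid sample()'s special handling of a single number
  train_years <- years[sample.int(length(years), n_train)]
  in_train <- data[[year_col]] %in% train_years
  list(
    train = data[in_train, , drop = FALSE],
    test  = data[!in_train, , drop = FALSE]
  )
}

df <- data.frame(
  year  = rep(2019:2022, times = 3),
  sales = 1:12
)
parts <- split_by_year(df, seed = 1234)

# No year should appear in both sets
stopifnot(length(intersect(parts$train$year, parts$test$year)) == 0)

# Share of rows that ended up in the training set
nrow(parts$train) / nrow(df)
```

Note that the split is 50-50 in years, not necessarily in rows. With 5-6 observations per year the row share should be close to one half, but it is worth checking the last line on your own data.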
