Splitting data with rsample::initial_time_split

Hi,

I have a data frame that I wish to split using rsample::initial_time_split(). I was wondering if someone could help with the implementation below. In my data set, a date can appear more than once. The data shown is a subsample of my original data; I am unable to share the actual table for privacy reasons.

The training and test sets overlap date-wise: the same date (2021-12-28) appears in both. I would like a clean cut, if possible.

library(rsample)
library(dplyr)

# Split the Data based on time slices
test <- mydf %>% 
  sample_frac(0.01) %>% 
  mutate(date = custom_date) %>% 
  arrange(date)

uv_lag_split <- initial_time_split(test)
train_data <- training(uv_lag_split)
test_data <- testing(uv_lag_split)

c(max(train_data$date), min(test_data$date))
# [1] "2021-12-28" "2021-12-28"

unique(train_data$date) %>% tail()
# [1] "2021-12-23" "2021-12-24" "2021-12-25" "2021-12-26"
# [5] "2021-12-27" "2021-12-28"

unique(test_data$date) %>% head()
# [1] "2021-12-28" "2021-12-29" "2021-12-30" "2021-12-31"
# [5] "2022-01-01" "2022-01-02"

Thank you very much for your time

Do you have duplicate rows for 2021-12-28?

Hi Max,

Yes, I do. They are different events that happened on the same day, so the records are unique but share the date I am using for my split.
I guess it's by design?

Thanks

I think that I would get the distinct dates, run initial_time_split() on those, then join the results back to the original data. initial_time_split() cuts by row position, so rows that share a date can end up on both sides of the cut. I don't think that you'd want the same date in both the training and testing sets.
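
For anyone landing here later, the suggestion above could look roughly like the sketch below. It is only an illustration under assumptions: `events` is a made-up stand-in for the original data (which was not shared), `date_tbl` and `date_split` are hypothetical names, and semi_join() is one choice of join among several that would work.

```r
library(rsample)
library(dplyr)

# Hypothetical stand-in for the real data: several events can share a date
set.seed(123)
events <- tibble(
  date  = sample(seq(as.Date("2021-12-01"), as.Date("2022-01-09"), by = "day"),
                 size = 200, replace = TRUE),
  value = rnorm(200)
)

# 1. One row per distinct date, in chronological order
date_tbl <- events %>%
  distinct(date) %>%
  arrange(date)

# 2. Split the dates rather than the events (prop = 3/4 is the default)
date_split <- initial_time_split(date_tbl, prop = 3/4)

# 3. Keep the events whose date falls on each side of the cut
train_data <- events %>% semi_join(training(date_split), by = "date")
test_data  <- events %>% semi_join(testing(date_split),  by = "date")

# Check that no date appears on both sides
max(train_data$date) < min(test_data$date)
```

Note that prop now applies to the number of distinct dates, not the number of rows, so the share of rows in the training set may differ somewhat from 75% if some dates have many more events than others.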

