Splitting data into training and test data

Hi there,

Just for context, this is the question I am struggling with: "Split the data into training and test data: test data will contain observations from year 1986 and 1987. Rest of the observations will be a part of the training data." How can I do this? I can't find anything that helps me make sure my test data contains only the observations from 1986 and 1987, with the rest of the observations in the training data. Year is one of my variables. Help would be much appreciated!

Hello,

First of all, please see our homework policy in the forum. That said, there are numerous ways to do this. You could use base R, for example the %in% operator, to test which rows fall within a given set of years and subset on that. Or you could make use of more advanced packages like lubridate and dplyr, or even data.table and collapse, which offer specific functions for converting dates between numerous formats.

Without representative sample data (e.g. a reprex, see here for more information) I don't think you will get any additional information or working code, since that would be against the homework policy.

But maybe the suggested base R method or the packages I mentioned are enough to get you started, since subsetting is a rather basic task which should be rather easy, compared to advanced machine learning techniques.

Kind regards

I understand. I just want to split my data into training and test. I also want the test data to show all observations for the following years only: 1986 and 1987. Training data will just show the rest of the observations.

What have you tried so far by yourself?

This is what I have tried by myself so far.

library(caret)  # provides createDataPartition()

set.seed(123, sample.kind = "Rounding")
# random 70/30 split, stratified on year
index <- createDataPartition(y = Crime$year, p = 0.7, list = FALSE)

train <- Crime[index, ]
test <- Crime[-index, ]
nrow(train) / nrow(Crime)

Just to mention, Crime is the name I've given to my data set and year is the name of the year variable. I feel like I am on the right track, but I just can't get my test data to contain only the observations from 1986 and 1987 and my training data to contain everything else.

Alright, so I made up some sample data for demonstration. You are overcomplicating it - using base R is enough here. The caret function createDataPartition gives you a random subset - but you have specific conditions so why not just use them with base R?

Crime <- data.frame(
  year      = sample(1980:2000, size = 5000, replace = TRUE),
  dep_var   = runif(5000, 100, 200),
  indep_var = rnorm(5000, 10, 2)
)
head(Crime)
#>   year  dep_var indep_var
#> 1 1996 126.3860  9.543301
#> 2 1981 179.7279  7.656707
#> 3 1988 188.5838  9.327195
#> 4 1992 167.6224  6.007985
#> 5 1999 183.8337 10.034882
#> 6 1983 185.0700 12.727204
# check how many rows correspond to 1986 and 1987
nrow(Crime[which(Crime$year %in% c(1986, 1987)), ])
#> [1] 488
# row positions of the 1986-1987 observations
index <- which(Crime$year %in% c(1986, 1987))
# test: only 1986-1987 rows; train: every other row
test   <- Crime[index, ]
train  <- Crime[-index, ]
# same count as above
nrow(test)
#> [1] 488

Created on 2022-11-12 with reprex v2.0.2
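One caveat with negative indexing, added as a hedged aside: if no row matches, which() returns an empty vector, and Crime[-index, ] then selects zero rows instead of all of them. A logical mask avoids that edge case. Below is a minimal sketch on a tiny made-up data frame; the name is_test is only chosen for illustration and the values are not real data.

```r
# Small made-up data frame standing in for the real Crime data
Crime <- data.frame(
  year    = c(1985, 1986, 1987, 1988, 1986, 1990),
  dep_var = c(10, 20, 30, 40, 50, 60)
)

# Logical mask: TRUE for rows that belong in the test set
is_test <- Crime$year %in% c(1986, 1987)

test  <- Crime[is_test, ]   # only 1986 and 1987
train <- Crime[!is_test, ]  # everything else

# Sanity checks: the split is clean and no row is lost
stopifnot(all(test$year %in% c(1986, 1987)))
stopifnot(!any(train$year %in% c(1986, 1987)))
stopifnot(nrow(test) + nrow(train) == nrow(Crime))
```

The three stopifnot() calls are cheap to keep around on the real data too, since they fail loudly if a test year leaks into the training set.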

The index above holds the positions of all rows whose year is either 1986 or 1987. If you have actual dates instead of a plain year column, you should play around with data.table::year() or related functions to extract the year first and get the desired result.
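For the case with actual dates, here is a rough sketch of the idea using only base R. It assumes a Date column called date, which is a hypothetical name since the real data set was never posted, and the four rows are made up for illustration.

```r
# Made-up data with a Date column (column name `date` is an assumption)
Crime <- data.frame(
  date    = as.Date(c("1985-03-01", "1986-07-15", "1987-12-31", "1988-01-01")),
  dep_var = c(1, 2, 3, 4)
)

# Extract the calendar year from each Date as an integer.
# data.table::year(Crime$date) or lubridate::year(Crime$date) should
# yield the same years if one of those packages is already loaded.
Crime$year <- as.integer(format(Crime$date, "%Y"))

# Same split as before, now based on the derived year column
is_test <- Crime$year %in% c(1986, 1987)
test    <- Crime[is_test, ]
train   <- Crime[!is_test, ]
```

If the dates are stored as character strings, they would need to go through as.Date() with a matching format first.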

Kind regards

You...my friend....I shall remember you when I become rich and I am not joking. Thank you so much for helping me out, I understood where I went wrong and it's giving me exactly what I need. I genuinely spent 4-5 hours trying different videos and commands, but you made my life so much simpler. Thank you, thank you and thank you.

Glad I could help you tackle your problems :smiley:
