Hi,
You still have not provided me with a reprex I can work with, but here is a general example:
library(caret)
library(dplyr)
set.seed(1) # Make the output of the random functions reproducible for this example
# Create some data with a 0/1 outcome (about 80% 0, 20% 1)
myData = data.frame(x = 1:100, y = runif(100),
                    outcome = sample(0:1, 100, replace = TRUE, prob = c(0.8, 0.2)))
# Split the data into two sets (70% / 30%), aiming to keep the outcome distribution
dataSplit = createDataPartition(myData$outcome, p = 0.7, list = FALSE)
# Assign training and testing sets
trainingData = myData %>% slice(dataSplit)
testingData = myData %>% slice(-dataSplit)
# Check the distribution of the outcome
sum(trainingData$outcome) / nrow(trainingData) # proportion of 1s in training
#> [1] 0.1857143
sum(testingData$outcome) / nrow(testingData) # proportion of 1s in testing
#> [1] 0.1666667
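Note that the first argument of the createDataPartition function takes a vector, not a data frame (in this case myData$outcome). It then tries to give the two datasets it creates roughly the same distribution as seen in that vector. One caveat from the caret documentation: sampling is done within the levels of the vector only when it is a factor. For a numeric vector such as the 0/1 column here, caret groups the values by percentiles instead, so if your outcome is a class label it is safer to pass it as factor(myData$outcome).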
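In case it is useful, here is a minimal sketch of the same split with the outcome passed as a factor, so that caret samples within each class. It assumes your outcome really is a class label; the name idx is just something I made up for the example, and I use base R indexing instead of slice so the only package needed is caret. I have not included the printed proportions because they depend on the random draw.

```r
library(caret)

set.seed(1)  # reproducible example

# Same kind of simulated data as above: roughly 80% zeros and 20% ones
myData <- data.frame(
  x = 1:100,
  y = runif(100),
  outcome = sample(0:1, 100, replace = TRUE, prob = c(0.8, 0.2))
)

# Assumption: outcome is a class label, so it is wrapped in factor()
# to make createDataPartition sample within each class.
# With list = FALSE the result is a one-column matrix; [, 1] turns it into a vector.
idx <- createDataPartition(factor(myData$outcome), p = 0.7, list = FALSE)[, 1]

trainingData <- myData[idx, ]
testingData  <- myData[-idx, ]

# Proportion of each outcome value in the two sets
prop.table(table(trainingData$outcome))
prop.table(table(testingData$outcome))
```

Because the number of rows taken from each class is rounded up, the training set can end up with slightly more than 70% of the rows.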
Hope this helps,
PJ