Data Partition in Caret Package and Over-fitting


#1

I was reading the caret package documentation and saw this function signature:

createDataPartition(y, times = 1, p = 0.5, list = TRUE, groups = min(5, length(y)))
I am wondering about the “times” argument. Suppose I use this code:

inTrain2 <- createDataPartition(y = MyData$Class, times = 3, p = .70, list = FALSE)

training2 <- MyData[inTrain2, ]      # ≈ 67% (train)
testing2 <- MyData[-inTrain2[2], ]   # ≈ 33% (test)

Could this cause an overfitting problem? Or is it meant for some kind of (unbiased) resampling method?

I should mention that if I use this code:

inTrain2 <- createDataPartition(y = MyData$Class, times = 1, p = .70, list = FALSE)
training2 <- MyData[inTrain2, ]    # 142 samples, ≈ 67% (train)
testing2 <- MyData[-inTrain2, ]    # 69 samples, ≈ 33% (test)

I get 211 samples in total and an accuracy of about 52%. On the other hand, if I use this code:

inTrain2 <- createDataPartition(y = MyData$Class, times = 3, p = .70, list = FALSE)
training2 <- MyData[inTrain2, ]      # 426 samples, ≈ 67% (train)
testing2 <- MyData[-inTrain2[2], ]   # 210 samples, ≈ 33% (test)

I get 636 samples in total and an accuracy of about 98%.

Many thanks in advance.


#2

times is there for when you are doing Monte Carlo resampling, i.e. drawing several random training/test partitions. I would use times = 1 if you are making a simple training/test split.
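
A small sketch of the difference, using the built-in iris data as a stand-in for MyData (an assumption, since the original data is not shown; the seed is arbitrary). With list = FALSE and times = 3 the result is a matrix with one column per partition, so indexing rows with the whole matrix stacks all three partitions, and -inTrain2[2] drops only a single row. The 426 and 210 row counts in the question are consistent with that (3 x 142 and 211 - 1). If so, the 98% accuracy probably reflects test rows that were also in the training data rather than a better model.

```r
library(caret)

set.seed(42)  # arbitrary seed, only so the splits are reproducible

# iris is a stand-in for MyData; Species plays the role of Class.
idx <- createDataPartition(y = iris$Species, times = 3, p = 0.70, list = FALSE)

# With times = 3 and list = FALSE, idx is a matrix: one column per partition.
dim(idx)

# Pitfall 1: indexing rows with the whole matrix flattens it, so the three
# partitions are stacked into one "training set" that contains repeated rows.
stacked_train <- iris[idx, ]
nrow(stacked_train) == length(idx)   # TRUE
anyDuplicated(as.vector(idx)) > 0    # TRUE

# Pitfall 2: idx[2] is a single number, so -idx[2] removes exactly one row.
leaky_test <- iris[-idx[2], ]
nrow(leaky_test) == nrow(iris) - 1   # TRUE

# Share of those "test" rows that also appear in the stacked training data.
leaky_rows <- setdiff(seq_len(nrow(iris)), idx[2])
mean(leaky_rows %in% as.vector(idx))

# Monte Carlo resampling: treat each column as its own train/test split.
for (k in seq_len(ncol(idx))) {
  train_k <- iris[idx[, k], ]
  test_k  <- iris[-idx[, k], ]
  # Fit on train_k and evaluate on test_k here, then average over k.
  stopifnot(nrow(train_k) + nrow(test_k) == nrow(iris))
}
```

Within each column the training and test rows are disjoint, so the per-split accuracies can be averaged without the test data leaking into training.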


#3

Thank you very much.