Help with k-fold cross-validation for a CART classification model


#1

Hello, I'm trying to use 10-fold cross-validation for a CART model (classification). Here are my questions:

  1. Do I still need to split my data set when I'm doing cross-validation?
  2. If the answer to question 1 is yes, do we usually run cross-validation on the training data or the test data to get the best model?
  3. I need some help with my code: I don't know how to specify "data" here:

## cross-validation setup

controlparameters <- trainControl(
  method = "cv",            # k-fold cross-validation
  number = 10,              # number of folds (10 for 10-fold CV)
  savePredictions = TRUE,   # caret's argument is 'savePredictions' (plural)
  classProbs = TRUE         # keep class probabilities for probability-based metrics
)
controlparameters

Thank you


#2

Hi @nahalh

I'm open to correction from some of the more experienced people in the community, but from my understanding, cross-validation is used to avoid the model overfitting. If you use 10-fold cross-validation, the data will be split into 10 training and test set pairs. For illustration, let's call them samples (I'm actually borrowing the terminology from @Max and his resamples package).

So you have 10 samples of training and test sets. The training and test sets should be representative of the population data you are trying to model. Let's say you have a data set of credit card fraud for a bank. Your data set should have the same properties as the entire bank's data.

  • Taking the first sample pair, the model is trained on the training portion and then applied to the test set.
  • A metric is then recorded to see how well the model does. The goal of the modelling task should inform which metric to use, as different metrics have different properties.
  • Once the metric is recorded, the model is trained on the second sample and applied to the second test set, and again the metric is recorded.
  • If you are using a decision tree, which is highly variable, you may want to repeat the whole process up to ten times to get a stable performance metric.
  • You can then take the mean across all the performance metrics.
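The loop described in the bullets above can be sketched in base R. This is only a toy illustration using the built-in iris data as a stand-in for your own data frame (the column name "y", the seed, and the choice of accuracy as the metric are all assumptions for the example); rpart is the standard CART implementation in R:

```r
library(rpart)  # CART trees

set.seed(123)
k <- 10
df <- iris                # stand-in data set; replace with your own
names(df)[5] <- "y"       # assumed outcome column name for this sketch

# randomly assign each row to one of k folds
folds <- sample(rep(1:k, length.out = nrow(df)))

accuracy <- numeric(k)
for (i in 1:k) {
  train_fold <- df[folds != i, ]   # train on the other k-1 folds
  test_fold  <- df[folds == i, ]   # hold out the i-th fold
  fit  <- rpart(y ~ ., data = train_fold, method = "class")
  pred <- predict(fit, test_fold, type = "class")
  accuracy[i] <- mean(pred == test_fold$y)  # record the fold's metric
}
mean(accuracy)  # average performance across all folds
```

In practice you would let caret (or rsample) manage the folds for you, but writing the loop out once makes the train/record/average cycle concrete.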

Given all this, cross validation should be implemented on the training data. The test set can be used for interpretation.

Your question on whether you actually need a separate test set at all is a good one. Who you read, and how much data you have, determine whether you should split the data into train and test sets. Some papers/blogs argue that splitting the data into train and test sets isn't ideal, as the test set might not be representative. Personally, if I have enough data I split the data set into train and test sets, apply cross-validation to the train portion, and then apply my model to the test set and use the results for interpretation (if that's my goal).
If you have huge amounts of data and data collection is cheap and plentiful, it might be OK to just use a train and test set.

A much better explanation of resampling than the one I have given:

http://www.feat.engineering/review-predictive-modeling-process.html#resampling

Finally, if you are using R, the link below shows how to pass your data into a caret function like the one you have shown:

https://machinelearningmastery.com/how-to-estimate-model-accuracy-in-r-using-the-caret-package/

Hope it's useful.


#3

Hi @john.smith

Thank you for the clear explanation. My data set has 3344 samples and three variables. I think it makes sense to split the data into 80% training and 20% testing sets. I'll read the materials you shared and will get back to you with questions if I have any.

Thanks,

Nahal


closed #4

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.