I'm open to correction from some of the more experienced people in the community, but from my understanding cross validation is used to estimate how well a model will perform on data it has not seen, which in turn helps you spot over-fitting. If you use 10 fold cross validation, the data will be split into 10 training and test set pairs, with each pair holding out a different tenth of the data as its test set. For illustration let's call them samples (I'm actually borrowing the terminology from @Max and his resamples package).
So you have 10 samples of training and test sets. The training and test sets should be representative of the population you are trying to model. Let's say you have a dataset for credit card fraud for a bank. Your dataset should have the same properties as the bank's entire dataset, for example a similar proportion of fraudulent transactions.
- Taking the first sample pair, the model is trained on the training portion and then applied to the test set
- A metric is then recorded to see how well the model does. The goal of the modeling task should inform which metric you use, as different metrics have different properties.
- Once the metric is recorded, a fresh model is trained on the second sample's training portion and applied to the second test set, and again the metric is recorded. This carries on until all 10 samples have been used.
- If you are using a decision tree, which is highly variable, you may want to repeat the whole process up to ten times (with a different random split each time) to get a stable performance metric
- You can then take the mean across all performance metrics
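
Here is a rough sketch of the steps above in plain Python, in case seeing them as code helps. The language is just my choice for illustration, and everything in it is made up for the example: the helper names (`k_fold_indices`, `cross_validate`), the toy "predict the training mean" model, the synthetic data, and mean absolute error as the metric. Treat it as an outline of the idea rather than how any particular package does it.

```python
import random
from statistics import mean


def k_fold_indices(n_rows, k, seed=0):
    """Shuffle the row indices and deal them into k roughly equal folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(xs, ys, fit, predict, metric, k=10, seed=0):
    """Train on k-1 folds, score on the held-out fold, return one score per fold."""
    scores = []
    for held_out in k_fold_indices(len(xs), k, seed):
        held_out_set = set(held_out)
        train_idx = [i for i in range(len(xs)) if i not in held_out_set]
        # A fresh model is fitted for every sample pair.
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        preds = [predict(model, xs[i]) for i in held_out]
        scores.append(metric([ys[i] for i in held_out], preds))
    return scores


# Toy model (hypothetical): always predict the mean of the training targets.
def fit_mean(train_x, train_y):
    return mean(train_y)


def predict_mean(model, x):
    return model


def mae(actual, predicted):
    """Mean absolute error; swap in whatever metric suits your goal."""
    return mean(abs(a - p) for a, p in zip(actual, predicted))


# Synthetic data, purely for illustration.
xs = list(range(50))
ys = [2.0 * x + 1.0 for x in xs]

# One round of 10 fold cross validation, then the mean across folds.
fold_scores = cross_validate(xs, ys, fit_mean, predict_mean, mae, k=10)
cv_estimate = mean(fold_scores)

# Repeating the whole procedure with different random splits.
repeated_scores = []
for repeat in range(10):
    repeated_scores.extend(
        cross_validate(xs, ys, fit_mean, predict_mean, mae, k=10, seed=repeat)
    )
repeated_estimate = mean(repeated_scores)
```

The repeat loop only changes the seed, so each repeat deals the rows into different folds. That is what smooths out the estimate for a high-variance model.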
Given all this, cross validation should be implemented on the training data only. The test set is kept aside and can then be used for a final check and for interpretation.
Your question about whether you actually need a separate test set overall is a good one. Whether you should split the data into train and test depends on who you read and how much data you have. Some papers/blogs say that splitting the data into train and test sets isn't ideal, as the test set might not be representative. Just personally, if I have enough data I split the dataset into train and test sets and then apply cross validation to the train portion. I then apply my model to the test set and use the results for interpretation (if that's my goal).
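
To make that workflow concrete, here is a minimal sketch in plain Python, assuming an 80/20 split, 10 folds, and a toy "predict the training mean" model scored by mean absolute error. All of those choices, along with the helper names `train_test_split` and `mae_of_mean_model` and the synthetic data, are mine for illustration only.

```python
import random
from statistics import mean


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and carve off a held-out test set."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def mae_of_mean_model(fit_rows, eval_rows):
    """Fit a toy 'predict the training mean' model and score it by MAE."""
    prediction = mean(y for _, y in fit_rows)
    return mean(abs(y - prediction) for _, y in eval_rows)


# Synthetic (x, y) rows, purely for illustration.
rows = [(x, 2.0 * x + 1.0) for x in range(100)]
train_rows, test_rows = train_test_split(rows, test_fraction=0.2)

# Cross validation touches the training portion only.
k = 10
folds = [train_rows[i::k] for i in range(k)]
cv_scores = []
for i, held_out in enumerate(folds):
    fit_rows = [row for j, fold in enumerate(folds) if j != i for row in fold]
    cv_scores.append(mae_of_mean_model(fit_rows, held_out))
cv_estimate = mean(cv_scores)

# Only at the very end: refit on all training rows, score once on the test set.
test_score = mae_of_mean_model(train_rows, test_rows)
```

The point of the ordering is that the test rows play no part in any of the cross validation folds, so `test_score` is a check on data the modeling process never saw.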
If you have huge amounts of data and data collection is cheap and plentiful, it might be OK to just use a single train and test split without cross validation.
A much better explanation of re-sampling than the one I have given
Finally, if you are using R, the below link should show you how to put your data into a caret function like you have shown
Hope it's useful