Question on split ratio for basic logistic regression and CART models

Hi, I'm working on some data on heart failure clinics in order to predict death event numbers. I'm relatively new, so I'm confused about some of the concepts in R.

I am going to fit a logistic regression model and CART models (decision tree and randomForest). How do I find the best split ratio for my data? Or is this number purely arbitrary?
Additionally, should the split ratio remain constant across all three models, or is there an optimum number for each of them?

Thanks in advance

Hi @MayDallow,

Training/testing split ratios are fairly arbitrary. They are usually chosen so that the training set is large enough to fit a model without severe over-fitting, and the test set is large enough to evaluate that over-fitting reliably. The ratio is best decided from knowledge of the data structure and shouldn't really be treated as a tuning parameter. In other words, it's a human decision, rather than something found by trying out many different splits to find the best model. That sort of data-driven splitting will more than likely lead to poor external generalizability for your model.
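As a rough illustration of fixing the ratio up front, here is a minimal base R sketch. The data frame `heart` and its columns (`age`, `ejection_fraction`, `death_event`) are simulated stand-ins rather than your actual data, and the 70/30 ratio is just a common convention, not a recommendation for your data. The split is stratified by outcome so the training and test sets keep a similar event rate.

```r
# Simulated stand-in for the heart failure data; all column names are hypothetical.
set.seed(42)  # fix the seed so the split is reproducible
n <- 300
heart <- data.frame(
  age = round(runif(n, 40, 95)),
  ejection_fraction = round(runif(n, 15, 80)),
  death_event = rbinom(n, 1, 0.3)
)

# Choose the ratio once, before looking at any model results.
train_frac <- 0.7

# Stratify by outcome: sample the same fraction of rows within each class.
idx_by_class <- split(seq_len(n), heart$death_event)
train_idx <- unlist(
  lapply(idx_by_class, function(i) sample(i, size = floor(train_frac * length(i)))),
  use.names = FALSE
)

train <- heart[train_idx, ]
test  <- heart[-train_idx, ]

# Check the sizes and the event rate in each set.
c(train = nrow(train), test = nrow(test))
c(train = mean(train$death_event), test = mean(test$death_event))
```

If the test set ends up with only a handful of events, that is usually a sign to revisit the ratio based on the data, not to search over ratios for the best score.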

If you want to compare the model performances to one another to see which model is "best", then it makes sense to use the same training and testing splits for all of the model types. Hope this is helpful.
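A minimal sketch of reusing one split across model types might look like the following. The data are simulated, the column names and the 0.5 classification threshold are assumptions for illustration, and the random forest step only runs if the randomForest package happens to be installed.

```r
library(rpart)  # recommended package shipped with standard R installations

# Simulated stand-in data; column names and the outcome model are hypothetical.
set.seed(1)
n <- 400
heart <- data.frame(
  age = runif(n, 40, 95),
  ejection_fraction = runif(n, 15, 80)
)
p <- plogis(-1 + 0.04 * (heart$age - 65) - 0.05 * (heart$ejection_fraction - 40))
heart$death_event <- rbinom(n, 1, p)

# One split, created once and reused for every model.
train_idx <- sample(seq_len(n), size = floor(0.7 * n))
train <- heart[train_idx, ]
test  <- heart[-train_idx, ]

# Logistic regression, classifying at an assumed 0.5 threshold.
fit_glm <- glm(death_event ~ age + ejection_fraction, data = train, family = binomial)
pred_glm <- as.integer(predict(fit_glm, newdata = test, type = "response") > 0.5)

# Classification tree (CART); the outcome must be a factor for class predictions.
train_tree <- transform(train, death_event = factor(death_event, levels = c(0, 1)))
fit_tree <- rpart(death_event ~ age + ejection_fraction, data = train_tree, method = "class")
pred_tree <- as.integer(as.character(predict(fit_tree, newdata = test, type = "class")))

# Test-set accuracy, computed on the same held-out rows for each model.
accuracy <- c(
  logistic = mean(pred_glm == test$death_event),
  tree     = mean(pred_tree == test$death_event)
)

# Optional: random forest on the same split, only if the package is available.
if (requireNamespace("randomForest", quietly = TRUE)) {
  fit_rf <- randomForest::randomForest(death_event ~ age + ejection_fraction, data = train_tree)
  pred_rf <- as.integer(as.character(predict(fit_rf, newdata = test)))
  accuracy["random_forest"] <- mean(pred_rf == test$death_event)
}

accuracy
```

Because every model sees exactly the same training rows and is scored on exactly the same test rows, differences in accuracy reflect the models and not the luck of the split.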

Hi @mattwarkentin,

Appreciate the response, I will bear this in mind and make the necessary changes to my model.

Thanks again

