I’m sad to say I’ve made a few attempts to start bringing tidymodels into my work, but it’s been a frustrating and disappointing experience. I’ve used caret for years, and while it has a few flaws, I continue to find it much simpler and more intuitive.
I hope that my criticism is constructive. While I haven’t been able to develop much expertise in tidymodels, I can offer some feedback.
I make this post not to complain but to try to understand whether I am the only one who feels this way. Is there a dialog about whether tidymodels should be the dominant supported platform for ML and cross validation in R? Or whether this is a good way to teach modeling in R? Sorry if these questions seem inflammatory, but I’ve been along for much of the ride with R and the tidyverse, made a whole career on it, and it’s been amazing. This is the first time I’ve felt left behind or had so many questions about the evolution.
The recipe combines both a formula and data preprocessing. I can’t see the benefit of combining these two mostly independent things. Recipes also abstract the pre-processing in a potentially dangerous way and streamline something that in my experience always requires a custom treatment. Pre-processing is something I have to collaborate on with my clients, and it always requires something specific to the problem (outlier screening within subgroups, etc.). dplyr is an amazing tool for pre-processing and I don’t think it makes sense to create a more limited and abstract alternative.
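To make the "outlier screening within subgroups" case concrete, here is a minimal sketch of the kind of custom step I mean, written in plain dplyr. The data, the column names (site, value), and the 1.5 x IQR rule are all made up for illustration; a real screen would be agreed with the client.

```r
library(dplyr)

# Hypothetical data: one measurement per row, grouped by site.
measurements <- tibble(
  site  = rep(c("A", "B"), each = 6),
  value = c(10, 11, 9, 10, 12, 50, 100, 98, 102, 101, 99, 10)
)

# Screen outliers within each site, not across the whole data set.
# The 1.5 * IQR fence is an assumption chosen only for this example.
screened <- measurements %>%
  group_by(site) %>%
  mutate(
    q1 = quantile(value, 0.25),
    q3 = quantile(value, 0.75),
    iqr = q3 - q1,
    is_outlier = value < q1 - 1.5 * iqr | value > q3 + 1.5 * iqr
  ) %>%
  ungroup() %>%
  filter(!is_outlier) %>%
  select(site, value)
```

Every intermediate column (q1, q3, is_outlier) can be inspected before the filter, which is a large part of why I find this style easy to discuss with a client.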
The delayed execution of the recipe, then the prep, then the juice, in my opinion just makes it more difficult to work with and inflates the number of functions a user has to juggle...
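For readers who have not used recipes, this is roughly the sequence I am referring to. It is a minimal sketch on the built-in mtcars data, with centering and scaling chosen arbitrarily as the steps; it is not meant to represent how any particular project should be set up.

```r
library(dplyr)
library(recipes)

# 1. Declare the recipe: a formula plus preprocessing steps.
#    Nothing is computed yet.
rec <- recipe(mpg ~ ., data = mtcars) %>%
  step_center(all_predictors()) %>%
  step_scale(all_predictors())

# 2. prep() estimates the means and standard deviations
#    from the training data.
rec_prepped <- prep(rec, training = mtcars)

# 3. juice() returns the processed training data.
train_processed <- juice(rec_prepped)

# 4. bake() applies the same trained steps to other data
#    (here just the first five rows, as a stand-in for new data).
new_processed <- bake(rec_prepped, new_data = mtcars[1:5, ])
```

That is four functions before any model has been specified, and the data itself is not visible until step 3.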
tidymodels simply requires too many functions. It’s a burden to have to keep them all in your head and it’s difficult to understand what they do individually. Many of these functions (set_engine, set_mode, set_args) seem more natural as arguments than as functions, as they are in caret.
When the operation of creating a tidymodel is spread across so many functions, it is difficult to consult documentation for help. Contrast this with caret::train. You may need to refer to the caret::trainControl documentation as well, but between those two functions and their documentation, you have the whole workflow right in front of you. With tidymodels, I have to keep all of the functions in my head, perhaps look up lots of documentation, and try to wrap my head around what options are available. I find myself reading the vignettes again and again to understand the workflow. It’s just too scattered. It doesn’t stick in my head. I would struggle to have confidence that I see all my engines and options correctly.
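As a rough illustration of the contrast, here is the same linear regression specified both ways on mtcars. This is only a sketch: the predictors are arbitrary, and the two snippets are not fully equivalent, since the caret call also runs 5-fold cross validation while the parsnip pipeline below only fits the model (resampling in tidymodels would bring in further packages and functions).

```r
library(caret)
library(parsnip)

set.seed(1)

# caret: engine and resampling settings are arguments to train()
# and trainControl().
caret_fit <- train(
  mpg ~ wt + hp,
  data = mtcars,
  method = "lm",
  trControl = trainControl(method = "cv", number = 5)
)

# parsnip: engine and mode are set by separate functions,
# then fit() is called.
parsnip_fit <- linear_reg() %>%
  set_engine("lm") %>%
  set_mode("regression") %>%
  fit(mpg ~ wt + hp, data = mtcars)

# Both wrap the same underlying lm fit.
caret_coefs <- unname(coef(caret_fit$finalModel))
parsnip_coefs <- unname(coef(parsnip_fit$fit))
```

With caret, ?train and ?trainControl cover everything in the first call. With parsnip, each of linear_reg, set_engine, set_mode, and fit has its own help page.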
tidymodels is taught with pipes and I love pipes. But when the intermediate output between the pipes is abstracted, the sequence of steps feels like something you just have to memorize rather than a coherent sequence of individual operations.
What’s the alternative? Well, I do understand what tidymodels is trying to do, and there is a need for a tool that can handle the whole workflow: train/test split, cross validation, best model selection, prediction on the test set. I just think that the workflow in tidymodels is too abstract and scattered across too many new functions to learn.
And there is one important thing missing from both caret and tidymodels: so many times, I have to create not just one model, but dozens. Different response variables, different cross validation strategies, different pre-processing. With a tibble and map functions, it’s possible and extremely powerful to create one row per model, with columns for test set, train set, trControl, and train settings (method, tuneLength, ...). Probably many of us have our own version of this approach. I just think there’s an opportunity for a new package to bring the model workflow into tibbles with purrr functions and modelr and caret (leveraging the existing tidyverse capabilities) without having to learn a new abstract framework.
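Here is a stripped-down sketch of the one-row-per-model idea, assuming mtcars as the data and an arbitrary set of responses, predictors, and methods. It leaves out the train and test set columns to stay short; they could be added as further list-columns in the same way.

```r
library(dplyr)
library(purrr)
library(caret)

set.seed(42)

# One row per model: response, caret method, and tuning length.
model_grid <- tibble(
  response   = c("mpg", "mpg", "qsec", "qsec"),
  method     = c("lm", "knn", "lm", "knn"),
  tuneLength = c(1, 3, 1, 3)
) %>%
  mutate(
    # Build a formula per row from the response name.
    formula = map(response, ~ reformulate(c("wt", "hp", "disp"), response = .x)),
    # The same cross validation settings for every row here;
    # this column could just as well vary by row.
    trControl = rep(list(trainControl(method = "cv", number = 5)), n()),
    # Fit every model by mapping train() over the rows.
    fit = pmap(
      list(formula, method, trControl, tuneLength),
      function(f, m, ctrl, tl) {
        train(f, data = mtcars, method = m, trControl = ctrl, tuneLength = tl)
      }
    ),
    # Best cross-validated RMSE per model.
    cv_rmse = map_dbl(fit, ~ min(.x$results$RMSE))
  )
```

The result is an ordinary tibble, so comparing models is a matter of select, arrange, and filter on columns like response, method, and cv_rmse.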