Machine Learning Workflow?



R Markdown has been amazing for organizing my work and packaging it for analysis-type projects. With GNU Make, packrat, and a specific folder structure, the whole thing is reproducible and very efficient to work with.

I struggle a bit with machine learning model development projects because they typically have many parts and are quite circular. Also, there are two competing packages, caret and mlr, with overlapping functionality.

How do you structure a machine learning project and make it reproducible? (Ignore productionization, as I think that part already has a lot of work around it.)


So, just to make things slightly more complicated before anyone tries to break things down :wink:, I thought I’d point out the CRAN Task View for Machine Learning. The one for Reproducible Research might be relevant as well. Why task views? Well, lots of packages have overlapping functionality. I don’t think that’s unique to ML.

You may also want to look at Max’s package, recipes.

I’m not an expert in the area, but it lets you construct a blueprint of sorts (a recipe, if you will) for your preprocessing. Depending on your aims and the model you’re using, that can go a long way toward helping with reproducibility and/or communicating what is and is not reproducible.
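To make the "blueprint" idea a bit more concrete, here is a minimal sketch, assuming the `recipes` package is installed. The built-in `mtcars` data and the row-based train/test split are placeholders I picked for illustration, not anything from the original question. The point is that the preprocessing is estimated once on the training data and then applied unchanged to new data.

```r
library(recipes)

# Hypothetical split of a built-in dataset, for illustration only
train <- mtcars[1:24, ]
test  <- mtcars[25:32, ]

# Declare the blueprint: nothing is computed yet
rec <- recipe(mpg ~ ., data = train)
rec <- step_center(rec, all_predictors())
rec <- step_scale(rec, all_predictors())

# Estimate the centering/scaling statistics from the training data only
prepped <- prep(rec, training = train)

# Apply the same, already-estimated transformations to both sets
train_baked <- bake(prepped, new_data = train)
test_baked  <- bake(prepped, new_data = test)
```

Because the prepped recipe is an ordinary R object, it could in principle be saved with `saveRDS()` alongside the model so the exact preprocessing travels with it.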


Also, there are two competing packages, caret and mlr, with overlapping functionality.

There’s probably some bias on these forums for the former =]

So I would have questions before answering…

  • Will you need to update the models on a regular basis as more data rolls in?

  • How much do you need to document the models and the modeling process? Is this under regulatory oversight?

  • Where does the data come from (the golden spreadsheet, a database, etc.)?


In order:

  1. I think a big benefit of having a workflow is being able to swap in new data and retrain the model without much effort. In practice I’ve seen a constant need to retrain models, so yes, that need is there.

  2. Documentation is key. With GDPR, documentation of the models is necessary.

  3. I think the extraction part can be taken care of by haven, DBI, or readr, so the data source isn’t as important (?)


Recipes is beautiful! I’ve already seen this and love the idea, but am waiting for caret to become tidier and more integrated before diving in. Strangely, mlr right now is more amenable to tidy analysis / pipes.


I was asking because some paths would lead me to use make for everything. So if the source data were in flat files and would be updated periodically, a dependency-based system would probably be best.


Well, you can’t get less tidy than caret =]

I’m building a set of packages that will be more modular and will work together, like rsample, yardstick, and tidyposterior. The APIs are fairly low-level right now (see those pkgdown sites), but after the next package it will be a lot easier to have tidy high-level interfaces, so it’s coming along.
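As a rough illustration of how the low-level pieces can fit together, here is a minimal sketch, assuming `rsample` and `yardstick` are installed. The `mtcars` data, the `lm()` model, and the chosen predictors are placeholders for illustration only; the resampling and metric steps are the part being shown.

```r
library(rsample)
library(yardstick)

set.seed(123)  # make the fold assignment reproducible

# 5-fold cross-validation on a built-in dataset (illustrative choice)
folds <- vfold_cv(mtcars, v = 5)

# Fit on each analysis set, score on the matching assessment set
fold_rmse <- vapply(folds$splits, function(split) {
  fit <- lm(mpg ~ wt + hp, data = analysis(split))
  held_out <- assessment(split)
  held_out$.pred <- predict(fit, newdata = held_out)
  rmse(held_out, truth = mpg, estimate = .pred)$.estimate
}, numeric(1))

# Average out-of-sample RMSE across folds
mean(fold_rmse)
```

Each package does one job here: rsample only manages the splits, and yardstick only computes the metric, so the model in the middle can be swapped for anything with a `predict()` method.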


OK, my mind is officially blown. I’ll be patiently waiting for this to replace mlr! Thank you for responding. I’ll keep using the packages as they get updated and see how my use cases fit. Is this forum the best way to provide feedback?


No, probably the individual packages.


:raised_hands: this is really great news @Max! Excited to see how the more modularized packages unfold.