This is a genuinely hard topic, and one I think a lot of experienced data scientists gloss over and take for granted when explaining it to newer data scientists.
My preferred method is to version control code and use a production (master), dev, feature branch strategy. By that I mean each iteration of the model goes into a new feature branch. When I'm satisfied with the model change, I'll push it to the dev branch where it will get A/B tested against the version in production. If the dev branch proves superior over time, it will replace the production branch. If not, the next feature branch will just replace that dev branch.
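As a sketch, the branch flow looks roughly like this (repo location and branch names are illustrative, not from a real project):

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "you@example.com"
git config user.name "You"

# master == production: the model currently serving real traffic
git commit -q --allow-empty -m "production model, apiVersion 1.1.2"
git checkout -q -B master

# dev: the candidate that gets A/B tested against production
git checkout -q -b dev

# one feature branch per model iteration
git checkout -q -b feature/new-hyperparameters
git commit -q --allow-empty -m "try new hyperparameters"

# satisfied with the iteration -> fold it into dev for A/B testing
git checkout -q dev
git merge -q --no-edit feature/new-hyperparameters
git branch
```

If dev wins the A/B test over time it gets merged to master; if not, the next feature branch simply replaces it.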
By using this branching strategy, I'm versioning my models by default, though I also include a description file with a version number that gets incremented on each update. For example, the production version might look like `apiVersion: 1.1.2` and the dev version like `apiVersion: 1.2.300`, where the extra digits let me distinguish major updates, minor updates, and slight variations. I use that version naming strategy because it matches the rest of the software engineering team.

I typically try to serve models as a RESTful API that just wraps `predict(model, new_data)`, where `model` is a loaded `.rds` binary and `new_data` is the JSON submitted to the API. For this aspect I'm using the `plumber` package and running the R code in a Docker container.
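A minimal sketch of what that wrapper can look like with plumber (file name, version string, and JSON shape are my own illustration, not a real deployment):

```r
# plumber.R -- hypothetical scoring endpoint
library(plumber)

model <- readRDS("model.rds")  # model previously saved with saveRDS()

#* Score observations posted as JSON
#* @post /predict
function(req) {
  new_data <- jsonlite::fromJSON(req$postBody)
  list(
    apiVersion = "1.1.2",                 # surfaced from the description file
    prediction = predict(model, new_data)
  )
}
```

Launched with something like `plumber::pr("plumber.R") |> plumber::pr_run(port = 8000)`, typically inside a container built from one of the `rocker` base images.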
I log all prediction requests/responses regardless of whether a decision is being made from them, and the logs include the model version that made each prediction. For example, another app might be using the production branch to get predictions, but I'm also passing that same request to the dev branch so I can compare the responses, even if the dev branch predictions aren't going anywhere but the logs. If the dev branch shows promise I might start to actually direct traffic to it. This may sound hard to set up, but it's actually pretty easy in practice thanks to modern web servers like nginx and the concept of load balancers.
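For instance, nginx's `mirror` directive can do exactly this kind of shadowing; the upstream names and ports below are hypothetical:

```nginx
# Serve every request from production, silently copy it to dev.
upstream model_prod { server model-prod:8000; }
upstream model_dev  { server model-dev:8000; }

server {
    listen 80;

    location /predict {
        mirror /predict_shadow;        # copy of each request goes to dev
        proxy_pass http://model_prod;  # only this response reaches the client
    }

    location = /predict_shadow {
        internal;
        proxy_pass http://model_dev$request_uri;  # dev's response is discarded
    }
}
```

The mirrored response never reaches the caller, so dev can be compared in the logs without affecting anyone.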
I try to save the training data files in cloud storage, but sometimes I just save the SQL used to get the data (which is risky, since the same query run at different times can return different data, though it's less risky when querying append-only database tables).
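To illustrate the append-only point, a query pinned to an insert timestamp (table and column names are hypothetical) returns the same rows no matter when it's re-run:

```sql
-- Reproducible on an append-only table: rows are never updated or
-- deleted, so pinning on the insert timestamp yields the same
-- training set every time the query runs.
SELECT customer_id, feature_1, feature_2, label
FROM events
WHERE inserted_at <= TIMESTAMP '2019-06-01 00:00:00';
```

Without the timestamp predicate (or on tables that allow updates/deletes), re-running the saved SQL later can silently produce a different training set.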
We have a whole other setup at my company that implements a predictive modeling DSL (domain-specific language), which boils down to a config file where you declare the model features (variables) and how to build them. The DSL then gets compiled and run on the JVM. It makes it possible to mix complex algorithms with simple ones like rule-based systems simultaneously, while also automatically providing the tooling for deployment and monitoring. It also means that, during development, you only need to justify why the feature you're adding benefits the model. That said, it's kind of a black-box system, and it can be pretty difficult to understand what happened, since it uses some automated machine learning to build the final model.
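I can't share the real DSL, but the flavor of such a config file is roughly this (every field name here is hypothetical):

```yaml
# Hypothetical feature-definition config for an in-house modeling DSL.
model:
  name: churn_model
  version: 1.2.0
features:
  - name: days_since_last_order
    source: orders
    transform: datediff(now(), max(order_date))
  - name: is_high_value
    rule: lifetime_spend > 1000   # a simple rule mixed in with learned features
```

The compiler turns declarations like these into feature pipelines, model training, and the deployment/monitoring plumbing, which is why adding a feature only requires justifying its value to the model.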
There are now even commercial applications that work somewhat similarly, like DataRobot, and I think they will become pretty popular in enterprise settings over the next 5 years.
In my experience, machine learning projects benefit from the wisdom in the Zen of Python (aka PEP 20):

> Explicit is better than implicit.
> Simple is better than complex.
> Complex is better than complicated.