Thoughts and tips on organizing models for a machine learning project


#1

I'm curious how the community manages their data science projects when tasked with a machine learning problem that requires multiple iterations of a single model.

  • Do you use a specific folder structure in which you keep your iterations?
  • Do you keep every iteration or just a few key ones?
  • Are you using any model management packages in R to do so?

Would love anyone's insight!


#2

I haven’t come up with an approach I’m fully satisfied with, but this is what I’m currently doing.

After each model run, I store the model as a .rda file alongside a text file of the same name that records the details of the run. Then I just use GitHub to version control it all in case I have to go back to it quickly. This doesn’t keep track of the actual code, though, which becomes a problem when using caret to run multiple packages (xgboost, mlr, etc.).
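A minimal sketch of that save-plus-notes pattern, using base R only. The helper name `save_run`, the `runs/` folder, and the `lm()` stand-in model are assumptions for illustration, not the poster's actual code.

```r
# Hypothetical helper: save a fitted model as <name>.rda plus a <name>.txt
# file that holds free-form details about the run.
save_run <- function(model, name, details, dir = "runs") {
  dir.create(dir, showWarnings = FALSE, recursive = TRUE)
  model_path <- file.path(dir, paste0(name, ".rda"))
  notes_path <- file.path(dir, paste0(name, ".txt"))
  save(model, file = model_path)  # restored later under the name `model`
  writeLines(
    c(paste("saved:", format(Sys.time(), "%Y-%m-%d %H:%M:%S")), details),
    notes_path
  )
  invisible(c(model = model_path, notes = notes_path))
}

# Stand-in model; any fitted object (e.g. a caret "train" object) would do.
fit <- lm(mpg ~ wt + hp, data = mtcars)
paths <- save_run(
  fit, "lm_wt_hp",
  details = c("formula: mpg ~ wt + hp", "data: mtcars"),
  dir = file.path(tempdir(), "runs")  # temp folder so the demo leaves no files behind
)
```

Committing the small text files is cheap; whether the binary .rda files belong in Git as well depends on their size.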

As a platform, I’ve found that the workflow that Domino provides makes it really easy to track a project. Each run of a model is divided into its own container, making it easy to track, and I’d love to come up with some ideas that replicate some of that functionality on my own machine.


#3

I’ll explain what I’m doing on my current project, but it’s definitely a workflow in progress, so I would love to hear what others are doing.

Each model gets its own dedicated script in the code/ folder of my project. Examples might include 32_rf_class_weights.R, 33_rf_no_class_weights.R, and 33_svm.R. There is also a 30_prep_model_data.R script that generates the data used in each of these models.

There might also be a 40_prep_model_data.R script that generates different training/testing data, along with model scripts that start with 41_ or 42_.

I use caret to train a model and then save the “train” object to an RDS file that lives in data/models/ and has the same name as the script that generated it, like 32_rf_class_weights.rds or 33_rf_no_class_weights.rds.
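The naming convention above could be sketched like this in base R. The helper `model_path_for` is hypothetical, and `lm()` stands in for the caret “train” object so the example runs without caret.

```r
# Hypothetical helper: map a script name to the model file it should produce,
# e.g. "code/32_rf_class_weights.R" -> "data/models/32_rf_class_weights.rds".
model_path_for <- function(script, model_dir = file.path("data", "models")) {
  stem <- tools::file_path_sans_ext(basename(script))
  file.path(model_dir, paste0(stem, ".rds"))
}

# A temp folder replaces data/models/ for this demo.
model_dir <- file.path(tempdir(), "data", "models")
dir.create(model_dir, recursive = TRUE, showWarnings = FALSE)
path <- model_path_for("code/32_rf_class_weights.R", model_dir)

fit <- lm(mpg ~ wt, data = mtcars)  # stand-in for the caret "train" object
saveRDS(fit, path)

# Any later script can reload the model by the same name.
restored <- readRDS(path)
```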

The code is versioned, but the models are not. I’m curious about other people’s take on whether this is a good idea or a bad idea.


#4

In my case, I’m running the same models many times with slight parameter changes on variable transformations and don’t need a record of each run.

I like to store my model parameters in a .csv file. I have a set of functions that build models based on the parameters file/dataframe. It works well for me because I can keep notes in the .csv file next to the parameters (why this transformation vs. others, why there are outliers, etc.). This helps me remember why I did something and convince people that I really know their data. Whenever I have a version of a model I like, I save a copy of that .csv file in a log folder (name = model_date.csv). This allows me to recreate that model from its parameters, or go back to whatever I had yesterday when I realize today’s model got worse instead of better …
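A rough sketch of a parameters-in-CSV workflow, assuming a made-up file layout (columns `variable`, `transform`, `note`) and a hypothetical `build_model()` helper. The poster's real functions are not shown in the thread, so this only illustrates the idea.

```r
# Hypothetical parameters file: one row per predictor, with a transformation
# and a free-text note explaining the choice.
params <- data.frame(
  variable  = c("wt", "hp"),
  transform = c("identity", "log"),
  note      = c("roughly linear in mpg", "right-skewed, so log it"),
  stringsAsFactors = FALSE
)
work_dir <- file.path(tempdir(), "csv_models")
dir.create(file.path(work_dir, "log"), recursive = TRUE, showWarnings = FALSE)
params_file <- file.path(work_dir, "params.csv")
write.csv(params, params_file, row.names = FALSE)

# Generic builder: turn the parameters table into a formula and fit it.
build_model <- function(params_file, response, data) {
  p <- read.csv(params_file, stringsAsFactors = FALSE)
  terms <- ifelse(p$transform == "identity",
                  p$variable,
                  paste0(p$transform, "(", p$variable, ")"))
  lm(reformulate(terms, response = response), data = data)
}

fit <- build_model(params_file, "mpg", mtcars)

# When a version is worth keeping, copy the parameters into the log folder.
snapshot <- file.path(work_dir, "log", paste0("model_", Sys.Date(), ".csv"))
file.copy(params_file, snapshot, overwrite = TRUE)

# Rebuilding from the snapshot reproduces the same model.
refit <- build_model(snapshot, "mpg", mtcars)
```

Because the model code only reads the table, changing a transformation means editing one CSV cell rather than a script.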

The functions are under version control and I’m keeping all the model details in a csv, so the code to actually run a model is really generic and I don’t need to worry about logging it. Pretty basic, but it works well for me with minimal effort.

Between updates and before data changes I save my model list object as well (it has parameters, data, transformed data, model stats and outputs, etc.), which is useful for comparing the implications that changing the data or adding more data has on a model.


#5

This is a pretty hard topic, and one that I personally think a lot of experienced data scientists gloss over and take for granted when explaining things to newer data scientists.

My preferred method is to version control code and use a production (master), dev, feature branch strategy. By that I mean each iteration of the model goes into a new feature branch. When I’m satisfied with the model change, I’ll push it to the dev branch where it will get A/B tested against the version in production. If the dev branch proves superior over time, it will replace the production branch. If not, the next feature branch will just replace that dev branch.

By using this branching strategy, I’m by default versioning my models, though I do also include a description file that has a version number that gets incremented on each update. For example, the production version might look like apiVersion: 1.1.2 and the dev version like apiVersion: 1.2.300, where the extra digits allow me to distinguish major updates, minor updates, and slight variations. I use that version naming strategy because it matches the rest of the software engineering team. I typically try to serve models as a RESTful API that just wraps predict(model, new_data), where model is a loaded .rds binary and new_data is the JSON submitted to the API. For this aspect I’m using the plumber package and running the R code in a Docker container.
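A minimal sketch of what such a plumber file might look like. The feature names `wt` and `hp`, the stand-in `lm()` model, and the `score()` helper are assumptions for illustration. The scoring logic sits in a plain function so it can be exercised without starting a server.

```r
# plumber.R: a sketch only; model, feature names, and version are assumptions.
# In a real container the .rds would be copied in at build time; here a
# stand-in model is written to a temp file so the script is self-contained.
model_file <- file.path(tempdir(), "model.rds")
saveRDS(lm(mpg ~ wt + hp, data = mtcars), model_file)

model <- readRDS(model_file)  # loaded once at startup, not on every request
api_version <- "1.1.2"

# Plain function holding the scoring logic, testable without a server.
score <- function(wt, hp) {
  new_data <- data.frame(wt = as.numeric(wt), hp = as.numeric(hp))
  list(
    apiVersion = api_version,
    prediction = unname(predict(model, new_data))
  )
}

#* Predict mpg from the JSON fields `wt` and `hp`
#* @post /predict
function(wt, hp) {
  score(wt, hp)
}

# To serve (requires the plumber package):
#   plumber::plumb("plumber.R")$run(host = "0.0.0.0", port = 8000)
```

Returning the version with every prediction makes it possible to tell afterwards which model produced a given response.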

I log all prediction requests/responses regardless of whether a decision is being made from them. This includes the model version that made the prediction. For example, another app might be using the production branch to get predictions, but I’m also passing that same request to the dev branch so I can compare the responses even if the dev branch predictions aren’t going anywhere but the logs. If the dev branch shows promise I might start to actually direct traffic to it. This may sound hard to set up, but it’s actually pretty easy in practice thanks to modern web servers like nginx and the concept of load balancers.
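The shadow-logging idea can be sketched in plain R, leaving out the web server and load balancer. The two `lm()` models, the log file location, and the `predict_and_log()` helper are all hypothetical stand-ins.

```r
# Stand-in models: "production" and a "dev" candidate (assumptions for the demo).
models <- list(
  production = list(version = "1.1.2",   fit = lm(mpg ~ wt, data = mtcars)),
  dev        = list(version = "1.2.300", fit = lm(mpg ~ wt + hp, data = mtcars))
)

log_file <- file.path(tempdir(), "predictions.csv")
if (file.exists(log_file)) file.remove(log_file)

# Score one request with every model, log each response together with the
# model version, and return only the production prediction to the caller.
predict_and_log <- function(request_id, new_data) {
  for (branch in names(models)) {
    m <- models[[branch]]
    entry <- data.frame(
      time       = format(Sys.time(), "%Y-%m-%d %H:%M:%S"),
      request_id = request_id,
      branch     = branch,
      version    = m$version,
      prediction = unname(predict(m$fit, new_data)),
      stringsAsFactors = FALSE
    )
    exists <- file.exists(log_file)
    write.table(entry, log_file, sep = ",", row.names = FALSE,
                col.names = !exists, append = exists)
  }
  unname(predict(models$production$fit, new_data))
}

served <- predict_and_log("req-001", data.frame(wt = 3, hp = 110))
logged <- read.csv(log_file, stringsAsFactors = FALSE)
```

Comparing the `production` and `dev` rows for the same `request_id` is then an ordinary data frame operation.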

I try to save the training data files in cloud storage, but sometimes I just have the SQL used to get the data (which is risky since the same query run at different times could return different data, though it is less risky when querying append-only database tables).

We have a whole other setup at my company that implements a predictive modeling DSL (domain-specific language), which basically boils down to having a config file in which you specify the model features (variables) and how to build them. The DSL then gets compiled and run on the JVM. It makes it possible to mix complex algorithms with simple ones like rule-based systems, while also automatically providing the tooling for deployment and monitoring. It also means that, during development, you only need to justify why the feature you’re adding benefits the model. That said, it’s kind of a black box system, and it can be pretty difficult to understand what happened since it uses some automated machine learning to build the final model.

There are now even commercial applications that work kind of similarly, like DataRobot, and I think they will become pretty popular in enterprise over the next 5 years.

https://www.datarobot.com/

In my experience, machine learning projects benefit from the wisdom in the Zen of Python (aka import this)

Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.


#6

I think there’s actually no golden recipe that works here but in most cases there are a couple practices that help a lot to organize your workflow efficiently:

  1. I typically write separate scripts for different stages of the project: data download and basic prep, EDA, data transformation and feature engineering, model training, model evaluation. That way individual code pieces are not too large and are easy to manage, especially when you realize in later stages of the project that you need to save more objects to test certain hypotheses.

  2. I keep names consistent across the entire workflow, in a very similar way to what @alexilliamson mentioned before. The basic logic would follow {stage}{action taken}{model type}.

  3. I apply the basic rules of the tidyverse: I always try to reuse the existing structure of a dataframe with list columns, where you can store your data, preparation methods, models, model results, etc. This also allows you to embrace functional programming with purrr and write as little code as possible, which avoids an explosion of object names. In an ideal scenario you end up with only one dataframe that contains all your modelling endeavors, to which you append additional rows whenever you would like to test another hypothesis or run another model.
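The one-dataframe idea from point 3 can be sketched as follows. To keep the example dependency-free it uses base R (`lapply`/`vapply` and a plain data frame) instead of purrr and tibble, which is a swapped-in technique. The model names and formulas are made up for the demo.

```r
# One row per modelling attempt; formulas and fitted models live in list columns.
runs <- data.frame(
  name = c("wt_only", "wt_hp", "wt_hp_qsec"),
  stringsAsFactors = FALSE
)
runs$formula <- list(mpg ~ wt, mpg ~ wt + hp, mpg ~ wt + hp + qsec)

# Fit every formula through the same code path instead of one object per model.
runs$model <- lapply(runs$formula, function(f) lm(f, data = mtcars))

# Derive comparable metrics as ordinary columns.
runs$rmse <- vapply(runs$model,
                    function(m) sqrt(mean(residuals(m)^2)),
                    numeric(1))
runs$n_terms <- vapply(runs$model,
                       function(m) length(coef(m)) - 1,
                       numeric(1))

# Compare runs without juggling separate model objects.
runs[order(runs$rmse), c("name", "n_terms", "rmse")]
```

With purrr the `lapply`/`vapply` calls would become `map()`/`map_dbl()`, and the structure stays the same.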

It took me a while to arrive at a consistent approach of using the tidyverse and caret, and I strongly recommend you guys read this post: https://rsangole.netlify.com/post/pur-r-ify-your-carets/. That will make your life much easier and modelling tasks more efficient :slight_smile:


#7

Thank you for this post! I hadn’t come across this nugget.


#8

Thanks @raybuhr! I really appreciate the thoroughness of your thoughts.

Do you use these models in production as well?

Can you suggest any resources that helped you obtain these results? I’m really curious how the routing has been set up. I might be completely wrong, but I assume you load balance between two instances of nginx (which serve the two versions of the API) to accomplish this task?


#9

@cdr6934

Yeah, dawg! Models in production is the main point of building the model, right? As long as the model can predict a single request quickly and serve the results over the network, my team is happy. For the most part, that means I can use whatever I want to train and build machine learning models. I just have to be careful to preprocess data, have consistent datasets, and not retrain the model on every request. Once I’m done training, I save the model as a .rds file in cloud storage, which gets built into the Docker container and thus can be loaded into the R plumber server and exposed on an HTTP port.

The load balancing stuff is handled automatically by the platform (devops) team, but it basically works just like using a server running nginx to redirect traffic to different servers. We can change the percentage of traffic going to each server. In our platform, we use Kubernetes to manage all this and I’m not sure of the details. For a simple setup, I think this is a good starting place:


#10

I recommend using make to manage the model execution part. You can make the R files and the data dependencies so that, if either is updated, make will recreate the model.

It’s ugly (and old), but below is some code I’ve been recycling to build a series of models on a data set. It assumes

  • all of the R files in the current path will be used to create a model (except for make.R),
  • each file outputs an RData file with the same name as the R file, and
  • unix or OS X are being used.

You can use make -i -j # to prevent stopping on errors and to run # files at the same time.

# file: make.R
R_files <- list.files(pattern = "\\.R$")
R_files <- R_files[R_files != "make.R"]

RData_files <- paste0(R_files, "Data")

###################################################################

## Break the files out into threes so that the make line isn't
## really long
over <- length(RData_files) %% 3
out_names <- if(over > 0) c(RData_files, rep("", 3 - over)) else RData_files
deps <- matrix(out_names, nrow = 3)
deps <- apply(deps, 2, function(x) paste("\t", paste(x, collapse = " "), "\\\n"))

make_depend <- paste(deps, collapse = "")
make_depend <- substring(make_depend, 3)
make_depend <- substring(make_depend, 1, nchar(make_depend) - 2)

make_operations <- 
  paste0(
    RData_files, 
    ": ", paste("dataset.RData", R_files), " ",
    "\n\t @date '+ %Y-%m-%d %H:%M:%S: starting  ", R_files, "'",
    "\n\t @$(RCMD) BATCH --vanilla ", R_files,
    "\n\t @date '+ %Y-%m-%d %H:%M:%S: finishing ", R_files, "'\n\n"
  )

cat(
  paste0(
    "SHELL = /bin/bash\n",
    "R    ?= R \n",
    "RCMD =@$(R) CMD\n",
    "all: ",
    make_depend,
    "\n\n",
    paste0(make_operations, collapse = "")
  ),
  file = "makefile"
)

I wouldn’t be surprised if there is an R package that does this better.


#11

If you’re looking for an R package, I’ve heard people recommend remake.

I don’t use make so I can’t personally speak to it, but I might check it out after reading your recommendation. Thanks for sharing!


#12

Thank you for the info!

Absolutely! I’ve just met a few people recently who have taken their R models and converted them over to a python implementation due to the speed and volume they were not able to get with the R models they created. Really more curious than anything.


#13

Thanks! Never used make so I’ll be looking into this further as it is a new way of doing this!


#14

Curious about what they were doing specifically that made Python faster… In my experience, neither is typically much faster than the other unless you don’t profile your code or look for opportunities to make quick wins.

For example, if you just use the randomForest package in R, you aren’t parallelizing the training, which is easy to do with python’s scikit-learn package. There are parallel random forest implementations in R, for example ranger, but for some reason people often don’t seem to try and do much research before they decide R is slow and move away.
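As a small illustration of the ranger point, the sketch below trains a forest on several threads through ranger's `num.threads` argument. The thread count, tree count, and the `fit_forest()` wrapper are arbitrary choices for the demo, and the call is guarded so the script still runs where ranger is not installed.

```r
# Sketch: ranger can train a random forest across several threads.
fit_forest <- function(data, threads = 2) {
  if (!requireNamespace("ranger", quietly = TRUE)) {
    message("ranger is not installed; skipping")
    return(NULL)
  }
  ranger::ranger(Species ~ ., data = data,
                 num.trees = 500,
                 num.threads = threads,  # parallelism is a single argument
                 seed = 42)
}

# Time the fit; rerun with a different `threads` value to compare on real data.
timing <- system.time(fit <- fit_forest(iris, threads = 2))
if (!is.null(fit)) print(fit$prediction.error)  # out-of-bag error
```

On a dataset as small as iris the threading overhead likely outweighs any gain, so a timing comparison only becomes meaningful on larger data.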


#15

@raybuhr

some reason people often don’t seem to try and do much research before they decide R is slow and move away.

I agree and I think that it is mostly because of the large number of options. scikit-learn is very canonical and well organized. It lacks the diversity of R’s offering (in terms of number of models available) but at least you can find things easily. The heterogeneous nature of R lends itself to disorganization. I would wonder how many people have seen the CTVs for example.

I see this, above almost all else, as being the main issue facing R. I’m an old S person and know my way around, but I must have started coding a function a dozen times before looking around and finding that there was already an implementation.


#16

One of the implementations was for a population health application where the company was scoring hospital readmission in the behavioral health space. From what I could gather, the R models couldn’t keep up in production with the sheer volume being piped into them, so the company hired software developers to port those models to Python using scikit-learn, drawing on their software engineering expertise to parallelize the process. So, as far as I could tell, the issue was really down to skills.

The developers were unfamiliar with R and the data scientists were unfamiliar with production software engineering.


#17

@Max totally agree on CRAN Task Views! They are awesome. I have them bookmarked for quick reference and often read through them before working on new projects. I also recommend them to literally everyone who talks to me about how to get good at programming in R.


#18

@cdr6934 I’ve gone through that exact use case before as well. I personally still feel it sometimes at my company now. There’s a huge gap between programming for data analysis and writing performant, maintainable, reliable code, and most R programmers I know are far better at the former than the latter (myself included).

However, I don’t think that means the data scientists should just toss out the R code and let the engineers take over. Data scientists need to own their models in production. That’s not the same as being responsible for the prototype. I know it’s popular to hand off models to engineers, but there are often sacrifices made in terms of understanding and implementation when doing so. Instead, data scientists need to pair program with the engineers to build more performant, reliable code. Even if the engineers don’t know R, it’s reasonably easy to read/figure out if you do know Python… and 100 times easier if you’ve got the R programmer who wrote it sitting next to you as you go through it together.

Sometimes it hurts getting told your R code is fragile or slow or ugly. Instead of giving up and letting the “experts” try it on their own (probably missing key assumptions along the way), we need to be resilient and work together to address the issues and become better programmers.


#19

Hi! Awesome to find this thread, as this specific need/concern stays at the top of my mind, and I have yet to find a productive solution for it...
And I've been researching a lot :slight_smile:, so I would be very keen to work on these concepts and this thread.

This is not specifically about tracking models in production but about the dev/prototype phase: how do you properly save/track/compare/evaluate the different results in an ML project, for each possible path through the DAG? (I tend to see ML projects as groups of DAGs: pipeline, params, algorithms, hyperparams.)

Why are teams wasting so much CPU training things that get "lost", are never properly compared/evaluated, and are never shared with the team?

When anyone reaches a model result, they should be able to just call track(model, any_additional_metadata_I_may_add) to write it to team-shared storage (for me, it has to be that dead simple... see openml publish). It would track the final result from the DAG path, adding each node's params and its parent nodes' params, something like that. So everything can be compared, and reloaded whenever needed.
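To make the idea concrete, here is one way such a `track()` could look with nothing but the filesystem: binary files for the models and a CSV index for the metadata. Everything here (function name, store layout, ID scheme) is a hypothetical sketch, and it is not safe for concurrent writers on a truly shared drive.

```r
# Hypothetical track(): save the model as a binary file and append one row of
# metadata to a CSV index so runs can be compared and reloaded later.
track <- function(model, metadata = list(), store = "model_store") {
  dir.create(store, recursive = TRUE, showWarnings = FALSE)
  index_file <- file.path(store, "index.csv")
  exists <- file.exists(index_file)
  n_prev <- if (exists) nrow(read.csv(index_file)) else 0
  id <- sprintf("run_%04d", n_prev + 1)

  saveRDS(model, file.path(store, paste0(id, ".rds")))

  row <- data.frame(
    id         = id,
    tracked_at = format(Sys.time(), "%Y-%m-%d %H:%M:%S"),
    class      = class(model)[1],
    metadata   = paste(names(metadata), unlist(metadata),
                       sep = "=", collapse = ";"),
    stringsAsFactors = FALSE
  )
  write.table(row, index_file, sep = ",", row.names = FALSE,
              col.names = !exists, append = exists)
  invisible(id)
}

store <- file.path(tempdir(), "model_store")
unlink(store, recursive = TRUE)  # start the demo from an empty store

track(lm(mpg ~ wt, data = mtcars),
      list(algorithm = "lm", features = "wt"), store)
id2 <- track(lm(mpg ~ wt + hp, data = mtcars),
             list(algorithm = "lm", features = "wt+hp"), store)

index <- read.csv(file.path(store, "index.csv"), stringsAsFactors = FALSE)
reloaded <- readRDS(file.path(store, paste0(id2, ".rds")))
```

Because the index is plain CSV, a Python process could read the same metadata; only the .rds binaries are R-specific.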

small experiment just to illustrate:


(This doesn't persist anything to the filesystem, though; now I would like something to track the DAG result nodes. It mostly served to ask on Twitter whether similar packages already exist. Steph Locke shared recipes with me, which is amazing. Note: Python has https://github.com/scikit-learn-contrib/sklearn-pandas)

Also, why can't we jointly compare R/Python models? At least results/prediction-wise it should be possible; openml/mlr actually has very good concepts here (namely the concept of an agnostic machine learning task as the root node in the DAG). The only thing we would need is a kind of private/team openml server, e.g. one per project. I would lean more toward filesystem-based storage (no server), e.g. CSV for metadata, predictions, datasets, and resample fold info, with binary files only for the actual models. That way mostly everything could be reused from R/Python/others (e.g. just start a Docker image on the results folder to get a model evaluation UI, like openml).

Ideas? Does this already exist? Thanks for the brainstorm!
Rui
ps-some references
https://mitdbg.github.io/modeldb/
http://modeldb.csail.mit.edu:3000/projects
https://www.dataiku.com/



https://kaixhin.github.io/FGLab/


Ideas, tips & packages: model tracking, persistence, model db/store with pipeline params/metadata, reproducibility?
#20

Could you slightly rewrite this reply, so that it stands as a stand-alone, new topic?
This is an old thread, so it's probably better for you to just reopen this conversation anew, focused on your set of questions.

I'd put this under the ML category.