Leaving travis for docker – will I get my life back?

docker
git
travis-ci
reproducible

#1

(Context: This is about using Travis CI/CD to enhance reproducibility of random R projects, not package dev.)

Just this morning, I wasted another hour or so debugging a newly failed Travis build. (Turns out someone else had already wasted a day figuring out that I needed another system dependency.)

This kind of thing keeps happening and Travis CI has become a major productivity sink.

I understand that this is, in some way, inevitable to ensure reproducibility: the failed Travis build indicated that I was relying on some undocumented state on my desktop (said system dependency).

Travis just doesn't make this easy:

  • Even with all the caching bells and whistles the build times can be pretty slow in Travis (>>20mins), especially when using LaTeX. This can really drag out the debugging.
  • Travis (now?) has a debug mode that you can SSH into, but it's kinda insecure (I hear), and it can still be hard to debug an R problem just from the shell of the headless VM you're being dropped into.
  • Travis can sometimes be unreliable (connection timeouts, backlogs or other service disruptions).
  • ...
  • Most unnerving, debugging Travis is duplicate work, because I always additionally need to manage dependencies on my desktop. If you add some (yet subtly different) production environment on top, say shinyapps.io, you're multiplying sources of error.

All this led me to consider Docker again, where, supposedly, I could:

  1. Define all my (system) dependencies in a Dockerfile (using a versioned Rocker image).
  2. Build the image and spin it up on my desktop, doing all of my analysis inside it.
  3. Just to be extra safe, have each commit trigger some (hopefully faster?) CI/CD (say, Google Cloud Build) to rebuild the image and compile my *.rmd or whatever.
  4. Profit, because I'd now always only solve dependency management problems once.
  5. (As a bonus, I'd have an image which might be more easily deployable/scalable, but that's a different ballgame).
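
A minimal sketch of what steps 1 and 2 might look like. The image tag, the system library, the R package and the file name `analysis.Rmd` are placeholders chosen for illustration, not anything from an actual project:

```dockerfile
# Versioned Rocker image: the tag pins the R version (example tag only)
FROM rocker/verse:3.5.1

# System dependencies are declared here instead of living as undocumented
# state on the desktop. libudunits2-dev is just a stand-in example.
RUN apt-get update \
 && apt-get install -y --no-install-recommends libudunits2-dev \
 && rm -rf /var/lib/apt/lists/*

# R package dependencies (install2.r ships with the Rocker images)
RUN install2.r --error units

# Copy the project in and render a (hypothetical) report by default
WORKDIR /home/rstudio/project
COPY . .
CMD ["Rscript", "-e", "rmarkdown::render('analysis.Rmd')"]
```

The same file would then be what the CI/CD service in step 3 builds, so a missing dependency only has to be fixed in one place.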

I've raised a related (but broader) topic before, with great suggestions from @cole and @tareef. There's also already a ton of fantastic resources and packages, many of them listed in this thread. It all reads pretty encouragingly.

On the other hand, this RStudio document sounds pretty cautious:

For data scientists, the time between starting a project and writing the first line of code is an important cost. Often dedicated analytic servers outperform containerized deployments by allowing users to create projects with little overhead.

and:

Dockerfiles do not ensure reproducibility. A Dockerfile contains enough information to create an environment, but not enough information to reproduce an environment. Consider a Dockerfile that contains the command “install.packages(‘dplyr’)”. Following this instruction in August 2017 and again in December 2017.

(I think you could get around this by using install2.r and MRAN in your Dockerfile? Also, absent good ol' packrat, the same problem exists on Travis.)
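
Roughly, that idea could look like the sketch below. The snapshot date and the image tag are arbitrary examples, and the URL assumes MRAN's dated-snapshot layout:

```dockerfile
# Example tag only; pins the R version
FROM rocker/r-ver:3.5.1

# Install dplyr from a dated snapshot instead of "whatever CRAN has today",
# so a rebuild months later should resolve to the same package version.
# The snapshot date below is an assumption made for illustration.
RUN install2.r --error \
    --repos https://mran.microsoft.com/snapshot/2018-09-01 \
    dplyr
```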

Proper tooling for an analytics workflow centered on Docker will take an order of magnitude more work than supporting a traditional, dedicated, and multi-tenant analytics server.

yikes.

So, I'm a bit confused, and worried this might be one of those situations:

Some people, when confronted with a problem, think “I know, I'll use regular expressions.”
Now they have two problems.

So I'm curious what other people's recommendations and experiences are with this:

Will dockerizing each R project save me time, at equal or greater reproducibility than the usual Travis workflow?

(As mentioned at the outset, the concern here is with reproducibility and iteration speed – not pkg dev, deployment or scalability).


#2

I use Docker for R and am a cheerleader, so take it all with a pinch of salt, but in my experience the answer to your question is yes, it does save time over relying on Travis, although I think that is mostly because you have less control over the Docker image Travis uses than over your own. My opinions on the points you raised are:

For data scientists, the time between starting a project and writing the first line of code is an important cost. Often dedicated analytic servers outperform containerized deployments by allowing users to create projects with little overhead.

There is a learning curve to Docker, but once comfortable with it my workflow is 1) Start a project 2) Get it working locally in my local environment 3) run containerit on the file 4) commit it all to GitHub 5) Go for coffee whilst first Docker image builds on Build Triggers (now Cloud Build) 6) Deploy to X (presently Kubernetes)

So the time to first line of code is pretty much the same as without Docker.

Dockerfiles do not ensure reproducibility. A Dockerfile contains enough information to create an environment, but not enough information to reproduce an environment. Consider a Dockerfile that contains the command “install.packages(‘dplyr’)”. Following this instruction in August 2017 and again in December 2017.

I disagree with that: you could pin the package version in your Dockerfile, but I don't usually bother even with that, as each build of the Dockerfile creates an image with a unique label (don't rely on the latest tag), and that is my version control. So say I need to rely on dplyr 0.6 and my code will break on dplyr 0.7: I will make sure to call the Docker image that was built at that time, which is preserved as an image/tag on Container images. (The Dockerfile is built on every GitHub push.)
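
For anyone who does want the pin in the Dockerfile itself, a rough sketch might look like this. The version number and the CRAN mirror are examples only, and `remotes::install_version()` is just one way to do it:

```dockerfile
# Example tag only; pins the R version
FROM rocker/r-ver:3.5.1

# remotes provides install_version(), so install it first
RUN install2.r --error remotes

# Pin dplyr to one specific release (0.7.4 is an illustrative version)
RUN Rscript -e "remotes::install_version('dplyr', version = '0.7.4', repos = 'https://cloud.r-project.org')"
```

This only fixes dplyr itself; its dependencies would still resolve to whatever the repository serves on the day of the build.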

Over time I've built up some convenience Docker images derived from rocker with, say, private repos installed, or other dependencies, which can help lower build time (for example, installing CausalImpact() can take 30+ minutes in a fresh install, so I have an image with that installed that I can use in my FROM for other Dockerfiles). It's all super convenient. I guess I am always deploying into the cloud, so I can't comment on whether it is as smooth for local use or your own servers.
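
As a sketch of that pattern (the registry path, image name, tag and `analysis.R` are all hypothetical), the slow-to-install dependency goes into a base image that is rebuilt rarely:

```dockerfile
# Base image Dockerfile: built once, reused by many projects
FROM rocker/r-ver:3.5.1
RUN install2.r --error CausalImpact
```

and each project's Dockerfile then starts from a specific build of that image:

```dockerfile
# Project Dockerfile: start from the prebuilt base image.
# The registry path and tag are made up; pin a specific tag, not "latest".
FROM gcr.io/my-project/r-causalimpact:3.5.1

WORKDIR /app
COPY . .
CMD ["Rscript", "analysis.R"]
```

Project builds then skip the long install, since that layer already exists in the base image.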

But the real bonus on top of the reproducibility has been that Docker is supported and growing everywhere, and stuff like Kubernetes means scaling up R APIs and Shiny apps is all available. The latest development is running containers in the cloud directly without needing a server, so the same Dockerfiles made a couple of years ago for running on my servers now work in those environments with no changes. I think the endgame is that you upload your code, it detects the requirements and builds the Dockerfile for you, then it runs. From what I've heard, RStudio is also working more on container support in the nightly builds.

So YMMV etc. but I am real happy with it :slight_smile:


#3

Great post! A couple of thoughts:

Exactly. You need reproducibility in the repo or a smarter client (i.e. packrat or something like it), otherwise you are depending on the state inherent in apt, yum, CRAN, etc. These days, I would actually recommend using RStudio Package Manager, which has support for "MRAN-like" snapshots, but can do so for any package (i.e. GitHub packages, some random .tar.gz package, etc.).

To be clear about the problem statement here, you are looking for the reproducibility of a system. There are many solutions to this problem. I think they generally land between Docker and other types of infrastructure-as-code provisioning (Ansible, Terraform, Chef, Puppet, etc.) to make the infra dependable / reproducible.

The problem you are shooting for could potentially also be solved by a better link between R packages and system dependencies, but I am fairly certain that doesn't exist yet. Although it is a known problem, and I'm sure there are folks working on it!

As @MarkeD mentions, I think Docker can be a really nice solution to the problems you present. The quotes you referenced that edge away from Docker are trying to be clear that Docker is not a hype word that you can just invoke to solve all of your problems. It sounds like you have a more balanced expectation than that: there is a learning curve, but it does have some benefits from the system isolation perspective without requiring infrastructure-as-coding a full VM. It will also require some Linux know-how, but using Travis CI already puts you in that category.


#4

Thanks so much @MarkeD for taking the time to reply. Also a huge fan of all your cloudyr work and your thoughtful comments on the related thread – very helpful.

I like to commit (with CI/CD) early and often, and am wondering how that might work. Have you developed your analyses from inside a Docker container running locally (or on GCE, FWIW)?
That would mean that whenever (system) dependencies change, I would (strictly speaking) have to rebuild from the Dockerfile, right?
containerit seems very awesome.

I think I'd go for pinning the package version, or perhaps even using MRAN. I like everything to be reproducible just from the repo, with no other artefacts (container images).

Can't wait to hear more about that.


#5

& @MarkeD

Wanted to drop this here as well, in case y'all haven't seen!

EDIT: You linked to this above. My goof :slight_smile: Or Discourse's... I thought it was supposed to yell at me if I shared the same link again. :laughing:


#6

Thanks so much; that helped me clarify my thinking about this.

Ah, I kinda glossed over this, because I figured that RSPM only makes sense for enterprise customers.

Any chance RStudio might be planning on offering (part of) this as a service? I'm too small a shop to procure / set this up for myself :slight_smile:.
Not sure how this would work, but it would definitely be worth some money to me to have an easier way to get reasonably solid access to old (compiled?) packages from various sources from a central service.
(This would, after all, solve the awkwardness of storing sources with packrat).

Yes, in this specific case (and more often than not) system dependencies are the problem, though sometimes I'll also muck up package dependencies (having updated to a gh version of some pkg on my desktop, but not in DESCRIPTION, etc. – the usual shenanigans).

My central pain point is more general, however; I feel like doing all this twice (plus production environment) is a waste. That made me think about Docker, because I had this idea that I could spin up the exact same image locally (or run it from the cloud, if I'm doing something computationally expensive), and skip all this duplicate work.
If this were somehow more deeply integrated in RStudio, btw, I think that would be fantastic.

Uh yeah :frowning:
I do like the process of shinyapps-package-dependencies; I was wondering whether that could be scaled to uses!


#7

I commit a lot too, and each time it builds in the background. It's a good check to see if anything has broken, such as you introducing a new dependency without adding it to the Dockerfile again.

Developing inside the container is possible, as is freeze-framing the container as is with the packages you may have installed (that is the logic behind what gce_push_registry() allows you to do if you set it to a running container), but I tend to only use Docker to deploy "finished" apps, so I find it easiest to use my local session and create versions as needed.


#8

Very interesting discussion! @maxheld83 thanks for bringing all that up!

Just some thoughts based on my experience after reading through the thread.

On this, I would recommend looking at an R-hub service: the sysreq API and its companion :package:. It helps you find out what the system requirements for a package are. In some cases it helps greatly to know what is missing.

Also, to share our experience on this: we use Ansible playbooks to recreate environments in our system. This allows us to set up exactly the same environment for R code on different machines. One of the playbooks installs some R packages and their known system dependencies. For binary repositories and versioning, we use Nexus Sonatype, in its OSS version. It helps us take snapshots of what we need.
We also use it for R packages through a community R plugin. It does not have the full feature set of RSPM, but this open source version does the job correctly. The ability to create per-project repos is a way to isolate package versions.

You mentioned packrat: in some cases, its bundle feature (keeping the src packages with the project) has helped me with deployment. Restoration is easy, provided you have dealt with system requirements some other way.


#9

sysreq is what containerit uses to build the Dockerfile. It is magic.