Collaborating on data science projects

pedram · December 13, 2017, 3:08pm

This isn't a question specifically about RStudio/R, but rather on data science projects more generally. What sort of practices have you found are critical when working on projects as a team, rather than as an individual? How do you structure projects, define roles and responsibilities, segregate tasks, etc.?

My boss and I were working on a project together, which quickly turned into just him working with tunnel vision and delivering something that didn't end up working. To his credit, he recognized the failure, and now we're trying to brainstorm how to better collaborate on data science projects. It's not intuitive for either of us, as we're both used to delivering from start to finish on our own.

nwerth · December 13, 2017, 4:10pm

A few colleagues and I are exploring tools for collaboration in our unit, and here's our experience:

Organization
- We're following the general structure from Project Template (http://projecttemplate.net/architecture.html), but not using the package.
- If your group has more experience with R, a package might be the best way to organize. You'd also end up with an easily-shared final product.
Version control
- You absolutely need this. Bonus if you're using a system that can merge or undo changes. I suggest Git. Not the simplest system, but it is the most prevalent. And there's plenty of learning material online.
Tasks
- We use Visual Studio Team Services to host our Git repository, but any professional code management software would do (GitHub, GitLab, Bitbucket, etc.). Those should have a place to create and assign tasks/issues.
- Whichever platform you use, require that all changes go through a pull request and get reviewed. Nobody should be changing the code without the other person having a say, even if that say is just "I trust you, go ahead."
Working product
- Adopt a sprint mentality: meet regularly (e.g., every two weeks) to show the most recent "working" version of the project. It's not done, but it should show where you're going. Have somebody else (ideally the person who'd use the final product after all this) sit in the meeting and give their feedback. This keeps you from going too far down a path because you want to when nobody else cares.
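
For the folder structure mentioned under Organization, here is a minimal sketch in base R of setting up a shared layout without using the package itself. The folder names are a subset of the ones ProjectTemplate uses; the project name and the choice to build it under a temporary directory are just assumptions for illustration.

```r
# Hypothetical project location; a temp directory keeps the example harmless.
project_root <- file.path(tempdir(), "example-project")

# A subset of ProjectTemplate-style folders (adjust to what your team needs).
dirs <- c("data", "munge", "src", "lib", "reports", "graphs", "tests")

# Create each folder under the project root; existing folders are left alone.
for (d in dirs) {
  dir.create(file.path(project_root, d), recursive = TRUE, showWarnings = FALSE)
}

# List what was created so everyone can see the agreed layout.
print(list.dirs(project_root, full.names = FALSE, recursive = FALSE))
```

Committing an empty skeleton like this to the shared repository early on gives everyone the same answer to "where does this file go?".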

raviolli77 · December 14, 2017, 1:11am

I help lead the Data Science org at UCSB, and we generally take the approach of team-driven projects.

For the structure of the project/repo we use the Cookie Cutter Template, although we've created a simplified version of it.

We've picked up on Agile Team Management, so we do things like sprints and stand-ups while meeting weekly.

We have them create a general outline, which can be revised iteratively because people often won't know up front what the project will look like, but which still keeps them accountable.

We also introduced the concept of milestones, which are simple outlines that show us and them the progress they've made each week or meeting. Here's a simple example. This helps tie the outline together, with one person in charge of making sure stand-ups happen and blockers are addressed.

It's a learning process, but we think we're reaching a timeline we're comfortable with. The biggest hurdle we face right now is the finished product; we're still learning how to make sure it is satisfactory.

nwerth · December 15, 2017, 4:15pm

Also, remember the guiding axiom of any system: try new habits, but only keep the ones that help.

hao_ye · December 15, 2017, 7:35pm

I'm still working on a template for my current lab, but one tip I've picked up recently is to use the here package to define paths. Jenny Bryan had a writeup that I found useful here.

In essence, it solves the problem that in RStudio projects the default working directory is the project root, but in R Markdown files the default working directory is the folder that contains the .Rmd file. So this helps to set paths that work regardless of whether the code is in a .R or .Rmd file, and the resulting paths are platform-independent, which makes it easier to collaborate across OSs.
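
As a rough illustration of the here package described above, this is a minimal sketch. The file `data/raw/measurements.csv` is hypothetical, and the fallback branch is only there so the snippet still runs if here isn't installed.

```r
if (requireNamespace("here", quietly = TRUE)) {
  # here::here() finds the project root (for example via an .Rproj file or a
  # .git folder) and builds the path from there, so the same call works from
  # a .R script and from an .Rmd file in a subfolder.
  data_path <- here::here("data", "raw", "measurements.csv")
} else {
  # Fallback without 'here': a path relative to the current working directory.
  data_path <- file.path("data", "raw", "measurements.csv")
}

# The path components are passed separately, so no OS-specific separator
# is hard-coded in the script.
print(data_path)
```

The file would then be read with something like `read.csv(data_path)` instead of a hard-coded absolute path or a `setwd()` call.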

tigerbait · December 16, 2017, 1:44am

What about environment management? My colleague couldn't run my analysis due to different R and package versions. I've considered Anaconda, but it seems to be limited to certain packages. I could be wrong, as I haven't used it yet. Any suggestions?

nwerth · December 20, 2017, 9:35pm

I wish I could help more, but (as far as I can see) there are only two options:

1. R is installed and maintained in a standard way on all computers. This means there's an administrator, so it's only possible if you have an IT group willing to do it or are willing to do it yourself.
2. Everyone agrees to write programs for the latest versions of R and any packages. If somebody doesn't have a package, they install it. If they're behind in versions, they upgrade. Upgrading packages is easy with install.packages, and upgrading R is almost painless with the installr package.

If you go down the admin route in #1, there's a lot of good discussion on keeping everyone compatible in the thread "What are the main limits to R in a production environment?" About halfway through that discussion, solutions for "version problems" come up. It's a lot of great info, so I won't copy it here.

tigerbait · December 25, 2017, 4:10am

Thanks for the reply. Unfortunately, we are not in a production-use situation and consequently are coordinating on our own. Simply writing programs with the latest packages may be the easiest option, but it doesn't really solve the reproducibility problem when R or package updates introduce backward-compatibility issues. At a minimum, I guess logging the versions of R and the corresponding packages is the way to keep a record so the environment can be recreated if needed.
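
A minimal sketch of that kind of version logging, using only base R. The output file names and the use of a temporary directory are assumptions for illustration; in a real project the files would more likely sit next to the analysis and be committed with it.

```r
# Hypothetical output locations.
log_file <- file.path(tempdir(), "session_info.txt")
csv_file <- file.path(tempdir(), "package_versions.csv")

# Human-readable record: R version, platform, and attached/loaded packages.
writeLines(capture.output(sessionInfo()), log_file)

# Machine-readable record: one row per loaded package with its version.
pkgs <- loadedNamespaces()
versions <- data.frame(
  package = pkgs,
  version = vapply(pkgs, function(p) as.character(utils::packageVersion(p)),
                   character(1)),
  row.names = NULL,
  stringsAsFactors = FALSE
)
write.csv(versions, csv_file, row.names = FALSE)

# The R version itself, as a single string.
print(R.version.string)
```

Running this at the end of an analysis script records what was actually loaded for that run, which a colleague can compare against their own setup.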