Best practices for collaborating in git with private data

At my new job, we’re thinking through the best way to store, share, and structure our shared research projects, and we’re hoping for some input on what works for other folks.

The basic idea is that for each project we work on, we have multiple analysts that may contribute and review code. The data for these projects cannot be shared publicly, so we don’t want to track it in git (even with a private GitHub repo). What are the best approaches for giving analysts access to the data in their local dev environments? Most advice I’ve read assumes that the code and data live in the same project folder, but I don’t think that is what we’re aiming for.

One thing we have considered is storing the data in SharePoint/OneDrive and using Microsoft365R to access it. That way, the data isn’t stored in the same project folder as the analysis code, but we can write code to access the data files that should work for everyone. We could also write cleaned versions of the data or other output to SharePoint/OneDrive, since we similarly do not want those tracked in git. Does this seem reasonable? Or is it overcomplicating things to have separate storage locations for data and code?

At my company we have a private GitLab installed on an on-prem server. It's inside our firewall and connected with the same Microsoft authentication as all our other software. Maybe something like that would meet your needs.

Thanks @arthur.t! I’m not sure my org has the resources to host anything on-prem at the moment. But in your case, do you store the data for your projects in repos on the GitLab server, along with the code and other documentation/output?

No prob. I hear you.

We don't have a formal policy, other than that all code should be in GitLab by project completion, hopefully with periodic pushes throughout the life of the project.

I usually initialize a local git repo and commit to it. Then I create a project on GitLab via the web interface and choose "no" when asked to initialize it. Then I link my local repo to the GitLab project and start committing and pushing every few hours.

As for the static data, it's usually in SharePoint. I use the SharePoint sync feature to mirror it as a Windows directory, and then I can reference that path in my code. Someone in the future could hypothetically clone my code, sync SharePoint, edit a few paths in the code, and get it to work. It's not the best solution, since the SharePoint sync path has your username in it, but we haven't come up with anything better. Most of our projects that have to reference static data are "one-offs", so we don't sweat it too much.

The larger projects that produce living software tend to reference data in a database, and have more developer resources, so there's no SharePoint component.

Thanks for the additional details! It sounds like we may have a similar setup, with static data hosted on SharePoint. SharePoint syncing sounds like it works well for a single analyst, but it could get cumbersome if each analyst working on the project has to reset the path to the data when they clone the project.

I haven't gotten Microsoft365R up and running yet (waiting for IT to change some permissions), but that seems like it could be a solution to SharePoint data access that doesn't rely on a user-specific path and could be shared across analysts.

And yes, totally agree that it would be 💯 to have our data in a database, but I'm not sure that's possible for us yet.

Cool. Yes, that's exactly right. There's probably some refinement you could bring to the SharePoint sync strategy, like getting the user name and constructing paths programmatically, so anyone who's created a sync could run the software without editing paths in the code. But I'm sure there are better solutions out there for sharing access to static files with authentication.
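To make the "construct paths programmatically" idea concrete, here is a rough sketch. It's in Python only for illustration; the same logic would carry over to R. Everything named here is hypothetical: the `sharepoint_data_dir` helper, the `PROJECT_DATA_DIR` environment variable, and the folder names. It also assumes the synced library lands under each user's home directory in an organization-named folder, which you'd want to check against your own sync setup.

```python
import os
from pathlib import Path


def sharepoint_data_dir(org_folder, library_folder,
                        env_var="PROJECT_DATA_DIR", home=None):
    """Return the local folder where the synced SharePoint data is expected.

    Hypothetical helper. Assumes the sync client mirrors the library to
    <home>/<org_folder>/<library_folder>; adjust to your actual layout.
    """
    # An explicit override wins, for analysts whose sync lives elsewhere.
    override = os.environ.get(env_var)
    if override:
        return Path(override)
    # Otherwise build the path from the current user's home directory,
    # so no username is hard-coded in the project code.
    base = Path(home) if home is not None else Path.home()
    return base / org_folder / library_folder


# Example usage with placeholder folder names (assumptions, not real paths).
data_dir = sharepoint_data_dir("Example Org", "Project Site - Documents")
raw_file = data_dir / "raw" / "measurements.csv"
```

The environment-variable override is a design choice, not a requirement: it gives anyone with a non-standard sync location a way to run the code without editing it, while everyone else gets the default.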

A lot of our data is R&D data from labs, so it's static files produced by an instrument or copied from a scientist's computer. Sometimes a nightmare...
