Collecting tidy data - best practices

best-practices

#1

Building on this post by @martj42 which already has a lot of good suggestions.

The HOW:
If you're manually collecting data, how do you set yourself up to reduce the amount of time spent tidying?

Sometimes I find it difficult to imagine what my data would look like in its tidy state, e.g. I'll have an ID variable, then anything between 10 and 50 variables where each variable has between 1 and 10 observations per ID. The manual input feels incredibly messy and error prone in itself. Add a few extra collaborators and you've got a recipe for disaster. Do you collect data in whatever way seems most natural for your current project, or do you keep it tidy from the beginning? How?
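One way I can imagine keeping it tidy from the start (just a sketch; the variable names below are invented) is to have everyone enter data in long format, one row per ID/variable/observation, and let tidyr reshape it afterwards:

```r
library(tibble)
library(tidyr)

# Long-format entry: one row per id / variable / observation number
long <- tribble(
  ~id, ~variable, ~obs, ~value,
  1,   "weight",  1,    72.5,
  1,   "weight",  2,    71.9,
  1,   "height",  1,    180,
  2,   "weight",  1,    65.0
)

# Reshape to one column per variable/observation combination
wide <- pivot_wider(long,
                    names_from  = c(variable, obs),
                    values_from = value)
```

With long-format entry, collaborators only ever append rows, which is much harder to get wrong than filling in a sprawling wide sheet.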

The WHERE:
Google Sheets and Excel are "good" options - but not very safe in my opinion (I'm still living in a world where Excel files and article feedback get emailed around with _namedate). I could use Google and the googlesheets package to get some level of version control, but I'd prefer something like REDCap. REDCap, however, seems to require a lot of resources and I'm not sure we'll be able to use it. Are there any other options that work well with R?

Edit: There's now a duplicate of this post at rOpenSci, already with some great input from @noamross!


#2

Do you collect data in whatever way seems most natural for your current project?

I rarely have control over this (like, I can't remember the last time that happened), but I'd say yes. The beauty of writing scripts to process and wrangle your data is that you don't have to redo that work every time. You certainly want to have your data well-structured, but the best method is decidedly scenario-dependent.

As for the where, I think this also depends. If I had to type a bunch of numbers in, I'd probably use Google Sheets, but I don't know much (read: anything) about REDCap. Another option, if you're entering text that repeats (categories, etc.), is to use something like Google Forms, where you can restrict the set of choices. That way you avoid things like spelling mistakes, which just make processing your data later a bit more annoying, and skip some of the pain of manual entry.
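Even with a restricted form, it's worth checking the imported categories against the allowed set. A base-R sketch (the categories and responses here are made up):

```r
# Hypothetical allowed categories and some typed-in responses
valid <- c("low", "medium", "high")
responses <- c("low", "High", "mediun", "high")

# Normalize case and whitespace, then flag anything not in the allowed set
cleaned <- tolower(trimws(responses))
bad <- setdiff(cleaned, valid)
# bad contains "mediun" -- a typo a restricted form would have prevented
```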

Here's a post on using Google Forms for data collection (and analysis, but obviously I wouldn't recommend that :stuck_out_tongue_winking_eye:)

I'm sure there are lots of other options, just speaking to the limited number of tools I know!


#3

Thank you for the feedback!
It's usually a mix of categories and numbers. Google Sheets is probably my best option at the moment, but I'm worried about what could happen to my knees if the big bad wolf (GDPR) found out I'm using it - even without personal information.

Do you think most people here find themselves in the same non-collecting state as you? It just dawned on me that this post might find a better home in the rOpenSci forum.


#4

There's always the possibility of using Shiny for data entry (assuming you can find a server to host it on, or host it on your organization's LAN).
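For what it's worth, a minimal sketch of what a Shiny entry form could look like (the field names and the CSV destination are placeholders; a real deployment would want a database and authentication rather than a flat file):

```r
library(shiny)

ui <- fluidPage(
  textInput("id", "Participant ID"),
  selectInput("category", "Category", choices = c("A", "B", "C")),
  numericInput("value", "Value", value = NA),
  actionButton("save", "Save")
)

server <- function(input, output, session) {
  observeEvent(input$save, {
    row <- data.frame(id = input$id, category = input$category,
                      value = input$value, timestamp = Sys.time())
    # Append each submission to a CSV, writing the header only once
    write.table(row, "entries.csv",
                append    = file.exists("entries.csv"),
                col.names = !file.exists("entries.csv"),
                row.names = FALSE, sep = ",")
  })
}

app <- shinyApp(ui, server)
```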

More broadly, I think that this points to a bit of a hole in the R ecosystem, namely a good standardized method for data entry. I'd be really interested in seeing a better workflow than:

  • Create sheet and corresponding data entry form with Excel or Google Sheets + Google Forms
  • Send out link/file for entry
  • Get results into R
  • Munge
  • Analyze

For small projects this works alright but for large projects it pretty quickly becomes a pain, especially if there's any level of recipient-focused customization/autofilling/etc...
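For the "get results into R" and "munge" steps, a sketch of what that usually looks like (the export file and column names below are invented; the fake CSV just stands in for a real Forms/Sheets export):

```r
library(readr)
library(dplyr)

# Stand-in for a real Google Forms export (file name is made up)
write_csv(tibble::tibble(`Participant ID` = c(1, 1, 2),
                         `Value`          = c(3.2, 3.2, NA)),
          "form_responses.csv")

raw <- read_csv("form_responses.csv", show_col_types = FALSE)

# Standardize column names, fail loudly if an expected column is missing,
# then drop exact duplicates and empty responses
names(raw) <- tolower(gsub("[^A-Za-z0-9]+", "_", names(raw)))
stopifnot(all(c("participant_id", "value") %in% names(raw)))

clean <- raw |>
  distinct() |>
  filter(!is.na(value))
```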


#5

Don't know if this duplicate-post thing was the best idea, but if you don't mind venturing outside the confines of RSC, please check out the last two posts at rOpenSci. Does such a thing exist?