Manual data collection

I'd like to get a discussion going on what is, in my opinion, a rarely discussed topic - gathering data by hand for use in the tidyverse. A lot of the data I work with is collected by hand - either smaller one-time projects or recurring data collection over months/years.

So far it's been a pretty messy mixture of Excel, Google Sheets, editing csv files in a text editor and sometimes using add_row() in R. It works but I'd love to simplify and streamline the process. The googlesheets package seems very useful but I haven't yet used it extensively.

So here are my more specific hypothetical questions:

  1. If you were to gather data on the t-shirts that you own, say, colour, size, age, if there's something printed on it etc - how would you do it?
  2. If you were to write down the colours of the first 50 cars driving past your window every Monday for a year - how would you do it?

I'm assuming manual data entry for these. Sure, technically you could take pictures of the shirts and ML some of the data or set up a webcam to track the car colours. But let's say we have to do it by hand.

Any other tips, tricks or experiences you have to share, I'd love to hear that too.

All the best! :+1::+1::+1:

5 Likes

I also have to manually enter a lot of data for my job, it's a pain. What I almost always end up doing, just to keep myself organized, is to create Excel Workbooks. Excel makes it so easy to type in data. Then to get it into R, I either use the readxl or XLConnect package depending on if it is multiple sheets or not. I've been trying to get into the habit of creating a single sheet with a variable to distinguish entries, the way I would when setting up multiple sheets in the same workbook. I know it's not all in R, but I find that for entering lots of data that you have to do by hand, I've had the best luck with Excel.

3 Likes

I follow the same process as @dlsweet. I am involved in two projects where I use Excel to manually enter records. I then import that data into R using readxl when I need to work with that data.

An important consideration might be whether you're working with others on the data entry. The Excel solution might not be the best choice if this is the case. Since I only work on these projects myself, I like Excel's interface the best.

2 Likes

One more vote here for the @dlsweet method. A lot of my data comes from data entry into things like Excel sheets, SurveyMonkey, or similar, and most of it is done by students or lower level staff, so the data entered often isn't uniformly formatted. Annoying things like spaces, misspellings, etc. crop up and have to be dealt with.

My two approaches have been 1) try to standardize data entry (i.e., where I can, limit the amount of free text that can be entered, so for example create numeric coding to represent expected responses) and 2) easy pre-processing in Excel (e.g., if I see the same spelling mistake being made frequently, do a quick find-replace before importing into R, or for another example - standardizing date/times before importing the .csv into RStudio makes my life SO much better).

The first approach is a mixed bag - generally students or staff don't "get" why data needs to entered in a standard manner, or they try but aren't meticulous. Using something like SurveyMonkey helps this a little bit, but it takes more time than just setting up a spreadsheet for data entry. Another option would be to format each column of the spreadsheet to allow only certain value types to be entered, but I've found that more confusing and ultimately not helpful from the perspective of the people doing the data entry.

Admittedly the second approach could be handled within R, but I'm still hanging out on the learning curve of R and sometimes the dumb way is also the fastest way. :wink:

Interested to hear others' solutions!

2 Likes

If you can really nail down the possible variables (columns) that might go in your data collection sheet (both of your hypothetical questions work for this), then I've had luck with the idea of creating a Google Forms front end that links to a Google Sheet backend automatically.

I did this once to get instant feedback on a presentation I gave. It had features such as a 1-5 scale on performance (input validation automatically!), other questions with drop down or multiple choice answers, and then some open ended text box answers where the person answering had freedom to enter more custom feedback.

I could imagine for the first hypothetical question having a form with 4 questions

  • colour - could be a multiple choice answer with one last box for "Other" where you can enter a color for strangely colored shirts
  • size - pretty easy (small, medium, large, etc)
  • age - "linear scale" option works well for this. 1-10 years or something.
  • Something printed on it - Either TRUE/FALSE or maybe a custom text box that allows you to enter what is printed on it.

As soon as you submit the form it is automatically added as a new row on the google sheet! Its pretty fun to watch answers roll in. This especially works well if you have multiple people entering answers and you want to make sure answers are in a consistent format and you also don't have to worry about people overwriting each other.

6 Likes

I'll heartily endorse the @davis method - I use this with students who are collecting highly structured data. One entry on the form per observation, export the results to Google Sheets, and then use the google sheets R package to read it all directly into a data frame.

I also use it for teaching - I collect entry tickets with questions about how the semester is going, how specific assignments went, etc.

1 Like

I wonder if R has a package that's like a Shiny version of Google Forms?

1 Like

I have multiple full time paid staff doing this (70% for commercial clinical trials, 30% for grant-based projects). The standards for electronic data collection (EDC) when money is on the line are much, much higher than this. Ideally we need a good open source EDC solution. This should include:

  • setting data type for each field
  • setting rules for range for each field, with human-readable rejection messages
  • active rejection of incorrect data types or out of range data
  • ability to audit all data entry, all changes to data - who entered or changed data, and when
  • ability to annotate each data field - with notes on why data was changed, source of new data, reason for change, etc.
  • ability to do double data entry and compare the 2 datasets.

Closest thing to this that I know of is REDCap, which is...OK.

If your organization has the resources to support REDCap, I highly recommend it. When I was working at Cleveland Clinic, we adopted REDCap and my data formatting time dropped by more than half. After figuring out how to make use of the API, the data formatting time dropped to almost zero.

CRAN has two packages for reading data from REDCap, redcapAPI, and REDCapR

3 Likes

Thanks for the heads up. I was also thinking about trying out the validate and datamaid packages for data checking. But redcapAPI and REDCapR will be super helpful.

I am 110% with @nutterb and regularly stand on his shoulders to get data out of REDCap into R using his package.

As an avid REDCap user i can confirm that it is a simple, secure tool to build forms to collect data.

When doing entry for yourself by yourself a Google doc or an Excel spreadsheet is probably fine. Like others have said you can put some data validation on the cells to keep yourself honest and within the limits of the formatting.

It is when you are building spreadsheets to share. If that is the case a Google form is probably the best approach for free. You could do a more complicated version of the same with access and a form. Beyond that you get into theore commercial forms of data entry that have been mentioned.

2 Likes

@martj42, Pre-select the features you want to document, with examples to clarify borderline cases. For instance, with cars, figure out in advance how you're going to handle two-tone paint jobs, spray-paint images, and pearlescent, multi-color effect paints. Create a form that allows quick but accurate data capture. Trial it with multiple people to improve the design. Make sure that if anyone else is collecting the data, they're trained in the process and the assessment. However many assessors you have, perform an attribute agreement assessment analysis to make sure that you obtain meaningfully consistent results from your measurement system.

Great advice. This is precisely the protocol we followed when I worked with chimps who used ASL (weird, I know)— the need to regularize behavioural observations was obvious there, but it was a :bulb: moment to implement the same thing for manual (digital) data collection in my wet lab in college.

1 Like

I use Shiny to capture data from other people in a structured manner all the time. It works really well as you can generate graphs/views of the data in real time which motivates others to provide their input :slight_smile:

Inspired by Rstudio Conf 2018 and the usethis package 2018_02_03-rstudio_conf-You_can_make_a_package_in_20_minutes.utf8

this is my attempt at scaffolding a shiny app for structured data capture.

And spawned this question on how best to validate forms in Shiny using JS

2 Likes

Google Forms puts its answers into googlesheets, so you can use googlesheets package to import.

2 Likes