What's your workflow for gathering messy data from multiple sources?

I am working on a project that pulls in data from a number of spreadsheets and CSV files scattered across different directories (Windows 10). I have a script that gathers this data and creates RDS files used by my Shiny app. This script is getting big and messy.

For one class of files I was able to make a "metadata" table where each row described a data item to be obtained (with its directory path, filename, column name, new name, and numerical scaling factor). Then I was able to gather the data into a tidy dataframe using a loop. This only worked because these files were all in "wide" format (one variable to a column, one observation to a row).
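To make the idea concrete, here is a minimal, self-contained sketch of that metadata-driven gather loop. The metadata columns (dir, file, col, new_name, scale) and the assumption that every raw file has an "id" column are illustrative, not the poster's actual schema, and the demo files are written to a temporary directory so the sketch runs anywhere.

```r
# make two small demo files in a temporary directory
d <- tempfile("demo")
dir.create(d)
write.csv(data.frame(id = 1:3, temp_f = c(70, 71, 72)),
          file.path(d, "a.csv"), row.names = FALSE)
write.csv(data.frame(id = 1:3, TempF = c(140, 142, 144)),
          file.path(d, "b.csv"), row.names = FALSE)

# hypothetical metadata table: one row per data item to be obtained
meta <- data.frame(
  dir      = c(d, d),
  file     = c("a.csv", "b.csv"),
  col      = c("temp_f", "TempF"),   # column name in the raw file
  new_name = c("temp", "temp"),      # tidy variable name
  scale    = c(1, 0.5),              # numerical scaling factor
  stringsAsFactors = FALSE
)

# loop over the metadata, harmonising each item into one tidy data frame
gathered <- do.call(rbind, lapply(seq_len(nrow(meta)), function(i) {
  m   <- meta[i, ]
  raw <- read.csv(file.path(m$dir, m$file))
  data.frame(source   = m$file,
             id       = raw$id,
             variable = m$new_name,
             value    = raw[[m$col]] * m$scale,
             stringsAsFactors = FALSE)
}))

head(gathered)
```

The nice property of this pattern is that adding a new source file is a one-row change to the metadata table, not a new block of code.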

For another class of files the formats vary: there are typos and missing data, and sometimes multiple ID columns that are inconsistent or have missing cells.
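One common repair pattern for that kind of mess is to coalesce the partially filled ID columns and then regularise case and whitespace. A base-R sketch, with made-up column names (sample_id, alt_id):

```r
# toy example of two inconsistent, partially filled ID columns
messy <- data.frame(
  sample_id = c("a01 ", NA,    "A03"),
  alt_id    = c(NA,    "A02",  NA),
  value     = c(1.5,   2.0,    2.5),
  stringsAsFactors = FALSE
)

# prefer sample_id, fall back to alt_id where it is missing
id <- ifelse(is.na(messy$sample_id), messy$alt_id, messy$sample_id)

# regularise: trim stray whitespace and force upper case
messy$id <- toupper(trimws(id))

messy[, c("id", "value")]
```

With dplyr the first step collapses to `coalesce(sample_id, alt_id)`, but the logic is the same.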

In the past I have always done the file repair within my preprocessing script, which gets messy. But now I am wondering whether there should be a separate "file repair" step, done in a separate R script or even pushed back to be done manually on the Excel originals (e.g. formally stating that I will only accept data files formatted like THIS...).
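The "I will only accept files formatted like THIS" idea can be made executable as a small contract check run on import, so non-conforming files are rejected loudly instead of being silently patched. A sketch, with a hypothetical function name and column list:

```r
# reject a data frame unless it has the agreed-upon columns
check_contract <- function(df, required_cols) {
  missing <- setdiff(required_cols, names(df))
  if (length(missing) > 0) {
    stop("File rejected; missing required columns: ",
         paste(missing, collapse = ", "))
  }
  invisible(TRUE)
}

good <- data.frame(id    = 1:2,
                   date  = c("2021-01-01", "2021-01-02"),
                   value = c(3, 4))
check_contract(good, c("id", "date", "value"))   # passes silently
```

Files that fail the check go back for manual repair at the source, which keeps the preprocessing script free of one-off fixes.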

Do you have a workflow for this kind of situation: combining data from multiple files with different formats and errors?

I'll add a thought to my own question. How do you decide whether to clean data at its source rather than on import to your project? I suggest cleaning at the source:

  1. If cleaning the data at its source will benefit other users of the data.
  2. If there is missing or miscoded or inconsistent information (e.g. ID fields) that needs to be added/fixed/regularised.
  3. If the data format is hard to work with, or requires a lot of reorganisation.
  4. If the data is very big and we only need a small subset.

Here's a little template for cleaning data at its source. It runs in RStudio but doesn't require a project or setwd(). The advantage is that you can just put it in the directory with the data file, double-click to open it in RStudio, and source it without worrying about the working directory.

# script to clean a data file
# script should run in RStudio without needing an RStudio project or setwd()

library(rstudioapi)
library(stringr)
library(readr)

# find the directory of this script (only works interactively in RStudio)
path <- dirname(rstudioapi::getActiveDocumentContext()$path)
original <- "rawdatafile.csv"
cleaned <- str_replace(original, "\\.csv$", "_clean.csv")

# read raw file
fname <- file.path(path, original)
message("Reading original data file: ", fname)
raw <- read_csv(fname, col_types = cols())

# clean the data file here
temp <- raw
# ...
# ...
# ... 
clean <- temp

# save clean version
fname <- file.path(path, cleaned)
message("Writing cleaned data file: ", fname)
write_csv(clean, fname)

I'm not sure if this answers your question, but on a couple of projects I have been using Python instead of R to move, rename, format, and combine files, and to upload to a Google Sheet, directly ahead of the final analysis in RStudio. I found it easier for this particular application.

To your point, because my R abilities are still limited, on the second project I ended up cleaning the data at the source. That project involved exporting a second version of a file that had device breakouts and was problematic to flatten again for my needs. Ultimately it proved easier to let the exporting source, an AdManager instance, export the data in a more task-relevant format. I may find a way around this in future, but for now I export two files, one for each aspect of that specific project.
