Tips for managing messy data from multiple clients requiring the same analysis?

I am puzzling over how to preprocess data from multiple clients. I need some way to read each client's various data files and standardise them in preparation for the analysis. The problem is that the data is somewhat messy, with missing rows, variously named files and columns, and other Excel-related inconsistencies. Which approach would you suggest? I also want to run some diagnostics on their data to show where there are issues. I am currently outputting a bunch of png figures and a text log for this, but am considering putting these in an Rmd report.

  1. Use a common script with if/then/else to handle differences in the data (current approach - messy)
  2. Use a common script with an input file containing client metadata to handle differences in the data.
  3. Require/force each client to standardise their data before I preprocess.
  4. Use a separate script/Rmd for each client to standardise their data.
  5. Use a separate script/Rmd for each client to push their data to a database.
  6. etc...

Any thoughts?

Hi,

Data cleaning is often the most time-consuming part of data science, so I feel your frustration :slight_smile:

Given I don't know anything about the different datasets, it's very hard to come up with sound advice. Is it just the column names that are different, or do you have to prep some columns differently between clients before they can be compared (e.g. different formatting of dates, strings, etc.)?

If the differences are consistent, you can indeed create a metadata table once that holds instructions on how to pre-process data from each client. Here is an example of how you could standardise different column names:

# Change per client
myClient <- "client1"

# Pre-made conversion table: one column per client, mapping their
# column names onto the standard ones
dataConversion <- data.frame(
  standard = paste("Col", 1:4),
  client1  = LETTERS[1:4],
  client2  = sample(LETTERS, 4)
)

# Here you should read in the client's file (csv, xlsx, ...);
# dummy data with extra and shuffled columns for illustration
readData <- data.frame(X = runif(5), B = runif(5), C = runif(5), A = runif(5), D = runif(5))

# Standardise: keep only the client's known columns, in the standard order,
# then rename them
readData <- readData[, dataConversion[, myClient]]
colnames(readData) <- dataConversion$standard

readData

I'm happy to help out more, but I'd really need more details on the differences in data sets.

Hope this helps,
PJ

Thanks Pieter. I was really going for a more general discussion: lessons you've learned, etc.

In my particular project each client's data is in a folder. There is:

  • a folder with a shapefile (folder name, file name, and metadata vary)
  • a folder with many csv files, all in the same format thankfully.
  • several xlsm workbooks each with several relevant tabs (the names of workbooks are somewhat inconsistent but the tab names and column names are fixed - the actual data may have noise)
  • an xlsx workbook with a tab for that client (I copied this into each client folder to keep things together, and the tab names are not consistent)
  • several xlsx workbooks with a single tab (file names are inconsistent, data has some noise).

So there are several files with several tables of data that need to be cross referenced (and the keys are not always consistently entered). It's fairly obvious to a human what is what, but it gets messy trying to explain to R which files and tabs are needed and how to convert them.

Added to this, project data is being collected in real time and new data is added regularly, so a one-off manual tidy-up won't work since it would need to be repeated.
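
To make it concrete, here's a rough sketch of the kind of file-and-tab discovery code I find myself writing for each client folder (the paths, name patterns, and package choice are just illustrative):

library(readxl)  # for excel_sheets()

# Hypothetical layout: one folder per client under "data/"
client_dirs <- list.dirs("data", recursive = FALSE)

inventory <- lapply(client_dirs, function(dir) {
  # All csv files, wherever they sit inside the client folder
  csvs  <- list.files(dir, pattern = "\\.csv$", recursive = TRUE, full.names = TRUE)
  # Workbooks with inconsistent names: match loosely, then inspect the tabs
  books <- list.files(dir, pattern = "\\.xls[xm]$", recursive = TRUE, full.names = TRUE)
  tabs  <- lapply(books, excel_sheets)
  names(tabs) <- basename(books)
  list(client = basename(dir), csv_files = csvs, workbook_tabs = tabs)
})

# Eyeball what was found for each client before deciding how to read it
str(inventory, max.level = 3)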

Hi,

That is indeed a big mess of files :slight_smile:

If you'd like more general advice: I think that if you end up with a large number of files in different formats and need to keep track of them with other meta-files, a database solution might be more elegant.

I often use a SQLite database (file-based) into which I import the different files as different tables, creating links between them or splitting them into separate tables depending on the logic and structure. This way, you don't have to keep browsing different files and filenames, as you only have to put them in the database once. The great thing about SQLite is that you can then pull any part of the data you want into R and write results back, without needing to create more files or keep track of names, versions, etc.

You would have to write some code to populate the database for each client, but once the data is in there it should be straightforward. Designing a good database does take some time and skill, and I don't know if you are familiar with that, but it changed the way I handle data once I got the hang of it! It saves a lot of time once you know how to put things neatly in a database, and with the power of SQL and R you can quickly get any data you need from any source.
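
As a minimal sketch of that workflow with DBI and RSQLite (the file, table, and column names are just placeholders):

library(DBI)

# One file-based database for the whole project; the file is created if it doesn't exist
con <- dbConnect(RSQLite::SQLite(), "project_data.sqlite")

# Populate per client: read the standardised data and store it,
# tagged with the client id so everything lives in one table
readings <- read.csv("client1/readings.csv")
readings$client <- "client1"
if (!dbExistsTable(con, "readings")) {
  dbWriteTable(con, "readings", readings)
} else {
  dbWriteTable(con, "readings", readings, append = TRUE)
}

# From then on, pull exactly what you need with SQL instead of hunting through files
client1 <- dbGetQuery(con, "SELECT * FROM readings WHERE client = 'client1'")

dbDisconnect(con)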

Any thoughts?
PJ


Yes, a colleague recommended the database route recently; he uses PostgreSQL. I have not done this before and it's intimidating. Also, it makes it harder for the other team members to view the data.

What do you do when data is being added in real time, so the tidy-up-and-push action has to be repeated?

Hi,

I have no experience with PostgreSQL, but for that it seems you have to set up separate software (a database management system). SQLite is file-based, so no extra software is needed to access it from R, and https://sqlitebrowser.org/ is a neat little tool for exploring it more visually.

The point of databases is that they can cope with real-time updates and ensure that you are always working with the latest data, although this can become a bit complicated depending on how 'real-time' this is. SQLite is not made for handling lots of simultaneous updates and connections, but if you mean the data is updated from time to time, it should not be an issue.
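
For example (again just a sketch, assuming the hypothetical readings table from above), you can keep a small log of files already imported, so the push step can be re-run safely whenever new files appear:

library(DBI)
con <- dbConnect(RSQLite::SQLite(), "project_data.sqlite")

# Log of files already pushed, so re-running the import doesn't duplicate rows
if (!dbExistsTable(con, "imported_files")) {
  dbWriteTable(con, "imported_files", data.frame(file = character()))
}
done  <- dbGetQuery(con, "SELECT file FROM imported_files")$file
files <- list.files("client1", pattern = "\\.csv$", full.names = TRUE)

for (f in setdiff(files, done)) {
  new_rows <- read.csv(f)
  new_rows$client <- "client1"
  dbWriteTable(con, "readings", new_rows, append = TRUE)
  dbWriteTable(con, "imported_files", data.frame(file = f), append = TRUE)
}

dbDisconnect(con)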

Regarding the view for team members, you can either create summary tables (views) in the database that pull data from stored queries and update every time the DB updates, or you can use R to pull the data and generate reports in any format you like.
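
A view, for instance, is defined once and stays current as data comes in; something like this (a sketch, reusing the hypothetical readings table):

library(DBI)
con <- dbConnect(RSQLite::SQLite(), "project_data.sqlite")

# A stored query the team can also open in DB Browser for SQLite
dbExecute(con, "
  CREATE VIEW IF NOT EXISTS rows_per_client AS
  SELECT client, COUNT(*) AS n_rows
  FROM readings
  GROUP BY client
")

# Pull the always-current summary into R, e.g. from inside an Rmd report
summary_tbl <- dbGetQuery(con, "SELECT * FROM rows_per_client")

dbDisconnect(con)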

PJ


As a side note, PostgreSQL is a more complex solution, but it pays off handsomely in functionality; for example, in PostgreSQL you can define foreign data sources, even from an Excel file (although it's not a trivial task), and query them as if they were any other table in the database.
So my advice is: if you can afford the time to familiarize yourself with PostgreSQL, it would be time well spent and will pay off in the future.
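
To give a flavour, here is a sketch only, using PostgreSQL's built-in file_fdw wrapper for a csv file; an Excel source needs a third-party wrapper and more setup, creating the extension/server typically needs elevated privileges, and the connection details and column names here are placeholders:

library(DBI)

con <- dbConnect(RPostgres::Postgres(), dbname = "projectdb")

# Expose a flat file as a foreign table, then query it like any other table
dbExecute(con, "CREATE EXTENSION IF NOT EXISTS file_fdw")
dbExecute(con, "CREATE SERVER IF NOT EXISTS client_files FOREIGN DATA WRAPPER file_fdw")
dbExecute(con, "
  CREATE FOREIGN TABLE IF NOT EXISTS client1_readings (
    site_id text, reading numeric, read_date date
  ) SERVER client_files
    OPTIONS (filename '/data/client1/readings.csv', format 'csv', header 'true')
")

dbGetQuery(con, "SELECT site_id, avg(reading) FROM client1_readings GROUP BY site_id")

dbDisconnect(con)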

This has more to do with your organizational culture: if you can't convince them to standardize the process, you are going to have to come up with a lot of custom-tailored ETL scripts and schedule them.

