Tips for reading in multiple messy files

As an environmental statistician I often have to work with multiple messy data files (usually Excel sheets). I can spend a lot of time writing code to read in each of these files/sheets and then joining them: parsing errors are common, tables have multi-row headers that R functions can't easily read, and I regularly find mistakes in the data itself.
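
To give a concrete example of the kind of workaround I end up writing for multi-row headers, here's a rough sketch that reads the header rows and the data separately and stitches the names together (the file name and header depth are made up):

```r
library(readxl)

path <- "survey.xlsx"  # hypothetical file with a two-row header

# Read just the two header rows, without treating them as column names
hdr <- read_excel(path, n_max = 2, col_names = FALSE)

# Fill the top row forward across merged cells, then paste the two rows together
top <- unlist(hdr[1, ])
for (i in seq_along(top)[-1]) if (is.na(top[i])) top[i] <- top[i - 1]
bottom <- unlist(hdr[2, ])
nms <- trimws(paste(ifelse(is.na(top), "", top),
                    ifelse(is.na(bottom), "", bottom)))

# Read the body of the sheet, skipping the headers and applying the built names
dat <- read_excel(path, skip = 2, col_names = nms)
```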

I have thought about writing a general-purpose script for collecting and joining multiple tables of data, but I'm not sure it's feasible. I could also try to get the input files standardised, but it's hard to change other people's 20 years of Excel habits.
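
The core of such a script would probably look something like the sketch below, assuming (unrealistically) that the files share a layout; the directory, pattern, and skip value are all made up:

```r
library(readxl)
library(purrr)
library(dplyr)

# Hypothetical: every .xlsx file in data/ shares the same layout
files <- list.files("data", pattern = "\\.xlsx$", full.names = TRUE)

all_data <- files |>
  set_names(basename(files)) |>
  map(read_excel, skip = 1) |>     # made-up skip; in practice this varies per file
  bind_rows(.id = "source_file")   # keep provenance for chasing down errors
```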

Does anyone have tips and tricks they have developed for handling this situation? Maybe you force clients to provide the data as CSV? Maybe you convert each input table to a more R-friendly format (e.g. RDS)? One colleague I know tends to push all his data into a Postgres database. Any ideas?
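
The mechanics of those last two ideas are straightforward, at least; a rough sketch (the object name, paths, and connection details are all placeholders):

```r
library(DBI)

# Cache a cleaned table as RDS so the slow Excel parsing only runs once
saveRDS(clean_table, "cache/sites.rds")     # 'clean_table' is a placeholder
clean_table <- readRDS("cache/sites.rds")

# Or push it to Postgres (all connection details are placeholders)
con <- dbConnect(RPostgres::Postgres(),
                 dbname = "envdata", host = "localhost",
                 user = "me", password = Sys.getenv("PGPASSWORD"))
dbWriteTable(con, "sites", clean_table, overwrite = TRUE)
dbDisconnect(con)
```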

Has anyone used tidyxl, unpivotr, or jailbreakr with success?
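
For reference, the tidyxl + unpivotr pattern I've seen for multi-row headers looks roughly like this; the workbook layout here is invented (a merged outer header row, an inner header row, and row labels down the left):

```r
library(tidyxl)
library(unpivotr)
library(dplyr)

# Invented layout: a merged outer header row (year), an inner header row
# (measure), and site names down the left-hand column
cells <- xlsx_cells("messy.xlsx", sheets = "Sheet1")

tidy <- cells |>
  filter(!is_blank) |>
  behead("up-left", year) |>    # outer header, merged across several columns
  behead("up", measure) |>      # inner header row
  behead("left", site) |>       # row labels
  select(site, year, measure, value = numeric)
```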

I've read that the janitor package is great for this sort of thing.
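
From what I gather, a typical first pass with it looks something like this (file name made up):

```r
library(readxl)
library(janitor)

raw <- read_excel("messy.xlsx")   # hypothetical file

clean <- raw |>
  clean_names() |>                   # snake_case, deduplicated column names
  remove_empty(c("rows", "cols"))    # drop the all-NA padding Excel loves

# row_to_names() helps when the real header is buried a few rows down:
# clean_names(row_to_names(raw, row_number = 3))
```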


janitor is good; I use a few of its functions quite a bit.
