I am puzzling over how to preprocess data from multiple clients. I need some way to read each client's various data files and standardise them in preparation for analysis. The problem is that the data is somewhat messy, with missing rows, inconsistently named files and columns, and other Excel-related inconsistencies. I also want to run some diagnostics on each client's data to show where the issues are. I currently output a bunch of PNG figures and a text log for this, but am considering putting these in an Rmd report instead. Which of these approaches would you suggest?
- Use a common script with if/then/else branches to handle differences in the data (my current approach, which is getting messy).
- Use a common script with an input file containing client metadata to handle differences in the data.
- Require/force each client to standardise their data before I preprocess.
- Use a separate script/Rmd for each client to standardise their data.
- Use a separate script/Rmd for each client to push their data to a database.
- etc...
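To make the metadata-file option concrete, this is roughly what I have in mind. It is only a sketch, written in Python with the standard library to keep it self-contained (the same idea should carry over to R). The client names, column names, `CLIENT_METADATA` and the `standardise` function are all made up for illustration, and in practice the metadata would live in its own input file rather than in the script.

```python
import csv
import io

# Hypothetical per-client metadata: maps each client's own column names
# onto one standard schema. In practice this would be read from a file.
CLIENT_METADATA = {
    "client_a": {"columns": {"Date": "date", "Sales ($)": "sales"}},
    "client_b": {"columns": {"dt": "date", "revenue": "sales"}},
}


def standardise(client, raw_file):
    """Return (rows, issues): rows in the standard schema plus a diagnostics log."""
    mapping = CLIENT_METADATA[client]["columns"]
    reader = csv.DictReader(raw_file)
    issues = []

    # Flag expected columns that are absent from this client's file.
    present = reader.fieldnames or []
    for src in mapping:
        if src not in present:
            issues.append(f"{client}: missing column '{src}'")

    rows = []
    for record in reader:
        line_no = reader.line_num  # line in the source file, header is line 1
        # Rename to the standard schema; treat absent values as empty strings.
        row = {std: (record.get(src) or "").strip() for src, std in mapping.items()}
        empty = [col for col, value in row.items() if value == ""]
        if len(empty) == len(row):
            # Nothing usable on this line: log it and drop it.
            issues.append(f"{client}: line {line_no} is empty")
            continue
        for col in empty:
            issues.append(f"{client}: line {line_no} has no value for '{col}'")
        rows.append(row)
    return rows, issues


# Example with an in-memory file standing in for one client's export.
raw = io.StringIO("dt,revenue\n2020-01-01,10\n,\n2020-01-03,\n")
rows, issues = standardise("client_b", raw)
for row in rows:
    print(row)
for issue in issues:
    print(issue)
```

The appeal for me is that the client-specific knowledge sits in one table instead of being scattered across if/then/else branches, and the `issues` list could feed the diagnostics report directly.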
Any thoughts?