I am working on a project that pulls in data from a number of spreadsheets and CSV files scattered across different directories (Windows 10). I have a script that gathers this data and creates the RDS files used by my Shiny app. This script is getting big and messy.
For one class of files I was able to make a "metadata" table where each row describes a data item to be obtained (with its directory path, filename, column name, new name, and numerical scaling factor). Then I was able to gather the data into a tidy dataframe using a loop. This only worked because these files were all in "wide" format (one variable per column, one observation per row).
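To make that concrete, here is a minimal sketch in base R of the metadata-driven loop (the file names, column names, and scale factors are made up for illustration): each row of `meta` says where one data item lives, what to rename it to, and how to scale it.

```r
# Stand-ins for the real spreadsheets, written to a temp directory
root <- tempfile("demo")
dir.create(file.path(root, "siteA"), recursive = TRUE)
dir.create(file.path(root, "siteB"), recursive = TRUE)
write.csv(data.frame(Q_cfs = c(10, 20)),
          file.path(root, "siteA", "flow.csv"), row.names = FALSE)
write.csv(data.frame(discharge = c(1, 2)),
          file.path(root, "siteB", "flow.csv"), row.names = FALSE)

# One row per data item: path, file, source column, new name, scale factor
meta <- data.frame(
  dir      = file.path(root, c("siteA", "siteB")),
  file     = "flow.csv",
  column   = c("Q_cfs", "discharge"),
  new_name = c("flow_siteA", "flow_siteB"),
  scale    = c(0.0283, 1),          # e.g. convert cfs to m^3/s
  stringsAsFactors = FALSE
)

gather_items <- function(meta) {
  pieces <- lapply(seq_len(nrow(meta)), function(i) {
    m  <- meta[i, ]
    df <- read.csv(file.path(m$dir, m$file))
    # pull one column, rename it, apply the scale factor
    data.frame(variable = m$new_name,
               value    = df[[m$column]] * m$scale)
  })
  do.call(rbind, pieces)            # one tidy frame: variable, value
}

tidy <- gather_items(meta)
```

The same pattern works with `purrr::pmap_dfr()` in place of `lapply()`/`rbind()` if you are already in the tidyverse.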
For another class of files the formats vary: there are typos and missing data, and sometimes multiple ID columns that are inconsistent or have missing cells.
In the past I have always done the file repair within my preprocessing script, which gets messy. Now I am wondering whether there should be a separate "file repair" step, done in its own R script or even pushed back to be done manually on the Excel originals (e.g. formally stating that I will only accept data files formatted like THIS...).
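One way that separate accept-or-reject step could look (just a sketch; the required columns and ID pattern here are invented) is a small validation function that checks each incoming file against a formal spec and reports every violation, so the fix can happen in the Excel original rather than deep inside the preprocessing script:

```r
# Hypothetical acceptance spec: required columns and an ID format
spec <- list(
  required_cols = c("site_id", "date", "value"),
  id_pattern    = "^S[0-9]{3}$"     # e.g. site IDs must look like "S001"
)

# Return a character vector of problems; character(0) means "accepted"
validate_file <- function(df, spec) {
  problems <- character(0)
  missing  <- setdiff(spec$required_cols, names(df))
  if (length(missing) > 0)
    problems <- c(problems,
                  paste("missing columns:", paste(missing, collapse = ", ")))
  if ("site_id" %in% names(df)) {
    bad <- !grepl(spec$id_pattern, df$site_id)
    if (any(bad))
      problems <- c(problems,
                    paste("bad site_id in rows:",
                          paste(which(bad), collapse = ", ")))
  }
  problems
}
```

Running this over every file (and refusing to build the RDS while any problems remain) keeps the repair work visible and out of the gathering loop. Packages like `validate` or `pointblank` formalize exactly this idea if you outgrow a hand-rolled check.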
Do you have a workflow for combining data from multiple files with different formats and errors?