One of my use cases for R is creating reproducible scientific reports, e.g. from clinical trials. As a trial goes on, I re-run my R script to produce nice-looking LaTeX reports with key statistics, plots and so on. Usually my data comes in as a relational database: there is a dataframe for patient descriptives, a dataframe for each visit, then there might be a dataframe for drugs, etc. In other words, I do a lot of joining.
Some things can go wrong when you are (left) joining, for example, patient descriptives to visits:
- some visits might have a mistyped patient id and cannot be joined
- there might be patients for which there seem to be no visits at all (which might indicate a problem)
- for some visits there might be duplicated entries in the patients table
- ...and so on
As the data changes, some problems might show up while others might be solved. This is why I usually write a function that validates the join after doing it and prints stats on the percentage matched, duplicates, etc.
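A minimal sketch of what such a check could look like, assuming dplyr and assuming the key columns have the same names in both tables. The helper name `checked_left_join`, the `join_stats` attribute and the toy data are made up for illustration; they are not part of dplyr or of any existing package.

```r
library(dplyr)

# Hypothetical helper: left-join x to y on `by` and report basic match statistics.
# Assumes `by` is a character vector of key columns present in both x and y.
checked_left_join <- function(x, y, by) {
  # Rows of x without a partner in y (e.g. visits with a mistyped patient id)
  unmatched_x <- anti_join(x, y, by = by)
  # Rows of y that no row of x refers to (e.g. patients without any visit)
  unmatched_y <- anti_join(y, x, by = by)
  # Rows of y whose key repeats an earlier row (these multiply rows of x)
  n_dup_keys_y <- sum(duplicated(y[by]))

  joined <- left_join(x, y, by = by)

  stats <- list(
    n_x           = nrow(x),
    n_joined      = nrow(joined),
    n_unmatched_x = nrow(unmatched_x),
    pct_matched_x = 100 * (1 - nrow(unmatched_x) / nrow(x)),
    n_unmatched_y = nrow(unmatched_y),
    n_dup_keys_y  = n_dup_keys_y
  )

  message(sprintf(
    "left join: %d of %d rows in x matched (%.1f%%), %d rows in y unused, %d duplicated key rows in y, %d rows out",
    stats$n_x - stats$n_unmatched_x, stats$n_x, stats$pct_matched_x,
    stats$n_unmatched_y, stats$n_dup_keys_y, stats$n_joined
  ))

  # Keep the numbers attached to the result so a report can pick them up later
  attr(joined, "join_stats") <- stats
  joined
}

# Toy data: visit 9 has no patient, patient 3 has no visit, patient 2 is duplicated
visits   <- data.frame(id = c(1, 2, 9), visit = 1:3)
patients <- data.frame(id = c(1, 2, 2, 3), age = c(30, 40, 41, 50))

res <- checked_left_join(visits, patients, by = "id")
```

The summary is emitted with `message()` rather than `print()` so that it can be silenced with `suppressMessages()` when the report is knitted.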
Now the question: are you aware of any existing packages or good practices for validating joins?
Personally, I would also love to see an option in dplyr to request a short summary after a join.