Summary / warning after join

dplyr

#1

One of my use cases for R is creating reproducible scientific reports, e.g. from clinical trials. As a trial goes on, I re-run my R script to produce nice-looking LaTeX reports with key statistics, plots and so on. Usually my data comes in as a relational database: there is a dataframe for patient descriptives, there is a dataframe for each visit, then there might be a dataframe for drugs, etc. In other words, I do a lot of joining.

Some things can go wrong when you are (left) joining, for example, patient descriptives to visits:

  • some visits might have a mistyped patient ID and cannot be joined
  • there might be patients for whom there seem to be no visits at all (which might indicate a problem)
  • for some visits there might be duplicated entries in the patients table
  • ...and so on

As the data changes, some problems might show up while others get solved. This is why I usually write a function that validates the join after doing it and prints stats on the matched percentage, duplicates, etc.
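To make that concrete, here is a minimal sketch of what such a validation function could look like. The name `join_report`, the toy `patients` and `visits` tables and the `patient_id` column are all made up for illustration, and the sketch assumes the key column has the same name in both tables.

```r
library(dplyr)

# Hypothetical helper (not part of dplyr): summarise how well x matches y on `by`.
# Assumes the key column(s) in `by` have the same names in both tables.
join_report <- function(x, y, by) {
  unmatched_x <- anti_join(x, y, by = by)  # rows of x with no partner in y
  unmatched_y <- anti_join(y, x, by = by)  # rows of y that x never refers to
  dup_keys_y <- y %>%
    count(across(all_of(by)), name = "n") %>%  # occurrences of each key in y
    filter(n > 1)                              # keys that appear more than once
  list(
    matched_pct       = 100 * (1 - nrow(unmatched_x) / nrow(x)),
    unmatched_x       = nrow(unmatched_x),
    unmatched_y       = nrow(unmatched_y),
    duplicated_keys_y = nrow(dup_keys_y)
  )
}

# Made-up example data: visit 4 has a mistyped ID (5), patient 2 is duplicated,
# and patients 3 and 4 have no visits.
patients <- data.frame(patient_id = c(1, 2, 2, 3, 4), age = c(34, 51, 51, 47, 62))
visits   <- data.frame(patient_id = c(1, 1, 2, 5), visit = c(1, 2, 1, 1))

report <- join_report(visits, patients, by = "patient_id")
str(report)
```

With this toy data, 75% of the visits find a patient, one visit is unmatched, two patients have no visits and one patient ID is duplicated.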

Now the question: are you aware of any existing packages / good practices for joining validation?

I personally would also love to see an option in dplyr to request a short summary after a join :)


#2

Take a look at functions in tidyr. Specifically, tidyr::crossing, tidyr::nesting and all of that.

They let you detect some of the problems that you've mentioned.


#3

Thanks! I am aware of tidyr, and I am able to detect all of the problems I described. I just wonder whether there is a package that does this routinely / automagically after each join.
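What I have in mind is roughly a thin wrapper around the join itself. The sketch below is only an illustration: `left_join_checked` is a hypothetical name, not an existing dplyr function, the example tables are made up, and it assumes the key column has the same name on both sides.

```r
library(dplyr)

# Hypothetical wrapper (not a dplyr function): left_join() plus a short summary.
# Assumes the key column(s) in `by` have the same names in both tables.
left_join_checked <- function(x, y, by) {
  n_unmatched <- nrow(anti_join(x, y, by = by))  # left rows without a match
  n_dup_rows  <- sum(duplicated(y[by]))          # repeated key rows on the right
  message(sprintf("left_join on %s: %d of %d left rows matched",
                  paste(by, collapse = ", "), nrow(x) - n_unmatched, nrow(x)))
  if (n_unmatched > 0) {
    warning(sprintf("%d left row(s) have no match", n_unmatched), call. = FALSE)
  }
  if (n_dup_rows > 0) {
    warning(sprintf("%d duplicated key row(s) in the right table", n_dup_rows),
            call. = FALSE)
  }
  left_join(x, y, by = by)
}

# Made-up example data: one visit has an unknown ID, one patient is duplicated.
patients <- data.frame(patient_id = c(1, 2, 2, 3, 4), age = c(34, 51, 51, 47, 62))
visits   <- data.frame(patient_id = c(1, 2, 3, 5), visit = c(1, 1, 1, 1))

joined <- suppressWarnings(left_join_checked(visits, patients, by = "patient_id"))
```

Called without `suppressWarnings()`, the wrapper prints the one-line summary and then warns about the unmatched visit and the duplicated patient row.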


#4

It's a bit out of scope for dplyr, but there are two packages, fuzzyjoin and ruler, that come to mind as potentially being of interest.

At the bottom of the ruler README there's an Other Packages for Validation and Assertions section that has several other packages that might be to your liking (I simply haven't tried all of them out).


#5

Both are very interesting, thanks!


#6

Take a look at this talk from eRum:

It is about this package: https://github.com/data-cleaning/validatetools

From what I understand, it won't do anything automagically, but at least you can set up your logic once and then check it all the time.