How do you deal with mixed/imperfect/terrible data?



I'm sure there's no "correct" answer to this, but I'd like to hear the opinions of those of you with more experience in data science and statistics.

We've done a study where the resulting data fit the description in the title perfectly. This is of course partly due to bad planning and unforeseen problems, but also due to the intrinsic nature of what we're studying. Our main problem is that we have two populations, one where each subject got a single test and one where each subject got two tests, and we'd like to include both in the same analysis. Another problem is that we have no true reference standard; instead, we're just comparing proportions of totals.

Articles from similar studies mention things like "used a more conservative p-value of 0.01 to account for repeated measures", GEE, bootstrapping, glmer and McNemar's test, or they simply ignore the issue altogether. My field is radiology, so these are just meant as examples of a more general problem.
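
To make one of those options concrete, here is a minimal sketch of how I understand a patient-level (cluster) bootstrap, written in Python with only the standard library. The record layout, the test labels "A" and "B", the toy data and the function name cluster_bootstrap_diff are all made up for illustration; none of them come from our study or from the articles. The idea is to resample whole patients rather than individual tests, so that two tests from the same patient always stay together.

```python
import random


def cluster_bootstrap_diff(records, n_boot=2000, seed=0):
    """Percentile bootstrap for the difference in positive proportions (A - B).

    records: list of (patient_id, test, positive) tuples, test is "A" or "B".
    A patient may contribute one or two rows (hypothetical layout).
    Returns (point_estimate, (lower, upper)) for an approximate 95% interval.
    """
    # Group rows by patient so that resampling keeps a patient's tests together.
    by_patient = {}
    for pid, test, positive in records:
        by_patient.setdefault(pid, []).append((test, positive))
    patients = list(by_patient)

    def diff(patient_ids):
        pos = {"A": 0, "B": 0}
        tot = {"A": 0, "B": 0}
        for pid in patient_ids:
            for test, positive in by_patient[pid]:
                tot[test] += 1
                pos[test] += int(positive)
        if tot["A"] == 0 or tot["B"] == 0:
            return None  # this resample happened to miss one test entirely
        return pos["A"] / tot["A"] - pos["B"] / tot["B"]

    rng = random.Random(seed)
    draws = []
    for _ in range(n_boot):
        # Draw patients with replacement, not individual test results.
        sample = rng.choices(patients, k=len(patients))
        d = diff(sample)
        if d is not None:
            draws.append(d)

    draws.sort()
    lower = draws[int(0.025 * len(draws))]
    upper = draws[min(len(draws) - 1, int(0.975 * len(draws)))]
    return diff(patients), (lower, upper)


# Toy data, invented for illustration: patients 0-39 got only test A,
# patients 40-99 got both tests.
records = []
for pid in range(100):
    records.append((pid, "A", pid % 2 == 0))
    if pid >= 40:
        records.append((pid, "B", pid % 3 == 0))

estimate, (low, high) = cluster_bootstrap_diff(records)
print(estimate, low, high)
```

Resampling at the patient level is only one way to respect the dependence between tests, and it says nothing about whether the two populations are otherwise comparable.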

I'm not looking for specific answers to my problems, just a general discussion of what you would do in situations with messy data like this, and the reasoning behind it.

  • Why do you choose one method over the other?
  • What are the consequences of choosing one method over the other?
  • How should an aspiring data scientist/statistician deal with this?