Data QC Ideas for a New Scientist

I've recently found myself in a position where I have to QC large amounts of data. As some background, I did some work with R in college for my Thesis, however it has been a few years since I have had the need for it, so for the most part I am getting back into it. Mostly I'm looking for ideas, thoughts, and packages to use on how to go about what I have below. Any input is appreciated.

So generally, I will have 2 excel files (Working File and Final Report), from which I only really care about 2 columns, say "id" and "Result" from each spreadsheet. In my mind what I want to do is import just those columns into 2 data sets (keeping "id" and "Result" paired...i.e. each "id" stays with it's "Result"), then compare the two for any differences in "Result". The "id" should be identical for each spreadsheet (for if not, that is an entirely different problem). What I would hope to get is a return that either shows all matches, or spits out "id"/"result" that are inconsistent between the two documents.

Would this be possible to include a small margin of error, something small like 1% just to accommodate rounding, if that had happened between the documents - hopefully not, but in either case a thought that occurred to me.

I really do appreciate any pointers as I fiddle around with this.

1 Like

Hi,

Welcome to the RStudio community!

Here is an example of what you can do. I used dummy data, but you will need to load in your Excel files of course.

library(tidyverse)

#Get the data (this is dummy data)
#In your case you read from excel with for example readxl::read_excel()

file1 = data.frame(
          id = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L),
          colA = c("a","b","c","d","e","f","g","h","i","j"),
      result = c(0.936,0.031,0.57,0.882,0.847,0.981,
                 0.854,0.66,0.274,0.578)
)

file2 = data.frame(
  id = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 11L, 9L, 10L),
  result = c(0.936,0.031,0.57,0.882,0.847,0.981,
             0.854,0.66,0.272,0.578)
)

#Only keep columns needed
file1 = file1 %>% select(id, result)


#Join the data together (full join)
allData = full_join(file1, file2, by = "id")
allData
#>    id result.x result.y
#> 1   1    0.936    0.936
#> 2   2    0.031    0.031
#> 3   3    0.570    0.570
#> 4   4    0.882    0.882
#> 5   5    0.847    0.847
#> 6   6    0.981    0.981
#> 7   7    0.854    0.854
#> 8   8    0.660       NA
#> 9   9    0.274    0.272
#> 10 10    0.578    0.578
#> 11 11       NA    0.660

#Compare results
allData %>% 
  mutate(identical = result.x == result.y)
#>    id result.x result.y identical
#> 1   1    0.936    0.936      TRUE
#> 2   2    0.031    0.031      TRUE
#> 3   3    0.570    0.570      TRUE
#> 4   4    0.882    0.882      TRUE
#> 5   5    0.847    0.847      TRUE
#> 6   6    0.981    0.981      TRUE
#> 7   7    0.854    0.854      TRUE
#> 8   8    0.660       NA        NA
#> 9   9    0.274    0.272     FALSE
#> 10 10    0.578    0.578      TRUE
#> 11 11       NA    0.660        NA


#Compare results with rounding
allData %>% 
  mutate(identical = round(result.x,2) == round(result.y, 2))
#>    id result.x result.y identical
#> 1   1    0.936    0.936      TRUE
#> 2   2    0.031    0.031      TRUE
#> 3   3    0.570    0.570      TRUE
#> 4   4    0.882    0.882      TRUE
#> 5   5    0.847    0.847      TRUE
#> 6   6    0.981    0.981      TRUE
#> 7   7    0.854    0.854      TRUE
#> 8   8    0.660       NA        NA
#> 9   9    0.274    0.272      TRUE
#> 10 10    0.578    0.578      TRUE
#> 11 11       NA    0.660        NA

Created on 2021-08-11 by the reprex package (v2.0.0)

All of this was done by functions from the tidyverse (the dplyr package). You can learn more about it here.

Hope this helps,
PJ

.

2 Likes

PJ,

WOW! This set me ahead far of what I imagined. Immensely helpful. I was thinking in a different direction but this works so much more efficiently. I'm used to converting the excel files into csv so I went that route, which still works just fine with what you suggested.

The only thing I cannot get to work is the rounding bit. I get this error:

Error: Problem with mutate() input identical.
x non-numeric argument to mathematical function
i Input identical is Result.x == round(Result.y, 1).

I think the issue, and it was my fault, was not indicating that file 1 would be the original data, and file 2 would be the final report which could potentially already be rounded. For arguments sake say to 1 decimal. Is there a way to compare the 2 determine if they are the same? Perhaps within like a 1% difference or something to account for 1 set of values being rounded?

This would be worst case scenario, as hopefully I can get the unrounded data and then just use the rounding formula to export reports, but sometimes I'm not so lucky.

Again, I appreciate the help

1 Like

Hello,

It's very hard to guess the issue when you are now using your own data. I assume the problem is not happening when you run the code / data I provided.

It's best to share a reprex of the work you have done with your data so far, and we can then take a look at the error. It should be something easy to fix as I guess there is a data type issue (non-numeric argument to mathematical function).

A reprex consists of the minimal code and data needed to recreate the issue/question you're having. You can find instructions how to build and share one here:

Try to create one with the guide above, this will really help if you need this forum in future again as well!

PJ

2 Likes

As indicated in the answer above you should run the rounding for both sides. You already indicated "it might be rounded already".
So rounding a rounded number doesn't change the outcome, but not rounding the not rounded number does.
Alternatively you can calculate the range you want it to match on your own (see last line)

x = 1.456
y = 1.46

x == y                       # FALSE
round(x,2) == round(y, 2)    # TRUE
round(x,2) == y              # TRUE, but only when values are actually rounded, so avoid this!
y >= x * 0.9 & y <= x * 1.1  # y within 1% window
2 Likes

PJ - Thanks for the help! I did realize that my error was coming from the fact that some entries consisted of something like >24000 or a NaN. Since it's a mathematical function it was shooting errors obviously because they are not numeric. So the issue was not with the code, but rather the dataset. As a note, now that you mentioned it and I took a look, reprex is definitely what I will need for posting code in the future! Pretty straight forward instructions.

Matthias - That's a good alternative, however I imagine I would get much the same issue because of the dataset.

I'm now trying to work out a way to still use the rounding function but to ignore non-numeric entries.

Thanks again to you both for your input, it has been highly appreciated

2 Likes

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.