Data checking vs database

Hello kind people of the internet.

I need to compare data from an analysis form vs the data we have in a database, to check if our people have made any mistakes typing data into the database.

Each row have 8 variables (columns) only 1 of them is the same in all of the analysis form, the unique ID. The analysis form have 41 parametres which have been tested for, so 41 rows.

In the database there are 42 total columns and 700+ row because it has all the analysis forms combined.

I have tried to do it with validate package but it can only check 1 column at a time. I need to and entire row with 8 columns with the database and see if all the columns are representet in the database, and if not i would like to know with column doesn't match the database.

Can you help me?

please share your attempt (code) with representative data in the form of a reprex.
FAQ: How to do a minimal reproducible example ( reprex ) for beginners

if you want to check exact matches across all rows for df against database then a simple join would serve.

library(tidyverse)
left_join(df,database %>% mutate(hit=TRUE)) %>% mutate(hit=ifelse(is.na(hit),FALSE,hit))
#Joining, by = c("UniqueID", "Species", "Organ", "Parametre", "Unit", "Result")
  UniqueID Species  Organ Parametre     Unit Result   hit
1        1  Aborre  Liver     Chem1 µg/kg vv      1 FALSE
2        1  Aborre  Liver     Chem2 µg/kg vv      3 FALSE
3        1  Aborre Muscle     Chem3 µg/kg vv      5  TRUE
4        1  Aborre Muscle     Chem4 ng/kg vv      7 FALSE
5        1  Aborre Muscle     Chem5 ng/kg vv      4 FALSE

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

This is the test data i made to try the code

df <- data.frame(stringsAsFactors = FALSE,
UniqueID = c(1),
Species = c("Aborre"),
Organ = c("Liver","Liver","Muscle","Muscle","Muscle"),
Parametre = c("Chem1","Chem2","Chem3","Chem4","Chem5"),
Unit = c("µg/kg vv","µg/kg vv","µg/kg vv", "ng/kg vv","ng/kg vv"),
Result = c(1,3,5,7,4)
)

database <- data.frame(stringsAsFactors = FALSE,
UniqueID = c(rep(1,5),rep(2,5),rep(3,5),rep(4,5)),
Species = c("Aborre"),
Organ = c("Liver","Liver","Muscle","Muscle","Muscle"),
Parametre = c("Chem1","Chem2","Chem3","Chem4","Chem5"),
Unit = c("µg/kg vv","µg/kg vv","µg/kg vv", "ng/kg vv","ng/kg vv"),
Result = c(3,2,5,9,8)
)

rules <- validator(Organ == Data_reference$Organ)
cf <- confront(df, rules, ref = list(Data_reference = database))
summary(cf)

This code only checks for one of the variables, but i need to check for all the varibles combined, so if UniqueID is =1 then i only need to the the rows in database where the uniqueID is = 1.

So can i make a code that checks UniqueID = 1 then Species = Aborre, then Organ = Liver, then Parametre = Chem1 and so on...

the forcats package, contains fct_collapse() that can be used to collapse multiple factor levels down. so you should apply this to both datasets, so that you can test the canonical chemical name always ( at least when testing).

This is exactly what i was looking for!!

Now there are some variations. As the parametre is a specific chemical we are testing for, but there are different ways to write the same chemical, is there a way for me to add different variations for the code to register as a hit.

Chem1 might also be called Quicksilver or Hg or hg, so all of these answer are correct.

Hope you can help me with this aswell.