Hello,
A data cleaning process I often find myself doing is replacing missing or incorrect data in one dataset using another reference dataset. I feel like I'm missing some well established methods or tools around this practice.
I do something like this reprex:
library(tidyverse)
#make some bad data
bad_cars <- mtcars %>%
rownames_to_column(var = "brand_model") %>%
select(brand_model, cyl, gear, carb) %>%
mutate(cyl = ifelse(cyl>4, cyl-1, cyl), #making some bad data
carb = ifelse(carb<2, NA, carb)) %>% #making some bad data
as.tibble()
#reference data
good_cars <- mtcars %>%
rownames_to_column(var = "brand_model") %>%
select(brand_model, cyl, gear, carb) %>%
as.tibble()
#function to correct bad data with reference data using a 'mutating' join
mutate_join <- function(x, y, by, var){
var = enquo(var)
var_ref = sym(str_c(quo_name(var), "_ref"))
x %>%
left_join(select(y, by, !!var), by = by, suffix = c("", "_ref")) %>%
mutate(!!quo_name(var) := ifelse(!!var == !!var_ref, !!var, !!var_ref)) %>% #different data
mutate(!!quo_name(var) := ifelse(is.na(!!var), !!var_ref, !!var)) %>% #missing data
select(-matches("_ref")) #remove ref col
}
#demonstrate
corrected_cars <-
bad_cars %>%
mutate_join(good_cars, by = "brand_model", cyl) %>%
mutate_join(good_cars, by = "brand_model", carb)
corrected_cars
#confirm
identical(corrected_cars, good_cars)
This could be done a lot better I'm sure. Are there any good practices/packages people can recommend around this process of selectively correcting a df based on a reference df?
Cheers,
Ben