Rules for equivalent strings when joining datasets

Is there a package or method to set some 'rules' for R to follow when using inner_join?
I have a situation where the id for some observations is written differently in one dataset than in the other. For example, "Example A Twp" in dataset 1 corresponds to "Example A Township" in dataset 2. I would want to set the rule "Twp" == "Township", along with some other rules.

Here is some example data.

Setting some sort of equivalence rule would work better for me than simply using str_remove(" twp"), since some observations in the data have similar names.

name <- c("hamilton twp", "wayne", "berwick", "east wenatchee",
          "north bergen", "toms river", "parsippany-troyhills")
value1 <- c(1, 5, 2, 4, 2, 5, 2)

data1 <- data.frame(name, value1)

name <- c("hamilton", "wayne township", "berwick borough",
          "east wenatchee city", "north bergen township",
          "toms river township", "parsippany-troyhills twp")
value2 <- c(1, 3, 3, 4, 2, 3, 2)

data2 <- data.frame(name, value2)
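With these two frames, the problem is easy to see: no name matches its counterpart exactly, so a plain inner_join drops every row (a quick check, assuming dplyr; the frames are re-created so the chunk runs on its own):

```r
library(dplyr)

name <- c("hamilton twp", "wayne", "berwick", "east wenatchee",
          "north bergen", "toms river", "parsippany-troyhills")
data1 <- data.frame(name, value1 = c(1, 5, 2, 4, 2, 5, 2))

name <- c("hamilton", "wayne township", "berwick borough",
          "east wenatchee city", "north bergen township",
          "toms river township", "parsippany-troyhills twp")
data2 <- data.frame(name, value2 = c(1, 3, 3, 4, 2, 3, 2))

# None of the names agree exactly, so the exact join keeps nothing.
exact <- inner_join(data1, data2, by = "name")
nrow(exact)  # 0
```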

If you don't want to standardize names by replacing variants, you can use the fuzzyjoin package.

But be aware that you will not always get an exact match.
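For example, fuzzyjoin's stringdist_inner_join pairs rows whose names are within a chosen string distance. This is only a sketch: the "lcs" method and max_dist = 9 are tuned to this toy data, and a threshold that generous will also admit some false pairings.

```r
library(dplyr)
library(fuzzyjoin)  # also needs the stringdist package installed

data1 <- data.frame(
  name = c("hamilton twp", "wayne", "berwick", "east wenatchee",
           "north bergen", "toms river", "parsippany-troyhills"),
  value1 = c(1, 5, 2, 4, 2, 5, 2)
)
data2 <- data.frame(
  name = c("hamilton", "wayne township", "berwick borough",
           "east wenatchee city", "north bergen township",
           "toms river township", "parsippany-troyhills twp"),
  value2 = c(1, 3, 3, 4, 2, 3, 2)
)

# Longest-common-subsequence distance tolerates appended words like
# "township", but a generous max_dist also risks spurious matches,
# so inspect the result before trusting it.
fuzzy <- stringdist_inner_join(data1, data2, by = "name",
                               method = "lcs", max_dist = 9)
fuzzy
```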

I've always just standardized names like Andresrcs says, maybe renaming the original column locality_raw or some such. This is definitely my advice if your use case is locality names like in your example. Replacing all abbreviations with full words standardizes them and flags any genuinely ambiguous names. For example, there may be both a Berwick Borough and a Berwick City, so there is no way to confidently match a bare "berwick". This happens annoyingly often in my work.
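One way to encode those rules is a named replacement vector with str_replace_all, joining on the standardized name while keeping the raw spelling in its own column. This is a sketch: the boro rule and the berwick pair below are made up for illustration, and locality_raw is just one possible column name.

```r
library(dplyr)
library(stringr)

# Each rule maps an abbreviation to its expansion; the \\b word
# boundaries keep "twp" from matching inside a longer word.
rules <- c("\\btwp\\b" = "township", "\\bboro\\b" = "borough")

standardize <- function(x) str_replace_all(str_to_lower(x), rules)

# Illustrative pairs (the "berwick boro" row is invented):
data1 <- data.frame(name = c("hamilton twp", "berwick boro"),
                    value1 = c(1, 2))
data2 <- data.frame(name = c("hamilton township", "berwick borough"),
                    value2 = c(1, 3))

# Keep the raw spelling, then join on the standardized name.
d1 <- mutate(data1, locality_raw = name, name = standardize(name))
d2 <- mutate(data2, locality_raw = name, name = standardize(name))
inner_join(d1, d2, by = "name")
```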

However, if you're trying to match fields that likely have typos or incorrect formats, then I'd still expand the abbreviations but follow that up with fuzzy matching.
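That combination might look like the sketch below: expand the abbreviations first, so the fuzzy threshold only has to absorb typos rather than whole missing words. The "jw" method, the 0.1 threshold, and the typo'd "townshp" row are all illustrative assumptions, not a recommendation.

```r
library(dplyr)
library(stringr)
library(fuzzyjoin)

# Step 1: expand known abbreviations.
expand_abbrev <- function(x) {
  str_replace_all(str_to_lower(x), c("\\btwp\\b" = "township"))
}

# Illustrative data: the second data1 row carries a deliberate typo.
data1 <- data.frame(name = c("hamilton twp", "toms river townshp"),
                    value1 = c(1, 5))
data2 <- data.frame(name = c("hamilton township", "toms river township"),
                    value2 = c(1, 3))

# Step 2: fuzzy-join the expanded names with a tight tolerance, so
# only small typos are bridged and unrelated names stay unmatched.
matched <- stringdist_inner_join(
  mutate(data1, name = expand_abbrev(name)),
  mutate(data2, name = expand_abbrev(name)),
  by = "name", method = "jw", max_dist = 0.1
)
matched
```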

I've struggled with that problem around street names, which are rife with misspellings and inconsistencies. I tried brute force: filtering out problems, writing str_replace calls to fix them, and other similar tactics. But it is a bit like trying to stop the tide coming in with a bucket. For my next attempt I plan to try using deep learning to fix the problems; we'll see how well that works.

