Selecting the best match using Jaro-Winkler in stringdist

Leanne · April 20, 2020, 2:31pm

I'm using the Jaro-Winkler method in stringdist to match two large databases. Both contain some of the same companies and addresses, but the names and addresses are sometimes spelled differently. My current output file keeps every match with a dist < 0.3, because I don't want alternative spellings to be filtered out of the final file. However, some rows return multiple matches that meet this criterion. I would like to keep only the closest match when there are several, so that the final output file stays smaller (I have to go over it manually afterwards).
This is the relevant bit of code, where x is the country name and matches are made on the basis of similar company names and cities.

y <- stringdist_join(b, a,
                     by = c(INSTALLATION_NAME = "FacilityName", CITY = "City"),
                     mode = "left", ignore_case = TRUE,
                     method = "jw", p = .05, distance_col = "dist")

assign(x, y %>% filter(INSTALLATION_NAME.dist < 0.3, CITY.dist < 0.3))

nirgrahamuk · April 20, 2020, 2:43pm

Something like this, which keeps one row per INSTALLATION_NAME and CITY pair:

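# x holds the country name as a string, so fetch the data frame that assign() stored under it
z <- get(x) %>%
  group_by(INSTALLATION_NAME, CITY) %>%
  # keep the row with the smallest combined distance in each group
  filter(row_number(INSTALLATION_NAME.dist * CITY.dist) == 1)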
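To see how the per-group selection behaves, here is a minimal sketch on a toy data frame. The name `matches`, the helper column `combined`, and all the values are made up for illustration; they only mimic the shape of the filtered join output. One assumption to flag: the sketch ranks by the sum of the two distances rather than the product, because a product is zero whenever either distance is zero and then cannot separate the candidates. That is a substitution on my part, not something the thread confirms is needed for the real data.

```r
library(dplyr)

# Toy stand-in for the filtered join output (hypothetical values, not real distances)
matches <- data.frame(
  INSTALLATION_NAME      = c("Acme Steel", "Acme Steel", "Nordic Paper", "Nordic Paper"),
  CITY                   = c("Rotterdam", "Rotterdam", "Oslo", "Oslo"),
  FacilityName           = c("Acme Steel BV", "Acme Stone", "Nordic Papers AS", "Nordic Paper"),
  City                   = c("Rotterdam", "Rotterdam", "Oslo", "Olso"),
  INSTALLATION_NAME.dist = c(0.08, 0.20, 0.10, 0.00),
  CITY.dist              = c(0.00, 0.00, 0.00, 0.08),
  stringsAsFactors = FALSE
)

closest <- matches %>%
  # combined score: sum of both distances (assumed here; the product would be 0 for every row)
  mutate(combined = INSTALLATION_NAME.dist + CITY.dist) %>%
  group_by(INSTALLATION_NAME, CITY) %>%
  # row_number() ranks within each group and breaks ties by row order
  filter(row_number(combined) == 1) %>%
  ungroup()

print(closest)
```

Each INSTALLATION_NAME and CITY pair ends up with a single row. If ties should be kept for manual review instead of being broken by row order, `min_rank(combined) == 1` could replace the `row_number()` condition.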

Leanne · April 20, 2020, 2:52pm

This seems to work, thanks so much for the quick reply! It would have taken me ages to figure out.
