I'm using the Jaro-Winkler method from stringdist (through fuzzyjoin's stringdist_join) to match two large databases. Both contain some of the same companies and addresses, but the names and addresses are sometimes spelled differently. I currently have an output file with all matches that have a dist < 0.3, since I don't want different spellings etc. to be filtered out of my final file. However, some rows return multiple matches that meet this criterion. I would like to keep only the closest match when there are multiple matches, so that my final output file doesn't become so large (I have to go over it manually afterwards).
This is the relevant bit of code, where x is the country name and matches are made on the basis of similar company names and cities.
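library(fuzzyjoin)
library(dplyr)

# Fuzzy left join on company name and city (Jaro-Winkler)
y <- stringdist_join(b, a,
                     by = c(INSTALLATION_NAME = "FacilityName", CITY = "City"),
                     mode = "left", ignore_case = TRUE,
                     method = "jw", p = .05, distance_col = "dist")

# Keep pairs below 0.3 on both columns; store under the name held in x
assign(x, y %>% filter(INSTALLATION_NAME.dist < 0.3, CITY.dist < 0.3))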
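To illustrate the result I'm after, here is a minimal sketch on made-up data that mimics the shape of the filtered join output. The data frame `matches` and its values are hypothetical. Adding the two distances into one score, and treating INSTALLATION_NAME plus CITY as the key that identifies a row of b, are both assumptions on my part; a unique row id in b would probably be a safer grouping key.

```r
library(dplyr)

# Hypothetical stand-in for the filtered join result (values are made up)
matches <- data.frame(
  INSTALLATION_NAME      = c("Acme Steel", "Acme Steel", "Beta Chem"),
  CITY                   = c("Gent", "Gent", "Lyon"),
  FacilityName           = c("ACME Steel NV", "Acme Steal", "Beta Chemicals"),
  City                   = c("Ghent", "Gent", "Lyon"),
  INSTALLATION_NAME.dist = c(0.12, 0.05, 0.10),
  CITY.dist              = c(0.06, 0.00, 0.00)
)

# For each row of b (assumed to be identified by name + city), keep the single
# candidate with the smallest combined distance. Summing the two distances is
# an assumption; any other combined score could be used in its place.
closest <- matches %>%
  group_by(INSTALLATION_NAME, CITY) %>%
  slice_min(INSTALLATION_NAME.dist + CITY.dist, n = 1, with_ties = FALSE) %>%
  ungroup()

print(closest)
```

With `with_ties = FALSE`, exactly one candidate is kept per group even when two candidates have the same combined distance; setting it to `TRUE` would keep all tied candidates for manual review.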