identifying exact or near duplicate names in a dataset

A really good way to ask this is with a minimal REPRoducible EXample (reprex). A reprex makes it much easier for others to understand your issue and figure out how to help.

Here's a video guiding you through how to make one: [Video] Reproducible Examples and the `reprex` package


In past projects I've used stringdist with good success. Colin Fay created a few vignettes on this method here:
https://cran.r-project.org/web/packages/tidystringdist/vignettes/Getting_started.html

For a longer exploration, Colin has a blog post below, which works up to string distance on the Game of Thrones dataset (the string distance discussion starts about halfway down).

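String distance might be problematic if you have too many companies with similar names, since genuinely different companies can end up within the same distance threshold as true duplicates.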
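To make the idea (and the caveat about similar company names) concrete, here is a rough sketch of pairwise string distance with a cutoff. It is written in plain Python only to keep it dependency-free; in R, `stringdist::stringdist()` does the distance calculation for you. The company names, the `near_duplicates` helper, and the cutoff of 2 are all made up for illustration, and Levenshtein is just one of several distance methods you could pick.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Edit distance: fewest single-character insertions, deletions, or substitutions."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute (or match)
        previous = current
    return previous[-1]


def near_duplicates(names, max_distance=2):
    """Hypothetical helper: (name_a, name_b, distance) for each pair within the cutoff."""
    # Light normalisation so case and stray whitespace do not count as edits.
    cleaned = [n.strip().lower() for n in names]
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(cleaned), 2):
        d = levenshtein(a, b)
        if d <= max_distance:
            pairs.append((names[i], names[j], d))
    return pairs


# Made-up company names, for illustration only.
companies = ["Acme Corp", "Acme Corp.", "ACME Corp", "Acme Core", "Globex Inc"]
for a, b, d in near_duplicates(companies):
    print(a, "|", b, "|", d)
```

Note that "Acme Core" gets paired with every "Acme Corp" variant even though it could be a different company, which is exactly where a fixed cutoff starts to struggle. Comparing every pair also grows quadratically with the number of names, so on a large dataset you would probably want to group the names first (for example by first letter or by country) and only compare within groups.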


Stack Overflow has a nice discussion:
