Removing Fuzzy Duplicates in R?

I have this dataset in R that looks something like this:

address = c("882 4N Road River NY, NY 12345", "882 - River Road NY, ZIP 12345", "123 Fake Road Boston Drive Boston", "123 Fake - Rd Boston 56789")
            
 name = c("ABC Center Building", "Cent. Bldg ABC", "BD Home 25 New", "Boarding Direct 25")
            
my_data = data.frame(address, name)

                            address                name
1    882 4N Road River NY, NY 12345 ABC Center Building
2    882 - River Road NY, ZIP 12345      Cent. Bldg ABC
3 123 Fake Road Boston Drive Boston      BD Home 25 New
4        123 Fake - Rd Boston 56789  Boarding Direct 25

My goal is to learn how to remove "fuzzy duplicates" from this dataset. For example, in the dataset above it is clear to a human that there are only 2 unique records, but a computer would have difficulty coming to that conclusion, so a "fuzzy" matching technique has to be used to tackle the problem.
Does anyone have any ideas on how this can be done? I tried some approaches with fuzzy joins, but my real dataset has a few thousand rows and I get the error "cannot allocate vector of size X Gb".

I found some links like this https://cran.r-project.org/web/packages/RecordLinkage/vignettes/WeightBased.pdf, but I am not sure whether they are applicable to my problem, or how to apply them.

In the end, I am trying to remove fuzzy duplicates based on the address and name columns - does anyone have an idea how I might be able to do this?

Thanks!

Interesting problem.

  1. cannot allocate vector of size X Gb may be fixable, at least on Unix-derived systems with adequate RAM. There is a ulimit that, by default, restricts any single process from grabbing more than X Gb (usually 8). This can be adjusted.
  2. Even if changing the ulimit is not feasible, whether because of memory constraints or a lack of administrative privileges, and a fuzzy join gives satisfactory results on a few hundred rows within the memory you have, there's no reason not to write a function that slices the data into blocks of row indices and iterates over them.
  3. Without a decision rule to specify which duplicate should be discarded, the best that any approach can do is to flag for human review those records that appear to be candidates. For that, a function that collects the row indices of each candidate record into a vector or list would allow a simple subset to pull the rows for examination.
  4. The problem doesn't demand a single tool. For example, assuming that each building name has a single street address (basically, no suite numbers), creating a sorted vector of unique entries in name permits identification of duplicated name values that may also have competing address values. Assuming that the number of unique values is less than the number of rows, run-length encoding with base::rle() can identify successive identical name values (see the first sketch after this list).
  5. Since street numbers commonly begin addresses, those could be extracted with regular expressions and also subjected to rle().
  6. Natural language processing tools, such as those provided by {tidytext}, can tokenize the input, making it possible to identify words and abbreviations for the same entity, such as Road and Rd or Drive and Dr. You can then expand the abbreviations to allow more direct comparisons.
  7. Finally, NLP can provide metrics of lexicographic distance between text fragments like those in the address variable to identify "close" matches (see the second sketch after this list). Cruder, but possibly effective, are sounds-like (phonetic) tools.
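To make points 4 and 5 concrete, here is a minimal base-R sketch using the my_data example from the question. The regular expression that pulls out the leading street number is an illustrative assumption; real addresses may need more careful parsing, and the same rle() pattern applies to a sorted vector of name values.

# extract the leading street number (assumes the address begins with it)
street_no <- sub("^[^0-9]*([0-9]+).*$", "\\1", my_data$address)

# runs of identical values in the sorted vector point to possible duplicates
r <- rle(sort(street_no))
r$values[r$lengths > 1]   # returns "123" "882" for the example data

For points 6 and 7, a sketch along the lines below could use the {stringdist} package (assuming it is installed). The normalize() helper, the choice of Jaro-Winkler distance, and the 0.25 threshold are all assumptions to tune against your own data; the flagged pairs are candidates for human review, as in point 3, not rows to delete automatically.

library(stringdist)

# crude normalization: lower-case, expand a couple of common abbreviations,
# strip punctuation, squeeze whitespace
normalize <- function(x) {
  x <- tolower(x)
  x <- gsub("\\brd\\b", "road", x, perl = TRUE)
  x <- gsub("\\bbldg\\b", "building", x, perl = TRUE)
  x <- gsub("[[:punct:]]", " ", x)
  gsub("[[:space:]]+", " ", trimws(x))
}

# combine the normalized columns into one comparison key per row
key <- paste(normalize(my_data$address), normalize(my_data$name))

# pairwise Jaro-Winkler distances between the keys (0 = identical)
d <- stringdistmatrix(key, key, method = "jw")

# keep each pair once (upper triangle) and flag those below the threshold
threshold <- 0.25
pairs <- which(d < threshold & upper.tri(d), arr.ind = TRUE)
data.frame(row_a = pairs[, "row"], row_b = pairs[, "col"], dist = d[pairs])

For a few thousand rows the full distance matrix is roughly n^2 doubles (a few hundred megabytes), well below what a full fuzzy join materializing every pair of rows with all columns can consume; if even that is too much, the same comparison can be run over slices of rows as suggested in point 2.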

Just leaving these links here, as I think they have a lot of useful discussion for this question (short version: it can be really hard).


Thank you for these suggestions! I will take a look at them!

thanks! will check this out!
