I have this dataset in R that looks something like this:
    address = c("882 4N Road River NY, NY 12345", "882 - River Road NY, ZIP 12345", "123 Fake Road Boston Drive Boston", "123 Fake - Rd Boston 56789")

    name = c("ABC Center Building", "Cent. Bldg ABC", "BD Home 25 New", "Boarding Direct 25")

    my_data = data.frame(address, name)

                                address                name
    1    882 4N Road River NY, NY 12345 ABC Center Building
    2    882 - River Road NY, ZIP 12345      Cent. Bldg ABC
    3 123 Fake Road Boston Drive Boston      BD Home 25 New
    4        123 Fake - Rd Boston 56789  Boarding Direct 25
My goal is to learn how to remove "fuzzy duplicates" from this dataset. In the example above, it is clear to a human that there are only 2 unique records: rows 1 and 2 describe the same place, and so do rows 3 and 4. An exact comparison cannot detect this, because no two rows are character-for-character identical, so some kind of fuzzy (approximate string matching) technique is needed.
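To make concrete what I mean by a fuzzy technique, here is a rough sketch using only base R (`adist`, `hclust`, `cutree`). The normalization step, the choice of Levenshtein distance, and the 0.5 cut-off are all my own guesses for illustration, not something I know to be appropriate. `adist` also builds a full n x n distance matrix, so I am not sure this approach scales.

```r
address <- c("882 4N Road River NY, NY 12345", "882 - River Road NY, ZIP 12345",
             "123 Fake Road Boston Drive Boston", "123 Fake - Rd Boston 56789")
name <- c("ABC Center Building", "Cent. Bldg ABC", "BD Home 25 New", "Boarding Direct 25")
my_data <- data.frame(address, name)

# Build one comparison key per row: lower-case, strip punctuation, collapse spaces.
key <- tolower(paste(my_data$address, my_data$name))
key <- gsub("[^a-z0-9 ]", " ", key)
key <- trimws(gsub("\\s+", " ", key))

# Pairwise Levenshtein distances, scaled by the longer string of each pair
# so that values fall between 0 (identical) and 1 (completely different).
d <- adist(key)
len <- nchar(key)
d_norm <- d / outer(len, len, pmax)

# Group rows whose scaled distance is below a threshold.
# ASSUMPTION: 0.5 is an arbitrary cut-off chosen only for illustration.
hc <- hclust(as.dist(d_norm), method = "average")
my_data$group <- cutree(hc, h = 0.5)

# Keep the first row of each group as the representative record.
deduped <- my_data[!duplicated(my_data$group), ]
print(deduped)
```

I do not know whether a plain edit distance is a good fit here, since the words in my records often appear in a different order (for example "ABC Center Building" versus "Cent. Bldg ABC").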
Does anyone have any ideas as to how this can be done? I tried some approaches with fuzzy joins, but my real dataset has a few thousand rows and I get the error "cannot allocate vector of size X Gb".
I found some links such as https://cran.r-project.org/web/packages/RecordLinkage/vignettes/WeightBased.pdf, but I am not sure whether they are applicable to my problem or how to apply them.
In the end, I am trying to remove fuzzy duplicates based on the address and name column - does anyone have any idea how I might be able to do this?