Deduplicating Records in R

omario · November 14, 2022, 2:59pm

I have the following dataset in R:

address = c(
  "44 Ocean Road Atlanta Georgia",
  "882 4N Road River NY, NY 12345",
  "882 - River Road NY, ZIP 12345",
  "123 Fake Road Boston Drive Boston",
  "123 Fake - Rd Boston 56789",
  "3665 Apt 5 Moon Crs",
  "3665 Unit Moon Crescent",
  "NO ADDRESS PROVIDED",
  "31 Silver Way Road",
  "1800 Orleans St, Baltimore, MD 21287, United States",
  "1799 Orlans Street, Maryland , USA"
)

name = c(
  "Pancake House of America", "ABC Center Building", "Cent. Bldg ABC",
  "BD Home 25 New", "Boarding Direct 25", "Pine Recreational Center",
  "Pine Rec. cntR", "Boston Swimming Complex", "boston gym center",
  "mas hospital", "Massachusetts Hospital"
)

blocking_var = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3)

my_data = data.frame(address, name, blocking_var)

The data looks something like this:

> my_data
                                               address                     name blocking_var
1                        44 Ocean Road Atlanta Georgia Pancake House of America            1
2                       882 4N Road River NY, NY 12345      ABC Center Building            1
3                       882 - River Road NY, ZIP 12345           Cent. Bldg ABC            1
4                    123 Fake Road Boston Drive Boston           BD Home 25 New            1
5                           123 Fake - Rd Boston 56789       Boarding Direct 25            1
6                                  3665 Apt 5 Moon Crs Pine Recreational Center            2
7                              3665 Unit Moon Crescent           Pine Rec. cntR            2
8                                  NO ADDRESS PROVIDED  Boston Swimming Complex            2
9                                   31 Silver Way Road        boston gym center            2
10 1800 Orleans St, Baltimore, MD 21287, United States             mas hospital            3
11                  1799 Orlans Street, Maryland , USA   Massachusetts Hospital            3

I am trying to follow this R tutorial (https://cran.r-project.org/web/packages/RecordLinkage/vignettes/WeightBased.pdf) and learn how to remove duplicates based on fuzzy conditions. The goal is, within each "block" (rows sharing the same blocking_var), to keep all unique records and, for fuzzy duplicates, to keep only the first occurrence.

I tried the following code:

library(RecordLinkage)
pairs=compare.dedup(my_data, blockfld=3)

But when I inspect the results, everything is NA. Given this, I think I am doing something wrong, and there does not seem to be any point in continuing until it is resolved.

Can someone please show me how I can resolve this problem and continue on with the tutorial?

Thank you!

williaml · November 14, 2022, 8:52pm

Is this question the same as this one: "Removing Fuzzy Duplicates in R?" (General, RStudio Community)?

omario · November 14, 2022, 8:57pm

@williaml This is a different question. In the previous question, I was trying to remove duplicate rows using the Levenshtein distance. In this question, I am trying to remove duplicates using a probabilistic record linkage approach. Thanks!
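
The thread closed without an answer, so here is a minimal sketch of how the weight-based workflow from the vignette might be continued on this data. It rests on a few assumptions. First, the NA values seen in the output are most likely the is_match column, which compare.dedup can only fill in when a vector of true labels is passed through its identity argument; that has not been verified against the exact output above. Second, the default comparison is exact string equality, so strcmp = 1:2 is used here to get Jaro-Winkler similarities for address and name instead. Third, the 0.7 threshold and the names dup_pairs and deduped are placeholders made up for illustration and not values from the tutorial.

```r
library(RecordLinkage)

my_data <- data.frame(
  address = c(
    "44 Ocean Road Atlanta Georgia", "882 4N Road River NY, NY 12345",
    "882 - River Road NY, ZIP 12345", "123 Fake Road Boston Drive Boston",
    "123 Fake - Rd Boston 56789", "3665 Apt 5 Moon Crs",
    "3665 Unit Moon Crescent", "NO ADDRESS PROVIDED", "31 Silver Way Road",
    "1800 Orleans St, Baltimore, MD 21287, United States",
    "1799 Orlans Street, Maryland , USA"
  ),
  name = c(
    "Pancake House of America", "ABC Center Building", "Cent. Bldg ABC",
    "BD Home 25 New", "Boarding Direct 25", "Pine Recreational Center",
    "Pine Rec. cntR", "Boston Swimming Complex", "boston gym center",
    "mas hospital", "Massachusetts Hospital"
  ),
  blocking_var = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3),
  stringsAsFactors = FALSE
)

# Only compare records that share blocking_var (column 3), and use string
# similarity (Jaro-Winkler by default) on address and name (columns 1 and 2).
pairs <- compare.dedup(my_data, blockfld = 3, strcmp = 1:2)

# is_match stays NA because no `identity` vector of true labels was supplied.
head(pairs$pairs)

# Compute one weight per candidate pair, as in the weight-based vignette.
pairs <- epiWeights(pairs)
summary(pairs$Wdata)

# ASSUMPTION: 0.7 is an arbitrary placeholder threshold; inspect the weights
# above and tune it for real data.
result <- epiClassify(pairs, threshold.upper = 0.7)

# Pairs predicted as links ("L"). For each, drop the record that appears
# later in the data so the first occurrence is kept.
dup_pairs <- result$pairs[result$prediction == "L", c("id1", "id2")]
drop_rows <- unique(pmax(dup_pairs$id1, dup_pairs$id2))
deduped <- my_data[!seq_len(nrow(my_data)) %in% drop_rows, ]
deduped
```

How many rows survive depends entirely on the threshold, so it is worth looking at the weights of a few known duplicates (for example rows 2 and 3) before settling on a value.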
