De-duplicating names by approximation

Hi, I'm asking for advice on how to approach a problem. I have data regarding state spending. One set holds payments to vendors, and the only information given is the vendor name. I'd like to de-duplicate the list so that I can associate payments with vendors. As you might imagine, there are many instances of names like these:

ACCESSIBILITY DOT NET

ACCESSIBILITY DOT NET LLC

These names refer to the same vendor. Also:

ZWONITZER INC

ZWONITZER PROPANE INC

Again, these are the same vendor. I have about 150K distinct vendor names, as found with dplyr::distinct(). There are about 10,390K payments. Being in Kansas, about 1,200 of the names start with "KANSAS."

I'm aware of approximate string matching, as in the R "stringdist" package. These methods usually take two strings and compute a score. I don't think I can feasibly match each vendor name against every other name, which is why I need advice. I don't need code; I just need an idea of how to proceed.

Putting on my old corporate-lawyer cap: it depends.

On the one hand, there may be distinct legal entities, such as Farmer's Propane of Shawnee Mission, L.P. and Farmer's Propane of Hutchinson, L.P. Assuming they don't have overlapping market areas, they could both be doing business as Farmer's Propane. They might be controlled by siblings: Farmer's Propane was a parent's proprietorship, each sibling inherited part of the business, and they operate separately. Or Farmer's Propane may have been a general partnership to which the siblings succeeded; the partnership owns both businesses, and they are operated under common control for its benefit.

The possibilities multiply—I won't go into the offshore shell company variations.

That raises the question of the goal: legal distinctiveness, operational distinctiveness, or merely nominal distinctiveness. And that depends on the purpose of the analysis.

Let's suppose that nominal distinctiveness is the goal, because attempting either of the others for 150K names is madness.

Here's a framework:

Every R problem can be thought of with advantage as the interaction of three objects: an existing object, x; a desired object, y; and a function, f, that will return a value of y given x as an argument. In other words, school algebra: f(x) = y. Any of the objects can be composites.

Here x can be a list of two vectors, vendor and payee, and y is the intersection of those two vectors after each has been subjected to a function, f (to be composed from the pieces below), that makes each element nominally distinct within its vector.

Let f_1 be a function that removes stopwords, taken from a vector, stops, of the common suffix identifiers:

stops <- c("Inc","Inc.","Incorporated","Corporation","Corp","Corp.","Company","LP","L.P","LLC","L.L.C")

(Scanning the ends of names helps identify the suffix candidates actually present in each vector; use {stringr} or another regex tool, as sketched below.)
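A minimal way to do that scan (a sketch; vendor_names stands in for your distinct-names vector):

library(stringr)
# tally the final word of each name; frequent last words are suffix candidates
last_words <- str_extract(vendor_names, "\\S+$")
head(sort(table(last_words), decreasing = TRUE), 20)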

# escape the literal dots so "Inc." and "L.P" match only actual periods
esc <- gsub(".", "\\.", stops, fixed = TRUE)
pat <- paste0("\\s+(", paste(esc, collapse = "|"), ")$")
# f_1 strips a trailing stopword; gsub is vectorized, so no loop is needed
f_1 <- function(nm) gsub(pat, "", nm, ignore.case = TRUE)
x_clean <- f_1(x)  # x and y being the vendor and payee vectors
y_clean <- f_1(y)
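Applied to the example names from the question:

f_1(c("ACCESSIBILITY DOT NET LLC", "ZWONITZER PROPANE INC"))
#> [1] "ACCESSIBILITY DOT NET" "ZWONITZER PROPANE"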

With those results, an f_2 could use unique(intersect(., .)) to find names identical in both lists without regard to the stopwords. Those can be set aside and then removed from the vectors to reduce the search space.
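A sketch, reusing x_clean and y_clean from above:

# exact matches between the two cleaned vectors, ignoring the stripped stopwords
f_2 <- function(a, b) unique(intersect(a, b))
matched <- f_2(x_clean, y_clean)
# set the matches aside and drop them to shrink the remaining search space
x_rest <- setdiff(x_clean, matched)
y_rest <- setdiff(y_clean, matched)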

Another f_n would be to extract the records that begin with KANSAS, chop off the stopwords, and work from the last word back to find the unique names.
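A sketch of that, keying on the final word (kansas_only is illustrative):

# pull the KANSAS records from the cleaned vendor vector
kansas_only <- x_clean[startsWith(x_clean, "KANSAS")]
# key each record by its final word; likely duplicates cluster under one key
last_word <- sub("^.*\\s", "", kansas_only)
head(split(kansas_only, last_word))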

Etc.
