`group_by` with OR

Hi all!

I'm using the tidyverse to aggregate some information about websites which has several different (and unfortunately conflicting) identifiers. I can't share my data, but for example there are three columns: one with a "unique" ID number, one with the website name, and one with the URL--except sometimes two sites will have the same name but different ID numbers, or the same URL but different IDs, as below:

data <- data.frame(
    id = c(1, 2, 2, 3),
    name = c("Bed Bath and Beyond", "Bed Bath & Beyond", "Bed Bath & Beyond", "Bed Bath & Beyond"),
    url = c("bedbathandbeyond.com", "bedbathandbeyond.com", "bedbathandbeyond.com", "www.bedbathandbeyond.com")
)

As you can see, none of these individual variables is sufficient to group all of the relevant entries together--and using all three variables in a group_by statement would result in three separate rows that should only be one.

All this to say, is it possible to do a group_by that checks for matches on id OR name OR url? Right now I'm using complicated left_joins but it's loftier than I'd like.

TIA!

1 Like

Maybe you could use regex to try and boil the name or the url down to something that captures the variations? I'm terrible at regex, but even if you just did something like extracting the naked domain name from the url column, you could something suitable.

Is there a reason you can't group by only URL? It seems like the URL would differentiate different sites regardless of what the ID / name is. You can than generate a new name (and/or ID) based on the URL.

You could use adist or extended versions in the excellent stringdist to calculate the edit distance between strings, and cluster accordingly. On sizable data where you can't check if the results make sense, it may be a pain, though.

If you've got a correct set to match, it's easier.

2 Likes

Yea! It's inconsistent whether or not a URL starts with www. or https:// or both or nothing at all; this is the column with the smallest amount of variation, though, so it may be the best to use with regex.

Yeah, we're working with 100k+ rows of data into about 1300 sites :confused:

That said, this package may be really useful for some other stuff I'm working on, so thanks!!

This may be what I have to do :frowning:

Yep- seems like the best option, in my opinion, is to do some data cleaning on one of these columns to get a clean grouping variable and then use that. The URL should be the easiest because you can remove any values such as https://, www., .com and you will be left with the host (e.g. bedbathandbeyond).

1 Like

urltools::url_parse looks like a good way to do this, @queermath and @kentm :slight_smile: Would probably need to pair it with purrr:map to just pull the domain out of a column of URLs!

4 Likes

@rensa Nice! I don't work much with web data but this is a nice package to know about. Thanks for the recommendation.

1 Like

ooh! i ended up working out 95% of the problem by removing the http(s)://www. from the URLs and hard-coding the rest; will have to see if this brings me 98% of the way there on monday! will also be helpful for future projects, i'm sure.

1 Like