I’m using the tidyverse to aggregate some information about websites that have several different (and unfortunately conflicting) identifiers. I can’t share my data, but as an example there are three columns: one with a “unique” ID number, one with the website name, and one with the URL. The catch is that sometimes two sites will have the same name but different ID numbers, or the same URL but different IDs, as below:
data <- data.frame(
  id = c(1, 2, 2, 3),
  name = c("Bed Bath and Beyond", "Bed Bath & Beyond", "Bed Bath & Beyond", "Bed Bath & Beyond"),
  url = c("bedbathandbeyond.com", "bedbathandbeyond.com", "bedbathandbeyond.com", "www.bedbathandbeyond.com")
)
As you can see, none of these individual variables is sufficient to group all of the relevant entries together, and using all three variables in a group_by statement would result in three separate rows where there should only be one.
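To make the problem concrete, here is a small reproduction using the example data. It assumes only that dplyr is loaded; grouped is just an illustrative name.

```r
library(dplyr)

data <- data.frame(
  id = c(1, 2, 2, 3),
  name = c("Bed Bath and Beyond", "Bed Bath & Beyond",
           "Bed Bath & Beyond", "Bed Bath & Beyond"),
  url = c("bedbathandbeyond.com", "bedbathandbeyond.com",
          "bedbathandbeyond.com", "www.bedbathandbeyond.com")
)

# Grouping on all three columns at once: rows only land in the same
# group when id AND name AND url all match.
grouped <- data %>%
  group_by(id, name, url) %>%
  summarise(n = n(), .groups = "drop")

nrow(grouped)  # 3 groups, where I want 1
```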
All this to say: is it possible to do a group_by that checks for matches on id OR name OR url? Right now I’m using complicated left_joins, but it’s clunkier than I’d like.
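For what it's worth, the behaviour I'm after could be sketched roughly like this: treat two rows as linked if they share any one identifier, and keep merging linked rows until nothing changes. This is only a minimal sketch, not something dplyr provides. group_by_any and group_id are hypothetical names I made up, and I'm assuming the data has no existing group_id column and no NA identifiers (NAs would be treated as matching each other).

```r
library(dplyr)

data <- data.frame(
  id = c(1, 2, 2, 3),
  name = c("Bed Bath and Beyond", "Bed Bath & Beyond",
           "Bed Bath & Beyond", "Bed Bath & Beyond"),
  url = c("bedbathandbeyond.com", "bedbathandbeyond.com",
          "bedbathandbeyond.com", "www.bedbathandbeyond.com")
)

# Hypothetical helper (not part of dplyr): labels rows that are linked,
# directly or through other rows, by a shared value in ANY of `keys`.
group_by_any <- function(df, keys) {
  # Start with every row in its own group.
  df$group_id <- seq_len(nrow(df))
  repeat {
    previous <- df$group_id
    for (key in keys) {
      # Rows sharing a value in this column take the smallest label among them.
      df <- df %>%
        group_by(across(all_of(key))) %>%
        mutate(group_id = min(group_id)) %>%
        ungroup()
    }
    # Stop once a full pass over all key columns changes nothing.
    if (identical(df$group_id, previous)) break
  }
  df
}

result <- group_by_any(data, c("id", "name", "url"))
n_distinct(result$group_id)  # 1: all four rows end up in the same group
```

The resulting group_id could then be passed to an ordinary group_by. The repeated passes are there because a single pass over the columns can miss rows that are only linked indirectly (row 1 and row 4 here share no identifier with each other).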