I'm using the tidyverse to aggregate some information about websites that have several different (and unfortunately conflicting) identifiers. I can't share my data, but for example there are three columns: one with a "unique" ID number, one with the website name, and one with the URL--except sometimes two sites will have the same name but different ID numbers, or the same URL but different IDs, as below:
data <- data.frame(
  id = c(1, 2, 2, 3),
  name = c("Bed Bath and Beyond", "Bed Bath & Beyond", "Bed Bath & Beyond", "Bed Bath & Beyond"),
  url = c("bedbathandbeyond.com", "bedbathandbeyond.com", "bedbathandbeyond.com", "www.bedbathandbeyond.com")
)
As you can see, none of these individual variables is sufficient to group all of the relevant entries together--and using all three variables in a group_by statement would result in three separate rows that should only be one.
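To make that concrete, here is a minimal sketch of the three-column grouping on the example data. It assumes dplyr is installed; the `counts` name and the `n` column are only there for illustration and are not part of my real pipeline.

```r
library(dplyr)

# Same example data as above
data <- data.frame(
  id = c(1, 2, 2, 3),
  name = c("Bed Bath and Beyond", "Bed Bath & Beyond", "Bed Bath & Beyond", "Bed Bath & Beyond"),
  url = c("bedbathandbeyond.com", "bedbathandbeyond.com", "bedbathandbeyond.com", "www.bedbathandbeyond.com")
)

# Grouping on all three columns makes every distinct
# (id, name, url) combination its own group
counts <- data %>%
  group_by(id, name, url) %>%
  summarise(n = n(), .groups = "drop")

nrow(counts)  # 3 groups, although all four rows describe the same site
```

Only the two identical rows with id 2 collapse together; the rows with id 1 and id 3 each stay in their own group.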
All this to say, is it possible to do a group_by that checks for matches on id OR name OR url? Right now I'm using complicated left_joins, but that's clunkier than I'd like.