Hi all!
I'm using the tidyverse to aggregate some information about websites, which have several different (and unfortunately conflicting) identifiers. I can't share my data, but as an example there are three columns: one with a "unique" ID number, one with the website name, and one with the URL--except sometimes two sites will have the same name but different ID numbers, or the same URL but different IDs, as below:
data <- data.frame(
  id = c(1, 2, 2, 3),
  name = c("Bed Bath and Beyond", "Bed Bath & Beyond", "Bed Bath & Beyond", "Bed Bath & Beyond"),
  url = c("bedbathandbeyond.com", "bedbathandbeyond.com", "bedbathandbeyond.com", "www.bedbathandbeyond.com")
)
As you can see, none of these individual variables is sufficient to group all of the relevant entries together--and using all three variables in a group_by statement would result in three separate rows that should only be one.
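The three-row result can be reproduced with a short sketch, assuming dplyr is installed and using the same example data (the `grouped` name and the `n` count column are only there for illustration):

```r
library(dplyr)

# Same example data as in the post
data <- data.frame(
  id = c(1, 2, 2, 3),
  name = c("Bed Bath and Beyond", "Bed Bath & Beyond", "Bed Bath & Beyond", "Bed Bath & Beyond"),
  url = c("bedbathandbeyond.com", "bedbathandbeyond.com", "bedbathandbeyond.com", "www.bedbathandbeyond.com")
)

# group_by() on several columns matches on id AND name AND url,
# so any single difference starts a new group
grouped <- data %>%
  group_by(id, name, url) %>%
  summarise(n = n(), .groups = "drop")

nrow(grouped)  # 3 groups, even though all four rows describe one site
```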
All this to say, is it possible to do a group_by that checks for matches on id OR name OR url? Right now I'm using complicated left_joins, but it's clunkier than I'd like.
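One way to read "id OR name OR url" is that two rows belong together whenever they are linked by a chain of shared values, directly or through other rows. The following is only a rough sketch of that reading, not a built-in dplyr feature: `group_by_any` is a hypothetical helper name, and the loop is a simple label-propagation pass written with base R's `ave()`, which may be slow on large data.

```r
library(dplyr)

data <- data.frame(
  id = c(1, 2, 2, 3),
  name = c("Bed Bath and Beyond", "Bed Bath & Beyond", "Bed Bath & Beyond", "Bed Bath & Beyond"),
  url = c("bedbathandbeyond.com", "bedbathandbeyond.com", "bedbathandbeyond.com", "www.bedbathandbeyond.com")
)

# Hypothetical helper (not part of dplyr): start with one label per row, then
# repeatedly pull each row's label down to the smallest label found among
# rows sharing the same value in any of the given columns.
group_by_any <- function(df, cols) {
  grp <- seq_len(nrow(df))
  repeat {
    old <- grp
    for (col in cols) {
      # ave() applies min() within each distinct value of this column
      grp <- ave(grp, df[[col]], FUN = min)
    }
    # Stop once a full pass over all columns changes nothing
    if (identical(grp, old)) break
  }
  grp
}

data$group <- group_by_any(data, c("id", "name", "url"))

# An ordinary group_by() on the derived label then collapses the linked rows
collapsed <- data %>%
  group_by(group) %>%
  summarise(n = n(), .groups = "drop")
```

On the example data every row is linked to the others through a shared id, name, or url, so all four rows end up with the same label and `collapsed` has a single row.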
TIA!