joining data-frame to itself (with different cols), then getting rid of half the rows?

Hello everyone,

this is hopefully a simple one. Let's say I have

test <- tribble(
  ~id1, ~id2, ~something, ~different,
  1,    2,    "nyan",     "cat",
  2,    1,    "blue",     "elephant",
  3,    4,    "squeaky",  "mouse",
  4,    3,    "hard",     "rock"
)

I want to end up with a tibble that gets rid of rows with duplicate ids, but contain all the information. So in this case, my result would look like this:

tribble(
  ~id1, ~id2, ~something_1, ~different_1, ~something_2, ~different_2,
  1,    2,    "nyan",       "cat",        "blue",       "elephant",
  3,    4,    "squeaky",    "mouse",      "hard",       "rock"
)

What's the best way to achieve this? I tried working with left_join:
left_join(test, test, by = c("id1" = "id2"), suffix = c("_1", "_2"))
...but then I still don't know how to get rid of the "duplicate" rows (and I have "id2" in there twice...).
Thank you!

If you wanted to get rid of rows with duplicate ids, you could use distinct(). But that's not what you want to do here, since you want to keep the information. Your example looks more like a pivot_wider().

One thing is not clear though: will you always have 2 rows for each id, or can the number of "duplicates" vary. In that second case, how do you want to handle it: you could make a data.frame as wide as the id with most duplicates, and fill all the others with NA (using pivot_wider). You could make a list where each element corresponds to an id and contains a subsetted tibble (using map and filter). Ultimately, that all depends what you want to use the result for.

Oh, and just to be clear, if I understand correctly, the order of id1 and id2 doesn't matter, right? Basically it would be easier with defining a id3 = paste(sort(c(id1, id2)), collapse = " ")?

Thank you for your reply!

Yes, there are always 2 rows per id, with the ids once in id1 and once in id2.
I actually used an inner_join for now, then deleted all the duplicate rows by filtering for id2>id1 (This also works for Strings!). Not sure if that's an ideal / tidy approach, but it works:

inner_join(test, test, by = c("id1" = "id2"), suffix = c("_1", "_2")) %>%
  filter(id2>id1) %>%
  select(-id1_2)

# A tibble: 2 x 6
    id1   id2 something_1 different_1 something_2 different_2
  <dbl> <dbl> <chr>       <chr>       <chr>       <chr>      
1     1     2 nyan        cat         blue        elephant   
2     3     4 squeaky     mouse       hard        rock       

would pivot_wider() be a better choice here?

The most important is that it works and is clear, so I'd say your solution with a join is perfectly fine.

Actually I tried to make it work with pivot_wider and that seemed less readable.

1 Like

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.