Hey y'all!
I'm working with a very large, messy dataset and I need some help with a problem beyond my current knowledge.
There are two variables: col A is a list of numeric IDs, col B is a character description. Sometimes col B will contain a reference to a different ID. I already wrote code to extract IDs from col B into col C and convert them to numeric, so they can be matched with col A.
My goal is to take all of the IDs that relate to each other and combine their descriptions into one.
Current dataset:
| A | B |
|---|---|
| 01 | Cat 02 |
| 02 | Dog |
| 03 | Rabbit 04 |
| 04 | Hamster 03 |
| 05 | Dolphin 01 |
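In case it helps anyone reproduce this, here is a minimal sketch of the toy data in R. It assumes the `tibble` package; the name `df` is just a placeholder, and since my real col A is numeric, the leading zeros in the table are display-only.

```r
library(tibble)

# Toy version of the sample table above.
# A is numeric, so 01 is stored as 1, 02 as 2, and so on.
df <- tibble(
  A = c(1, 2, 3, 4, 5),
  B = c("Cat 02", "Dog", "Rabbit 04", "Hamster 03", "Dolphin 01")
)

df
```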
Goal:
| A | B |
|---|---|
| 01 | Cat 02 - Dog - Dolphin 01 |
| 02 | Cat 02 - Dog - Dolphin 01 |
| 03 | Rabbit 04 - Hamster 03 |
| 04 | Rabbit 04 - Hamster 03 |
| 05 | Cat 02 - Dog - Dolphin 01 |
So far I can extract IDs into col C:

| A | B | C |
|---|---|---|
| 01 | Cat 02 | 02 |
| 02 | Dog | NA |
| 03 | Rabbit 04 | 04 |
| 04 | Hamster 03 | 03 |
| 05 | Dolphin 01 | 01 |
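Roughly, the extraction step looks like the sketch below. This is a simplified stand-in, not my exact code: it assumes `dplyr` and `stringr`, uses placeholder names (`df`, `df_c`), and only grabs the first run of digits in col B, so a description that mentions several IDs would need something like `str_extract_all()` instead.

```r
library(dplyr)
library(stringr)

# Toy data matching the sample table (A is numeric, so 01 is stored as 1).
df <- tibble::tibble(
  A = c(1, 2, 3, 4, 5),
  B = c("Cat 02", "Dog", "Rabbit 04", "Hamster 03", "Dolphin 01")
)

# Pull the first run of digits out of B and convert it to numeric.
# Rows with no digits in B (e.g. "Dog") get NA.
df_c <- df %>%
  mutate(C = as.numeric(str_extract(B, "\\d+")))

df_c
```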
and then col D returns the description (col B) of the row whose col A matches col C:

| A | B | C | D |
|---|---|---|---|
| 01 | Cat 02 | 02 | Dog |
| 02 | Dog | NA | NA |
| 03 | Rabbit 04 | 04 | Hamster 03 |
| 04 | Hamster 03 | 03 | Rabbit 04 |
| 05 | Dolphin 01 | 01 | Cat 02 |
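The lookup for col D can be sketched with base R's `match()`, as below. Again, this is a simplified illustration with placeholder names (`df_c`, `df_d`) rather than my exact code, and it assumes every ID in col A is unique.

```r
library(dplyr)

# Toy data with C already extracted (see the table above).
df_c <- tibble::tibble(
  A = c(1, 2, 3, 4, 5),
  B = c("Cat 02", "Dog", "Rabbit 04", "Hamster 03", "Dolphin 01"),
  C = c(2, NA, 4, 3, 1)
)

# match(C, A) gives the row position where A equals C (NA if there is none);
# indexing B by that position returns the referenced row's description.
df_d <- df_c %>%
  mutate(D = B[match(C, A)])

df_d
```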
I've been using `unite`, `group_by %>% mutate` with `str_flatten`, and `separate_rows` with some success. My problem is that I end up with rows like this:

| A | B |
|---|---|
| 01 | Cat 02 - Dog |
| 02 | Dog |
| 03 | Rabbit 04 - Hamster 03 |
| 04 | Rabbit 04 - Hamster 03 |
| 05 | Cat 02 - Dog - Dolphin 01 |
where I'm not catching all unique descriptions for each item in a group, and only some rows contain all the data (here only row 05 has the full set, while rows 01 and 02 are incomplete). Really, I'm having trouble grouping in general, because sometimes the references are circular (03 points to 04 and 04 points back to 03), sometimes multiple As relate to one C, and sometimes multiple Cs relate to one A...
Does anyone have advice on which functions would be useful in this situation, or on a different approach that would work better? Let me know if I need to provide more information.