Help with grouping and mutating a messy dataset

Hey y'all!

I'm working with a very large, messy dataset and I need some help with a problem beyond my current knowledge.

There are two variables: col A is a list of numeric IDs, col B is a character description. Sometimes col B will contain a reference to a different ID. I already wrote code to extract IDs from col B into col C and convert them to numeric, so they can be matched with col A.

My goal is to take all of the IDs that relate to each other and combine their descriptions into one.

Current dataset:

A B
01 Cat 02
02 Dog
03 Rabbit 04
04 Hamster 03
05 Dolphin 01

Goal:

A B
01 Cat 02 - Dog - Dolphin 01
02 Cat 02 - Dog - Dolphin 01
03 Rabbit 04 - Hamster 03
04 Rabbit 04 - Hamster 03
05 Cat 02 - Dog - Dolphin 01

So far I can extract IDs into col C,

A B C
01 Cat 02 02
02 Dog NA
03 Rabbit 04 04
04 Hamster 03 03
05 Dolphin 01 01

and then col D returns the match in col A

A B C D
01 Cat 02 02 Dog
02 Dog NA NA
03 Rabbit 04 04 Hamster 03
04 Hamster 03 03 Rabbit 04
05 Dolphin 01 Cat 02

I've been using unite, group_by %>% mutate str_flatten, and separate_rows to some success. My problem is I'll end up with rows like

A B
01 Cat 02 - Dog
02 Dog
03 Rabbit 04 - Hamster 03
04 Rabbit 04 - Hamster 03
05 Cat 02 - Dog - Dolphin 01

where I'm not catching all unique descriptions for each item in a group and only some rows will contain all the data. Really I'm having trouble trying to group in general because sometimes there's recursions, sometimes multiple As relate to one C and then multiple Cs will relate to one A...

Does anyone have advice for which functions would be useful in this situation? Or a different approach that would be better? Let me know if I need to provide more information to solve.

This is not the goal, because I am unsure of the logic. This implements constructing a string with the values of the animal, the id of the animal it references, and the animal that references it.

# re-code to clear air; provide meaningful variable names, symbols for id
d <- readr::read_csv("~/Desktop/grist.csv")
#> Rows: 5 Columns: 2
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (1): critter
#> dbl (1): id
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
d
#> # A tibble: 5 × 2
#>      id critter  
#>   <dbl> <chr>    
#> 1     1 cat 2    
#> 2     2 dog      
#> 3     3 rabbit 4 
#> 4     4 hamster 3
#> 5     5 dolphin 1
# cross referenced critter
pat1 <- "[0-9]"
pat2 <- "[A-Za-z]+"
buddies <- stringr::str_extract(d$critter,pat1) |> as.integer()
crits <- stringr::str_extract(d$critter,pat2) 
d$buddy <- buddies
d$critter <- crits
d
#> # A tibble: 5 × 3
#>      id critter buddy
#>   <dbl> <chr>   <int>
#> 1     1 cat         2
#> 2     2 dog        NA
#> 3     3 rabbit      4
#> 4     4 hamster     3
#> 5     5 dolphin     1

# record refers to another record
d$refers_to <- d[d$buddy,][2][[1]]

# record referred to by another record

find_referral <- function(x) d$critter[which(d$refers_to %in% x)]
referrals <- unlist(lapply(d$critter,find_referral))
referrals
#> [1] "dolphin" "cat"     "hamster" "rabbit"

# record not referred to by another record
orphans <- referrals[!(referrals %in% d$refers_to)]

# create column of referring columns
d$referrals <- c(referrals,orphans)
# eliminate "self-references"
d$referrals[which(d$critter == d$referrals)] <- NA

# create suite variable
d$suite <- paste(d$critter,stringr::str_pad(as.character(d$buddy),2,"left","0"),d$refers_to,d$referrals,sep = "-")

d[,c(1,6)]
#> # A tibble: 5 × 2
#>      id suite                    
#>   <dbl> <chr>                    
#> 1     1 cat-02-dog-dolphin       
#> 2     2 dog-NA-NA-cat            
#> 3     3 rabbit-04-hamster-hamster
#> 4     4 hamster-03-rabbit-rabbit 
#> 5     5 dolphin-01-cat-NA

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.