Hey everybody, I have been running this through my head multiple times for the past few days and cannot come up with an elegant/tidy conclusion. I have two columns of IDs. Example data:
Every child id can be a parent id. The output should be two columns, where the pid can be a chain of all of its children. From the example, we can see 1 relates to 3 which relates to 5, etc. Those pairs should be their own row. The length of relationships can vary from dataset to dataset.
Are recursive functions best for this? I was trying to part with my ugly while loop for a possible purrr/map way.
Any help and/or direction in the best path for this is greatly appreciated!
I think what you're looking for can be described as a graph where the graph edges are the sequence of connections from successive parents and children. The code below uses the igraph package to get the desired result.
library(igraph)
library(tidyverse)
df <- data.frame(pid = c(1,2,3,5),
cid = c(3,4,5,6))
wanted_df <- data.frame(pid = c(1, 1, 1, 2, 3, 3, 5),
cid = c(3, 5, 6, 4, 5, 6, 6))
df_g = graph_from_data_frame(df, directed=TRUE)
wanted_df2 = map(V(df_g), ~ names(subcomponent(df_g, .x, mode="out"))) %>%
# Convert the list output to a data frame
map_df(~data.frame(cid=.x), .id="pid") %>%
# Get rid of rows where `pid` and `cid` are equal
filter(pid != cid) %>%
# Convert columns (from character) to numeric. Not necessary, but this makes the columns the same mode as the columns in `wanted_df`.
mutate(pid=as.numeric(pid),
cid=as.numeric(cid))
wanted_df2
#> pid cid
#> 1 1 3
#> 2 1 5
#> 3 1 6
#> 4 2 4
#> 5 3 5
#> 6 3 6
#> 7 5 6
identical(wanted_df, wanted_df2)
#> [1] TRUE
Note that the plot shows visually that we can reach vertices 3, 5, and 6 from vertex 1; we can reach vertices 5 and 6 from vertex 3; etc. These are the relationships we wanted to capture by encoding your data as a graph.
To get all vertices that can be reached from a given vertex, we use the subcomponent function. Here is the result from running subcomponent for vertex 1:
Note that this returns c(1,3,5,6), which is what we wanted. You can reach vertices 3, 5, and 6 from vertex 1 (and you can also of course reach vertex 1 from vertex 1).
Now we need to run the subcomponent function on every vertex in the graph df_g. We do that with the map function: map(V(df_g), ~ names(subcomponent(df_g, .x, mode="out"))).
This returns a list containing the output of subcomponent for each vertex. The rest of the code is just to extract and reshape the output into a data frame. igraph functions seem to return complex objects whose structure and behavior I don't really understand, which is why I expect my code can be simplified with a little more igraph knowledge.