Finding All Combinations of Parent/Child IDs

joels · February 25, 2019, 5:11am

I think what you're looking for can be described as a graph where the graph edges are the sequence of connections from successive parents and children. The code below uses the igraph package to get the desired result.

library(igraph)
library(tidyverse)

df <- data.frame(pid = c(1,2,3,5),
                 cid = c(3,4,5,6))

wanted_df <- data.frame(pid = c(1, 1, 1, 2, 3, 3, 5),
                        cid = c(3, 5, 6, 4, 5, 6, 6))

df_g = graph_from_data_frame(df, directed=TRUE)

wanted_df2 = map(V(df_g), ~ names(subcomponent(df_g, .x, mode="out"))) %>% 
  # Convert the list output to a data frame
  map_df(~data.frame(cid=.x), .id="pid") %>% 
  # Get rid of rows where `pid` and `cid` are equal
  filter(pid != cid) %>% 
  # Convert columns (from character) to numeric. Not necessary, but this makes the columns the same mode as the columns in `wanted_df`.
  mutate(pid=as.numeric(pid),
         cid=as.numeric(cid))

wanted_df2
#>   pid cid
#> 1   1   3
#> 2   1   5
#> 3   1   6
#> 4   2   4
#> 5   3   5
#> 6   3   6
#> 7   5   6

identical(wanted_df, wanted_df2)
#> [1] TRUE

^{Created on 2019-02-24 by the reprex package (v0.2.1)}

I haven't spent much time with the igraph package so I suspect my code is clunkier than it needs to be, but here's the general idea:

Convert df to a graph: graph_from_data_frame(df, directed=TRUE).
For a given vertex, get all vertices that can be reached from that vertex. We can see the graph we just created with plot(df_g).

00%20PM1414×1124 41.2 KB

Note that the plot shows visually that we can reach vertices 3, 5, and 6 from vertex 1; we can reach vertices 5 and 6 from vertex 3; etc. These are the relationships we wanted to capture by encoding your data as a graph.

To get all vertices that can be reached from a given vertex, we use the subcomponent function. Here is the result from running subcomponent for vertex 1:
```
subcomponent(df_g, 1, mode="out")
+ 4/6 vertices, named, from 9c56ae3:
[1] 1 3 5 6
```
Note that this returns c(1,3,5,6), which is what we wanted. You can reach vertices 3, 5, and 6 from vertex 1 (and you can also of course reach vertex 1 from vertex 1).
Now we need to run the subcomponent function on every vertex in the graph df_g. We do that with the map function: map(V(df_g), ~ names(subcomponent(df_g, .x, mode="out"))).

This returns a list containing the output of subcomponent for each vertex. The rest of the code is just to extract and reshape the output into a data frame. igraph functions seem to return complex objects whose structure and behavior I don't really understand, which is why I expect my code can be simplified with a little more igraph knowledge.