Finding All Combinations of Parent/Child IDs

Hey everybody, I have been running this through my head multiple times for the past few days and cannot come up with an elegant/tidy conclusion. I have two columns of IDs. Example data:

df <- data.frame(pid = c(1,2,3,5),
                 cid = c(3,4,5,6))
df
#>   pid cid
#> 1   1   3
#> 2   2   4
#> 3   3   5
#> 4   5   6

wanted_df <- data.frame(pid = c(1, 1, 1, 2, 3, 3, 5),
                        cid = c(3, 5, 6, 4, 5, 6, 6))

wanted_df
#>   pid cid
#> 1   1   3
#> 2   1   5
#> 3   1   6
#> 4   2   4
#> 5   3   5
#> 6   3   6
#> 7   5   6

Created on 2019-02-24 by the reprex package (v0.2.1)

Every child id can be a parent id. The output should be two columns, where the pid can be a chain of all of its children. From the example, we can see 1 relates to 3 which relates to 5, etc. Those pairs should be their own row. The length of relationships can vary from dataset to dataset.

Are recursive functions best for this? I was trying to part with my ugly while loop for a possible purrr/map way.

Any help and/or direction in the best path for this is greatly appreciated!

Thanks,

Kyle

I think what you're looking for can be described as a graph where the graph edges are the sequence of connections from successive parents and children. The code below uses the igraph package to get the desired result.

library(igraph)
library(tidyverse)

df <- data.frame(pid = c(1,2,3,5),
                 cid = c(3,4,5,6))

wanted_df <- data.frame(pid = c(1, 1, 1, 2, 3, 3, 5),
                        cid = c(3, 5, 6, 4, 5, 6, 6))

df_g = graph_from_data_frame(df, directed=TRUE)

wanted_df2 = map(V(df_g), ~ names(subcomponent(df_g, .x, mode="out"))) %>% 
  # Convert the list output to a data frame
  map_df(~data.frame(cid=.x), .id="pid") %>% 
  # Get rid of rows where `pid` and `cid` are equal
  filter(pid != cid) %>% 
  # Convert columns (from character) to numeric. Not necessary, but this makes the columns the same mode as the columns in `wanted_df`.
  mutate(pid=as.numeric(pid),
         cid=as.numeric(cid))

wanted_df2
#>   pid cid
#> 1   1   3
#> 2   1   5
#> 3   1   6
#> 4   2   4
#> 5   3   5
#> 6   3   6
#> 7   5   6

identical(wanted_df, wanted_df2)
#> [1] TRUE

Created on 2019-02-24 by the reprex package (v0.2.1)

I haven't spent much time with the igraph package so I suspect my code is clunkier than it needs to be, but here's the general idea:

  1. Convert df to a graph: graph_from_data_frame(df, directed=TRUE).

  2. For a given vertex, get all vertices that can be reached from that vertex. We can see the graph we just created with plot(df_g).

    Note that the plot shows visually that we can reach vertices 3, 5, and 6 from vertex 1; we can reach vertices 5 and 6 from vertex 3; etc. These are the relationships we wanted to capture by encoding your data as a graph.

    To get all vertices that can be reached from a given vertex, we use the subcomponent function. Here is the result from running subcomponent for vertex 1:

    subcomponent(df_g, 1, mode="out")
    + 4/6 vertices, named, from 9c56ae3:
    [1] 1 3 5 6
    

    Note that this returns c(1,3,5,6), which is what we wanted. You can reach vertices 3, 5, and 6 from vertex 1 (and you can also of course reach vertex 1 from vertex 1).

  3. Now we need to run the subcomponent function on every vertex in the graph df_g. We do that with the map function: map(V(df_g), ~ names(subcomponent(df_g, .x, mode="out"))).

    This returns a list containing the output of subcomponent for each vertex. The rest of the code is just to extract and reshape the output into a data frame. igraph functions seem to return complex objects whose structure and behavior I don't really understand, which is why I expect my code can be simplified with a little more igraph knowledge.

5 Likes

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.