Network clustering in R (tidymodels maybe)?

Given I have two columns:

start_node  end_node
--------------------
        x1        x2
        x1        x3
        x2        x4
...

Is there a way to

  1. Cluster the values (x1, x2, x3, etc.)
  2. Find out which values are in which clusters (c1, c2, etc.)?

Visually, it would be something like this:

Where the little dots would be various x values, and 'Lindsey Brown' 'Marion Doyle' etc. would be c1, c2, etc.

So I'm imagining the output to look something like this

 c1  c2  ...
-------
 x1  x3  ...
 x2  x4  ...
... ...

The most common packages for working with graphs are {igraph}, {network}, and {tidygraph}. The best approach would be to load your data into one of those packages, which then offer a number of algorithms for clustering (also called "community detection" in this context). For example, see all the cluster_*() functions in igraph, and the group_*() functions in tidygraph.

In your case it looks like you have a directed network, so some algorithms will not work unless you decide to ignore the directionality. Clustering in general is a hard problem: there is no single best algorithm that always works on every dataset, so you may have to experiment with the existing algorithms.

For example:

library(igraph)
#> 
#> Attaching package: 'igraph'
#> The following objects are masked from 'package:stats':
#> 
#>     decompose, spectrum
#> The following object is masked from 'package:base':
#> 
#>     union
set.seed(123)
df <- data.frame(start_node = paste0("x", sample(1:7, replace = TRUE)),
           end_node = paste0("x", sample(1:7, replace = TRUE))) |>
  dplyr::filter(start_node != end_node)
df
#>   start_node end_node
#> 1         x7       x6
#> 2         x7       x3
#> 3         x3       x5
#> 4         x6       x4
#> 5         x3       x6
#> 6         x2       x6
#> 7         x2       x1

gr <- igraph::graph_from_data_frame(df,
                                    directed = TRUE)
plot(gr)

cluster_spinglass(gr)
#> IGRAPH clustering spinglass, groups: 2, mod: 0.2
#> + groups:
#>   $`1`
#>   [1] "x7" "x3" "x5"
#>   
#>   $`2`
#>   [1] "x6" "x2" "x4" "x1"
#> 



gr <- igraph::graph_from_data_frame(df,
                                    directed = FALSE)
plot(gr)

cluster_louvain(gr)
#> IGRAPH clustering multi level, groups: 3, mod: 0.21
#> + groups:
#>   $`1`
#>   [1] "x7" "x3" "x5"
#>   
#>   $`2`
#>   [1] "x6" "x4"
#>   
#>   $`3`
#>   [1] "x2" "x1"
#> 

Created on 2022-04-29 by the reprex package (v2.0.1)

Thanks 🙂

At the moment, I'm exploring options, so direction would not matter.

What does matter though, is the output I mentioned in the original post. Would it be possible to know which values/columns belong to which clusters?

In my previous example code, the clusters are given by:

#> + groups:
#>   $`1`
#>   [1] "x7" "x3" "x5"
#>   
#>   $`2`
#>   [1] "x6" "x4"
#>   
#>   $`3`
#>   [1] "x2" "x1"
#> 

which means that cluster 1 contains nodes "x3", "x5", and "x7", cluster 2 contains "x4" and "x6", and so on.

To find which cluster a given node belongs to, you can run something like purrr::map_lgl(cluster_list, ~ my_node %in% .x), or get the cluster of every node at once with igraph::membership().
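Here is a minimal sketch of that lookup, rebuilding the toy edge list from the reprex by hand. The names cluster_list, my_node and in_cluster are just placeholders I made up; communities() and membership() are the igraph accessors for a clustering result. Cluster numbering can differ between runs, so don't rely on a particular id.

```r
library(igraph)

# Same toy edge list as in the reprex, typed out by hand
df <- data.frame(
  start_node = c("x7", "x7", "x3", "x6", "x3", "x2", "x2"),
  end_node   = c("x6", "x3", "x5", "x4", "x6", "x6", "x1")
)
gr <- graph_from_data_frame(df, directed = FALSE)

set.seed(123)  # Louvain involves some randomness
cl <- cluster_louvain(gr)

# One character vector of node names per cluster
cluster_list <- communities(cl)

# Which cluster contains a given node?
my_node <- "x4"
in_cluster <- vapply(cluster_list, function(nodes) my_node %in% nodes, logical(1))
names(cluster_list)[in_cluster]

# Node -> cluster mapping for all nodes at once (named vector)
memb <- membership(cl)
memb[["x4"]]
```

The vapply() call does the same thing as the map_lgl() one-liner, just without needing {purrr}.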

This is the output of {igraph}. If you use {tidygraph}, the same clustering algorithms run behind the scenes, but the result is returned as a vector with one element per node, where each element is the cluster that node belongs to.
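A rough {tidygraph} version, assuming the same toy edge list and treating the graph as undirected. The column name cluster is my own choice; group_louvain() is one of the group_*() functions and is meant to be called inside mutate() on the nodes.

```r
library(dplyr)
library(tidygraph)

# Same toy edge list as in the reprex, typed out by hand
df <- data.frame(
  start_node = c("x7", "x7", "x3", "x6", "x3", "x2", "x2"),
  end_node   = c("x6", "x3", "x5", "x4", "x6", "x6", "x1")
)

set.seed(123)  # Louvain involves some randomness
nodes <- as_tbl_graph(df, directed = FALSE) |>
  activate(nodes) |>
  mutate(cluster = group_louvain()) |>  # one cluster id per node
  as_tibble()

nodes
```

The result is a regular tibble with one row per node (columns name and cluster), so you can filter or count it like any other data frame.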

As for a table with one column per cluster, I don't think that's directly possible: most of the time clusters do not all have the same size, so you would end up with a data.frame whose columns have different lengths. That wouldn't be a data.frame anymore, just a normal list, and that's exactly what {igraph} is giving you.
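If you do want something tabular, one option is a "long" data.frame with one row per node. If the wide layout is really needed, a possible workaround (not something igraph does for you) is to pad the shorter clusters with NA. A sketch under those assumptions, with long, groups and wide being names I picked:

```r
library(igraph)

# Same toy edge list as in the reprex, typed out by hand
df <- data.frame(
  start_node = c("x7", "x7", "x3", "x6", "x3", "x2", "x2"),
  end_node   = c("x6", "x3", "x5", "x4", "x6", "x6", "x1")
)
gr <- graph_from_data_frame(df, directed = FALSE)

set.seed(123)  # Louvain involves some randomness
memb <- membership(cluster_louvain(gr))

# Long format: one row per node, with its cluster label
long <- data.frame(
  node    = names(memb),
  cluster = paste0("c", as.integer(memb))
)
long[order(long$cluster, long$node), ]

# Wide format: one column per cluster, shorter columns padded with NA
groups <- split(long$node, long$cluster)
n_max  <- max(lengths(groups))
wide   <- as.data.frame(lapply(groups, function(x) c(x, rep(NA, n_max - length(x)))))
wide
```

The long format is usually easier to join back onto your original data; the wide one is mostly useful for display.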

And sorry, I should have added that at the beginning: while R is great and adds a lot of power, if all you want to do is plot and cluster a graph you might want to consider interactive graph software such as Gephi or Cytoscape.
