Efficient way to create distance-like matrix from data frame

rpodcast · June 5, 2018, 9:22pm

I'm developing a pipeline in R that involves a user supplying information on how different nodes in a network are connected, and I'm hoping to get back a matrix that shows the same information but in a distance-type layout. Here's a simple example. Suppose the user supplies a data frame that gives the strength of connections between 4 nodes:

library(tidyverse)
edges_set <- tibble::tribble(
  ~from, ~to, ~weight,
  "node1", "node2", 1,
  "node2", "node3", 0.4,
  "node2", "node4", 0.6,
  "node3", "node4", 0.8,
  "node3", "node2", 0.2,
  "node4", "node1", 1
) %>% as.data.frame

The analysis functions I'm using expect a matrix laid out like a distance matrix between the nodes above. My current "brute-force" approach fills in this matrix one edge at a time with a loop:

nodes_names <- paste0("node", 1:4)
# set up weight matrix
m <- matrix(0, nrow=length(nodes_names), ncol=length(nodes_names))
rownames(m) <- colnames(m) <- nodes_names

# populate edge weights in appropriate matrix elements
for (dd in seq_len(nrow(edges_set))) {  # seq_len() is safe when edges_set has zero rows
  row_id <- edges_set[dd, "from"]
  col_id <- edges_set[dd, "to"]
  m[row_id, col_id] <- as.numeric(edges_set[dd, "weight"])
}

The performance is fine for a small network, but typical use cases for the analysis involve many more nodes and therefore many more edges. Is there a vectorized way of converting the edges data frame above (from, to, weight) to the matrix layout? I'm trying to squeeze every bit of performance out of this pipeline.

joels · June 5, 2018, 9:43pm

You can do this with spread:

edge_mat = edges_set %>% 
  spread(to, weight, fill=0) %>% 
  column_to_rownames(var="from") %>% 
  as.matrix

edge_mat

      node1 node2 node3 node4
node1     0   1.0   0.0   0.0
node2     0   0.0   0.4   0.6
node3     0   0.2   0.0   0.8
node4     1   0.0   0.0   0.0

identical(m, edge_mat)

[1] TRUE
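
One caveat with spread: the result is only square when every node shows up in both the from and to columns, which happens to be true for this example. If that isn't guaranteed, base R matrix indexing is an alternative worth trying. Below is a minimal sketch, not benchmarked, that rebuilds the example data so it runs on its own; m2 is just a name introduced here, and duplicate from/to pairs would be overwritten by the last one, same as in the loop.

```r
# Same edge list as in the question, rebuilt so this snippet is self-contained
edges_set <- data.frame(
  from   = c("node1", "node2", "node2", "node3", "node3", "node4"),
  to     = c("node2", "node3", "node4", "node4", "node2", "node1"),
  weight = c(1, 0.4, 0.6, 0.8, 0.2, 1),
  stringsAsFactors = FALSE
)

# Collect every node that appears anywhere, so the matrix is always square
nodes_names <- sort(unique(c(edges_set$from, edges_set$to)))

# Zero-filled matrix with node names on both dimensions
m2 <- matrix(0, nrow = length(nodes_names), ncol = length(nodes_names),
             dimnames = list(nodes_names, nodes_names))

# A two-column character matrix indexes by (row name, column name) pairs,
# so all edges are assigned in one vectorized step
m2[cbind(edges_set$from, edges_set$to)] <- edges_set$weight

m2
```

Nodes with no outgoing or incoming edges still get a row and column of zeros, because the dimension names come from the union of both columns.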

You can also use the igraph package to calculate the shortest distance between nodes, based on the edge weights, regardless of whether they are directly connected. For example:

library(igraph)
g = graph_from_data_frame(edges_set, directed=TRUE)

plot(g, vertex.size=40, edge.width=5*edge.attributes(g)[["weight"]])

[Figure: plot of the directed graph g, with edge widths scaled by weight]

# Distance from a given node to other nodes (sum of weights along shortest path)
distances(g, mode="out")

      node1 node2 node3 node4
node1   0.0   1.0   1.4   1.6
node2   1.6   0.0   0.4   0.6
node3   1.8   0.2   0.0   0.8
node4   1.0   2.0   2.4   0.0
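
If you're building the graph anyway, igraph can also return the direct-connection matrix from the original question. This is a sketch assuming igraph is installed; it uses as_adjacency_matrix with attr = "weight" to fill the cells with edge weights and sparse = FALSE to get an ordinary dense matrix. The name adj is introduced here for illustration.

```r
library(igraph)

# Same edge list as in the question, rebuilt so this snippet is self-contained
edges_set <- data.frame(
  from   = c("node1", "node2", "node2", "node3", "node3", "node4"),
  to     = c("node2", "node3", "node4", "node4", "node2", "node1"),
  weight = c(1, 0.4, 0.6, 0.8, 0.2, 1),
  stringsAsFactors = FALSE
)

g <- graph_from_data_frame(edges_set, directed = TRUE)

# Weighted adjacency matrix: entry [i, j] holds the weight of the edge i -> j,
# and 0 where there is no direct edge
adj <- as_adjacency_matrix(g, attr = "weight", sparse = FALSE)

adj
```

For large, sparsely connected networks it may be worth leaving out sparse = FALSE and checking whether the downstream analysis functions accept the sparse matrix that is returned instead.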

rpodcast · June 5, 2018, 10:34pm

@joels I had no idea tidyr::spread could be used this way! Excellent solution, and it removes one more performance bottleneck from my analyses.