# Efficient way to create distance-like matrix from data frame

#1

I'm developing a pipeline in R that involves a user supplying information on how different nodes in a network are connected, and I'm hoping to get back a matrix that shows the same information but in a distance-type layout. Here's a simple example. Suppose the user supplies a data frame that gives the strength of connections between 4 nodes:

``````
library(tidyverse)
edges_set <- tibble::tribble(
  ~from,   ~to,     ~weight,
  "node1", "node2", 1,
  "node2", "node3", 0.4,
  "node2", "node4", 0.6,
  "node3", "node4", 0.8,
  "node3", "node2", 0.2,
  "node4", "node1", 1
) %>% as.data.frame()
``````

The analysis functions I'm using depend on having a matrix layout similar to a distance matrix between these nodes above. I have a "brute-force" approach to making this matrix that involves an inefficient use of loops:

``````
nodes_names <- paste0("node", 1:4)
# set up weight matrix
m <- matrix(0, nrow = length(nodes_names), ncol = length(nodes_names))
rownames(m) <- colnames(m) <- nodes_names

# populate edge weights in appropriate matrix elements
for (dd in seq_len(nrow(edges_set))) {
  row_id <- edges_set[dd, "from"]
  col_id <- edges_set[dd, "to"]
  m[row_id, col_id] <- as.numeric(edges_set[dd, "weight"])
}
``````

The performance is fine for a small network, but typical use cases for the analysis involve many more nodes and therefore many more edges. Is there a vectorized way to convert the edge data frame above (`from`, `to`, `weight`) to the matrix layout? I'm trying to squeeze every bit of performance out of this pipeline.

#2

You can do this with `spread`:

``````
edge_mat = edges_set %>%
  # one column per "to" node; missing from/to pairs get weight 0
  spread(to, weight, fill = 0) %>%
  column_to_rownames(var = "from") %>%
  as.matrix

edge_mat
``````
``````
      node1 node2 node3 node4
node1     0   1.0   0.0   0.0
node2     0   0.0   0.4   0.6
node3     0   0.2   0.0   0.8
node4     1   0.0   0.0   0.0
``````
``````
identical(m, edge_mat)
``````
``````
[1] TRUE
``````
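If you'd rather not depend on the tidyverse for this step, base R matrix indexing is another vectorized option. This is only a minimal sketch and I haven't benchmarked it; `m_vec` is just an illustrative name, and the edge list is rebuilt with `data.frame` so the snippet runs on its own:

```r
# Same edge list as in the question, built with base R only
edges_set <- data.frame(
  from   = c("node1", "node2", "node2", "node3", "node3", "node4"),
  to     = c("node2", "node3", "node4", "node4", "node2", "node1"),
  weight = c(1, 0.4, 0.6, 0.8, 0.2, 1),
  stringsAsFactors = FALSE
)

nodes_names <- paste0("node", 1:4)

# Zero matrix with node names on both dimensions
m_vec <- matrix(0, nrow = length(nodes_names), ncol = length(nodes_names),
                dimnames = list(nodes_names, nodes_names))

# A two-column character matrix is matched against the row and column names,
# so every (from, to) cell is filled in a single vectorized assignment
m_vec[cbind(edges_set$from, edges_set$to)] <- edges_set$weight

m_vec
```

This fills the same cells as the loop in the question. One difference from `spread`: nodes that never appear in `from` or `to` still get a row and a column here, because the dimensions come from `nodes_names` rather than from the data.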

You can also use the `igraph` package to calculate the shortest distance between nodes, based on the edge weights, regardless of whether they are directly connected. For example:

``````
library(igraph)
g = graph_from_data_frame(edges_set, directed = TRUE)

# edge widths scaled by the "weight" edge attribute
plot(g, vertex.size = 40, edge.width = 5 * E(g)$weight)
``````

``````
# Distance from a given node to other nodes (sum of weights along shortest path)
distances(g, mode = "out")
``````
``````
      node1 node2 node3 node4
node1   0.0   1.0   1.4   1.6
node2   1.6   0.0   0.4   0.6
node3   1.8   0.2   0.0   0.8
node4   1.0   2.0   2.4   0.0
``````
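To make "sum of weights along shortest path" concrete, here is a small base R cross-check using the Floyd-Warshall recurrence. This is not necessarily how igraph computes distances internally; it is only a sketch for this 4-node example, and `w` and `d` are illustrative names:

```r
from   <- c("node1", "node2", "node2", "node3", "node3", "node4")
to     <- c("node2", "node3", "node4", "node4", "node2", "node1")
weight <- c(1, 0.4, 0.6, 0.8, 0.2, 1)

nodes <- paste0("node", 1:4)

# Direct edge weights; Inf marks "no direct edge", 0 on the diagonal
w <- matrix(Inf, nrow = 4, ncol = 4, dimnames = list(nodes, nodes))
diag(w) <- 0
w[cbind(from, to)] <- weight

# Floyd-Warshall: allow each node k in turn as an intermediate stop and
# keep the shorter of the current distance and the detour i -> k -> j
d <- w
for (k in nodes) {
  d <- pmin(d, outer(d[, k], d[k, ], "+"))
}

round(d, 1)
```

For instance, node1 to node3 goes through node2 (1 + 0.4 = 1.4), and node4 to node3 goes through node1 and node2 (1 + 1 + 0.4 = 2.4), matching the `distances()` output above.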

#3

@joels That's an excellent use of `tidyr::spread`; I had no idea it could do this! Great solution, and it removes one more performance bottleneck from my analyses.