Creating dendrogram efficiently from a list of numeric vectors

ryanwesslen · June 1, 2018, 1:53pm

I have a tibble with two columns: an id string and a list column of numeric vectors.

Naively, I can create a dendrogram by row-binding the list with do.call and then computing the distance matrix.

library(tibble)

# data (read in via MongoDB, i.e. mongolite); use runif for random samples
t <- tibble(id = c("123", "234", "345", "456"),
            l = list(runif(3), runif(3), runif(3), runif(3)))

# combine list into data.frame
z <- do.call(rbind.data.frame, t$l) 

# run dendrogram
dend <- as.dendrogram(hclust(dist(z)))

However, this approach scales terribly. My original dataset has 380k rows, with each record matched to a numeric vector of 512 elements.
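To put a rough number on the scaling problem, here is a back-of-the-envelope sketch. It assumes only that dist() stores the n(n-1)/2 pairwise distances as 8-byte doubles; the variable names are illustrative.

```r
n <- 380000                  # rows in the full dataset
n_pairs <- n * (n - 1) / 2   # pairwise distances that dist() has to store
bytes <- n_pairs * 8         # each distance is an 8-byte double
gb <- bytes / 1e9            # size of the dist object in gigabytes
print(gb)
```

If that estimate is right, the distance object alone would be several hundred gigabytes, regardless of how the list is converted beforehand.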

What is an efficient and tidy (e.g., purrr and/or list columns) approach?

I understand there are many ways of converting a list column to a data.frame/tibble; however, I'm wondering whether it would be better to avoid data.frames/tibbles altogether and keep the data in lists to speed up the calculation.
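For illustration, one way to skip the data.frame step is to bind the list straight into a numeric matrix, which dist() accepts directly. This is a minimal sketch on the toy data, not a benchmarked solution. The names dat and m are made up for the example, and set.seed() is there only to make it reproducible.

```r
library(tibble)

set.seed(1)  # reproducible toy data
dat <- tibble(id = c("123", "234", "345", "456"),
              l = list(runif(3), runif(3), runif(3), runif(3)))

# rbind on plain numeric vectors returns a numeric matrix (one row per
# record), avoiding the per-column overhead of rbind.data.frame
m <- do.call(rbind, dat$l)

# row names are carried through dist() and hclust() to the leaf labels
rownames(m) <- dat$id

dend <- as.dendrogram(hclust(dist(m)))
```

This only changes how the input is assembled. The pairwise distance step is still quadratic in the number of rows, so it may not be enough on its own for 380k records.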