I have a tibble with two columns: an id string and a list column of numeric vectors.
Naively, I can create a dendrogram by row-binding the list with do.call and then computing the distance matrix.
    library(tibble)

    # data (read in via MongoDB, i.e. mongolite); use runif for random samples
    t <- tibble(id = c("123", "234", "345", "456"),
                l = list(runif(3), runif(3), runif(3), runif(3)))

    # combine list into data.frame
    z <- do.call(rbind.data.frame, t$l)

    # run dendrogram
    dend <- as.dendrogram(hclust(dist(z)))
However, this approach scales terribly. My original dataset is 380k rows with each record matched to a vector of 512 elements (numeric).
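To put a number on "scales terribly", here is a rough back-of-the-envelope estimate, assuming `dist()` stores the lower triangle as 8-byte doubles (variable names are only for illustration). The result works out to roughly 578 GB (about 538 GiB) for the distance object alone, before `hclust()` does any work.

    # Rough memory estimate for dist() on the full dataset (illustrative only)
    n <- 380000                  # number of rows in my real data
    n_pairs <- n * (n - 1) / 2   # pairwise distances kept in the lower triangle
    bytes <- n_pairs * 8         # assuming one 8-byte double per distance
    gib <- bytes / 1024^3        # convert to GiB
    print(round(gib, 1))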
What is an efficient and tidy (e.g., purrr and/or list columns) approach?
I understand there are many ways of converting a list column to a data.frame/tibble; however, I'm wondering whether it would be better to avoid data.frames/tibbles altogether and keep the data in lists to speed up the calculation.
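To illustrate what I mean by staying out of data.frames, here is a minimal sketch that stacks the vectors into a plain numeric matrix instead. It assumes every vector has the same length; `ids` and `vecs` are stand-ins for my two columns. I have not verified that this helps at full scale, and it only removes the data.frame conversion: the pairwise distance step is still quadratic in the number of rows.

    # Sketch: bind the list into a numeric matrix rather than a data.frame
    set.seed(1)
    ids  <- c("123", "234", "345", "456")
    vecs <- list(runif(3), runif(3), runif(3), runif(3))  # stand-in for the list column

    # one row per id; base rbind on numeric vectors returns a matrix
    m <- do.call(rbind, vecs)
    rownames(m) <- ids           # keep the ids as leaf labels

    # same clustering pipeline as before, now on a matrix
    dend <- as.dendrogram(hclust(dist(m)))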