Help with parallelizing a BIO3D script

Hi everyone,

I am quite new to R and have been using it with the Bio3D package. I have a very large trajectory file that I cannot analyse on my own laptop, but I do have access to an HPC cluster with plenty of computational power. I am trying to write a script that can make use of large amounts of RAM across several cores.

My script currently reads as:

library(ggplot2)
library(grid)
library(plyr)
library(dplyr)
library(gridExtra)
library(extrafont)
library(bio3d)
library(factoextra)  # provides fviz_dend(), used below
setwd("/home/ucbecla/Scratch")

#get trajectory and pdb
trj <- read.dcd("E_Test.dcd")
pdb <- read.pdb("E_Backbone.pdb")

#coordinates
ca.inds <- atom.select(pdb, elety = "CA")
xyz <- fit.xyz(fixed = pdb$xyz, mobile = trj, fixed.inds = ca.inds$xyz, mobile.inds = ca.inds$xyz)
rm(trj)

#pca_1
pc <- pca.xyz(xyz[, ca.inds$xyz], mass = pdb)

#pca cluster by groups
hc <- hclust(dist(pc$z[, 1:2]))
grps <- cutree(hc, k = 6)
dend <- as.dendrogram(hc)
rm(hc)

write.table(grps, "6_clusters.txt", sep="\t")

#Get frame number closest to centre of clusters
get_mid <- function(z, clust){
  # note: uses 'grps' from the workspace
  # centre of this cluster in the PC1/PC2 plane
  mid_clust <- colMeans(z[grps == clust, 1:2])
  # offset of each cluster member from that centre
  rel <- sweep(z[grps == clust, 1:2], 2, mid_clust)
  # member closest to the centre
  frame <- which.min(rel[, 1]^2 + rel[, 2]^2)
  # convert the within-cluster index back to a frame number in the full trajectory
  rep_frame <- which(grps == clust)[frame]
  return(rep_frame)
}

mid_c1 <- get_mid(pc$z,1)
print(mid_c1)
mid_c2 <- get_mid(pc$z,2)
print(mid_c2)
mid_c3 <- get_mid(pc$z,3)
print(mid_c3)
mid_c4 <- get_mid(pc$z,4)
print(mid_c4)
mid_c5 <- get_mid(pc$z,5)
print(mid_c5)
mid_c6 <- get_mid(pc$z,6)
print(mid_c6)

rm(grps)

png("Dendrogram.png", width = 567, height = 473, res = 600)
# print() is needed for the ggplot object to render inside a non-interactive script
print(fviz_dend(cut(dend, h = 250)$upper, k = 6, k_colors = c("green", "blue", "magenta", "red", "black", "purple"), type = "rectangle", ylab = "", show_labels = FALSE))

dev.off()

However, I think the script is currently using only a single core of the 36 available, and with 720,018 frames it uses up all of the RAM available to it (roughly 41.5 GB). It usually gets as far as the fit.xyz call and then fails with a "cannot allocate memory" error.

Is there a way to make it run across several cores so that it does not run out of memory?

Many thanks,

Christophe

  1. Try adjusting ulimit (see man ulimit in the terminal); a restrictive limit may be preventing the program from using RAM that is actually available.

  2. From a cursory review, the BIO3D package does not provide a function that directly addresses the issue; it does, however, import the parallel package in its namespace, which suggests some of its functions can use multiple cores (see the first sketch after this list).

  3. The CRAN Task View: High-Performance and Parallel Computing with R identifies many tools that may bear on the problem (a minimal example of one shared-memory approach is sketched after this list).

  4. It's safe to call this an advanced topic, and one that would benefit from casting a wider net. Possible sources include the publications of Dirk Eddelbuettel (search his name plus HPC on rseek), the BIO3D maintainers, your HPC system administrator, and the BIO3D issue tracker, where you could ask for an update. Strangely, Stack Overflow has very little on BIO3D at all.
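Regarding point 2, here is a minimal, untested sketch of what that could look like. It rests on two assumptions you should verify first: that your installed bio3d version exposes the ncore argument documented for fit.xyz(), and that it supports the big option documented for read.dcd() (which in turn needs the bigmemory package). Check ?fit.xyz and ?read.dcd before relying on either.

library(bio3d)
library(parallel)

ncores <- detectCores()   # cores on the node, e.g. 36

pdb     <- read.pdb("E_Backbone.pdb")
ca.inds <- atom.select(pdb, elety = "CA")

# big = TRUE (if available in your bio3d build) backs the trajectory with a
# file-based bigmemory matrix rather than an ordinary in-memory matrix,
# which is intended for trajectories too large to hold in RAM at once
trj <- read.dcd("E_Test.dcd", big = TRUE)

# fit.xyz() documents an ncore argument that spreads the superposition over
# several cores via the parallel package; nseg.scale (see ?fit.xyz) can be
# raised if the per-core segments are still too large for memory
xyz <- fit.xyz(fixed = pdb$xyz, mobile = trj,
               fixed.inds = ca.inds$xyz, mobile.inds = ca.inds$xyz,
               ncore = ncores)

Bear in mind that spreading work over cores on the same node does not by itself give you more total RAM; the on-disk (big) route is what would actually reduce the memory footprint, so it is worth testing that part first on a truncated copy of the trajectory.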
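As for point 3, the simplest shared-memory pattern covered by that Task View is already in base R: the parallel package. For example, the six get_mid() calls in your script could run on separate cores with mclapply() (fork-based, so fine on a Linux HPC node). That step is cheap next to the fitting, so treat this purely as an illustration of the pattern; the function name and cluster count are taken from your script above.

library(parallel)

# fork six workers; because mclapply() forks the running session,
# pc and grps are visible to each worker without being exported
mids <- unlist(mclapply(1:6, function(k) get_mid(pc$z, k), mc.cores = 6))
print(mids)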

Hi,

Thanks for the reply. I'll have a look at all of this and hope I can find something.
In a way it's reassuring to know that there doesn't seem to be a very trivial solution that I missed.

Cheers,

Christophe

