Hi everyone,
I am quite new to R programming and have been using it mainly for the bio3d package. I have a very large trajectory file that I have not been able to analyse on my own laptop, but I have access to an HPC cluster with a lot of computational power. I am trying to run a script that can make use of a large amount of RAM spread over several cores.
My script currently reads as:
library(ggplot2)
library(grid)
library(plyr)
library(dplyr)
library(gridExtra)
library(extrafont)
library(bio3d)
library(factoextra) # provides fviz_dend(), used for the dendrogram below
setwd("/home/ucbecla/Scratch")
#get trajectory and pdb
trj <- read.dcd("E_Test.dcd")
pdb <- read.pdb("E_Backbone.pdb")
#coordinates: superpose every frame onto the pdb using the C-alpha atoms
ca.inds <- atom.select(pdb, elety = "CA")
xyz <- fit.xyz(fixed = pdb$xyz, mobile = trj, fixed.inds = ca.inds$xyz, mobile.inds = ca.inds$xyz)
rm(trj)
#pca_1
pc <- pca.xyz(xyz[, ca.inds$xyz], mass = pdb)
#pca cluster by groups
hc <- hclust(dist(pc$z[, 1:2]))
grps <- cutree(hc, k = 6)
dend <- as.dendrogram(hc)
rm(hc)
write.table(grps, "6_clusters.txt", sep="\t")
#Get frame number closest to centre of clusters (in the PC1/PC2 plane)
get_mid <- function(z, clust){
  idx <- which(grps == clust)           # frames belonging to this cluster
  pts <- z[idx, 1:2, drop = FALSE]      # their PC1/PC2 scores
  mid_clust <- colMeans(pts)            # cluster centre
  rel <- sweep(pts, 2, mid_clust)       # subtract the centre column by column
  dist2 <- rel[,1]^2 + rel[,2]^2        # squared distance, using both PCs
  rep_frame <- idx[which.min(dist2)]    # frame index in the full trajectory
  return(rep_frame)
}
#Representative frame for each of the 6 clusters
mid_frames <- sapply(1:6, function(k) get_mid(pc$z, k))
print(mid_frames)
rm(grps)
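#size given in inches so that res = 600 yields a usable image
png("Dendrogram.png", width = 5.67, height = 4.73, units = "in", res = 600)
#print() so the plot is drawn even when the script is run with source()
print(fviz_dend(cut(dend, h = 250)$upper, k = 6, k_colors = c("green", "blue", "magenta", "red", "black", "purple"), type = "rectangle", ylab = "", show_labels = FALSE))
dev.off()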
However, I think it is currently using only a single core of the 36 available, and since I have 720,018 frames, the job takes up all of the memory available to that core (roughly 41.5 GB of RAM). The script usually runs as far as the fit.xyz command and then fails with the error "cannot allocate memory".
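For a sense of scale, the memory involved can be estimated roughly as follows. This is only a back-of-the-envelope sketch: n_atoms is a placeholder assumption (the real atom count of E_Test.dcd is not stated here), and it assumes coordinates are held as 8-byte doubles, which is how R stores numeric matrices. It also estimates the distance matrix that dist() would have to build for hclust(), since that grows with the square of the number of frames.

```r
# Rough memory estimate; n_atoms is a HYPOTHETICAL value, not the real system size.
n_frames <- 720018          # number of frames, from the post
n_atoms  <- 1000            # placeholder assumption
bytes_per_double <- 8

# One copy of the trajectory matrix (frames x 3 * atoms).
# fit.xyz() returns a new matrix, so at least two copies exist at that step.
gib_xyz <- n_frames * n_atoms * 3 * bytes_per_double / 1024^3

# dist() stores n * (n - 1) / 2 pairwise distances as doubles.
n_pairs  <- n_frames * (n_frames - 1) / 2
gib_dist <- n_pairs * bytes_per_double / 1024^3

cat(sprintf("trajectory matrix: %.1f GiB per copy\n", gib_xyz))
cat(sprintf("distance matrix:   %.1f GiB\n", gib_dist))
```

With 720,018 frames the distance matrix alone comes to roughly 2 TB whatever the atom count, so the clustering step is likely to be a bottleneck even if fit.xyz succeeds.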
Would there be a way to make it run over several cores, so that it does not run out of memory?
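One possible direction, sketched below on simulated data only, is to avoid the full distance matrix rather than spread it over cores: kmeans() from base R needs memory proportional to the number of frames, not its square. Note that this swaps hierarchical clustering for k-means, which is a different algorithm and will not give the same groups or a dendrogram. The matrix z and the three cluster centres are made up for illustration and stand in for pc$z[, 1:2].

```r
set.seed(1)

# Simulated stand-in for pc$z[, 1:2]; NOT real trajectory data.
centres <- matrix(c(-10, 0,
                     10, 0,
                      0, 10), ncol = 2, byrow = TRUE)
z <- do.call(rbind, lapply(1:3, function(i) {
  # 1000 points scattered around each hypothetical centre
  sweep(matrix(rnorm(2000), ncol = 2), 2, centres[i, ], "+")
}))

# k-means never builds the n x n distance matrix.
km <- kmeans(z, centers = 3, nstart = 5)

# Representative frame per cluster: the row nearest to the cluster centre.
rep_frames <- sapply(seq_len(nrow(km$centers)), function(cl) {
  idx <- which(km$cluster == cl)
  d2  <- rowSums(sweep(z[idx, , drop = FALSE], 2, km$centers[cl, ])^2)
  idx[which.min(d2)]   # index into the full set of frames
})
print(rep_frames)
```

Whether k-means groups are acceptable in place of the hierarchical ones would need checking against the real data, for example by comparing both methods on a subsample of frames.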
Many thanks,
Christophe