Hello there,
Memory issues here.
I am doing some topic modelling in R (https://en.wikipedia.org/wiki/Topic_model) using the text2vec package (http://text2vec.org/). If you haven't looked at topic modelling, the only thing you need to know is that it's a family of algorithms that extract topics from document-feature matrices by assuming that every document can be about a mix of K topics, where each of the K topics is a probability distribution over the features. The point here is that K is a hyperparameter: you need to fix the number of available topics in advance, just as you fix the number of clusters in advance in k-means clustering. I want to tune K using cross-validation, that is, run 5-fold cross-validation for each k in seq(30, 100, by = 10). The error measure computed in the cross-validation is perplexity, but this is not important.
Now, I am noticing that during the computation I sometimes hit the ceiling of the available RAM I have (64GB). However, as soon as the free memory becomes dangerously low (and I start worrying while looking at my watch -n 5 free -m on the terminal...), a large chunk of it seems to get freed, which I surmise is just R's lazy policy of freeing up memory only when needed.
Still, I was wondering if any of you had some general suggestions on how to keep memory under control, or on being generally efficient about memory in this context, since I am worried I might run out of free memory if I'm not cautious about it.
The first problem I have is that (I have been a bit sloppy I guess..) my code reads a large number of documents, turns them into a document-feature matrix using the quanteda package (https://cran.r-project.org/web/packages/quanteda/index.html), and then repeatedly modifies the matrix to clean it of various terms, using code like
matrix <- do_something(matrix)
Now, I believe I can free up some memory using rm() + gc() to get rid of large objects I don't need. Do you people use this combination? I looked on Stack Overflow and there seem to be different views on whether it makes sense to use gc() or not. Still, even if I manually remove these large objects, I don't think there's a way around the pattern
matrix <- do_something(matrix)
that I am using, since there are various sequential text-cleaning operations that I need to perform. I understand that when doing the above R is actually copying the matrix object. Would using gc() free up the old copies of the matrix object?
Finally, I wonder if the way I do 5-fold cross-validation is memory smart. Here is the code:
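To check this myself, I think a toy experiment along these lines would do. It is only a sketch: a plain numeric vector stands in for my matrix, and used_mb is a helper name I made up. It reads the Vcells "(Mb)" figure that gc() reports after a collection.

```r
# Vector-heap memory currently in use (in Mb), measured right after a collection.
used_mb <- function() gc()["Vcells", 2]

baseline <- used_mb()
x <- numeric(1e7)          # toy stand-in for the matrix, roughly 76 Mb
with_x <- used_mb()

x <- x + 1                 # same pattern as matrix <- do_something(matrix)
after_rebind <- used_mb()  # the old vector is no longer referenced here

rm(x)
after_rm <- used_mb()

# Compare the four measurements side by side.
c(baseline = baseline, with_x = with_x,
  after_rebind = after_rebind, after_rm = after_rm)
```

If old copies were piling up, I would expect after_rebind to be roughly double with_x; if they are reclaimed, the two should be close, and after_rm should drop back towards baseline.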
library(text2vec)  # LDA, perplexity
library(quanteda)  # ndoc
library(parallel)  # mclapply, detectCores

# Fit an LDA model with `numtopics` topics on `dataset` and return the model.
fit_lda_model <- function(numtopics, dataset) {
  lda_model <- LDA$new(n_topics = numtopics,
                       doc_topic_prior = 0.1,
                       topic_word_prior = 0.01)
  # fit_transform() updates lda_model in place; the document-topic matrix
  # it returns is not needed here, so it is not stored.
  lda_model$fit_transform(dataset,
                          n_iter = 2000,
                          convergence_tol = 0.001,
                          n_check_convergence = 25)
  lda_model
}

# Fit on in_data and return the perplexity on the held-out out_data.
validate_top <- function(numtopics, in_data, out_data) {
  fitted_lda <- fit_lda_model(numtopics = numtopics, dataset = in_data)
  perplexity(out_data,
             topic_word_distribution = fitted_lda$topic_word_distribution,
             doc_topic_distribution = fitted_lda$transform(out_data))
}

# Cross-validate every value in `topics`, then refit each on the full data.
compute_models <- function(topics, numfolds = 5, trainingdata) {
  # Randomly assign each document to one of the folds.
  splitfolds <- sample(seq_len(numfolds), ndoc(trainingdata), replace = TRUE)
  # One row per fold, one column per number of topics.
  perplexities <- matrix(nrow = numfolds, ncol = length(topics))
  for (i in seq_len(numfolds)) {
    in_data <- trainingdata[splitfolds != i, ]
    out_data <- trainingdata[splitfolds == i, ]
    perplexities[i, ] <- unlist(mclapply(topics,
                                         validate_top,
                                         in_data,
                                         out_data,
                                         mc.cores = detectCores()))
  }
  # Final models, trained on the entire training set.
  final_models <- mclapply(topics,
                           fit_lda_model,
                           dataset = trainingdata,
                           mc.cores = detectCores())
  list(perplexities, final_models)
}

ntop <- seq(30, 120, by = 10)
fm_12grams_measures <- compute_models(ntop, numfolds = 5, trainingdata = fm_12grams_train)
The code is divided into three functions. fit_lda_model just fits the model given a number of topics k and a dataset. validate_top fits a model for a given number of topics k on in_data (using fit_lda_model) and calculates the error measure (perplexity) on the held-out data (out_data). compute_models splits the training data into five folds and, for each run of i in the loop, performs validation for every number of topics in parallel (mclapply) using validate_top; it fills a matrix (perplexities) in which each row corresponds to one of the five runs and each column to a number of topics. The last block of code in compute_models trains the model for each value of K on the entire training dataset and returns the final models; this is because the perplexity measure is not always useful, and I want to be able to inspect the models trained on the full training data for each value of K.
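To make the shape of that matrix concrete, this is roughly how I would read it afterwards. The sketch below uses random placeholder numbers, not real results, and mean_perplexity and best_k are names I made up for illustration.

```r
topics <- seq(30, 120, by = 10)

set.seed(1)
# Placeholder values only: 5 folds (rows) x length(topics) columns.
perplexities <- matrix(runif(5 * length(topics), min = 900, max = 1100),
                       nrow = 5,
                       dimnames = list(paste0("fold", 1:5),
                                       paste0("k", topics)))

# Average over the folds to get one score per number of topics.
mean_perplexity <- colMeans(perplexities)

# Lower perplexity is better, so take the K with the smallest mean.
best_k <- topics[which.min(mean_perplexity)]
```

Since perplexity is not always a reliable guide, I would treat best_k only as a starting point and still inspect the final models by hand.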
I wonder if this code is memory efficient. I am particularly worried about the for loop inside compute_models, since I am repeatedly creating new datasets with this bit:
in_data <- trainingdata[splitfolds != i, ]
out_data <- trainingdata[splitfolds == i, ]
Should I include gc() at the end of the for loop? I will be grateful to anyone who can point out a better way of doing this or flaws in the code.
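For reference, this is roughly the variant of the loop I am considering. It is a toy sketch under my own assumptions: toy_data, toy_score and n_workers are hypothetical stand-ins (a small base R matrix and a dummy scoring function instead of my real data and validate_top). The two changes are dropping the fold subsets at the end of each iteration and capping the number of workers instead of using detectCores(), since as far as I understand each forked worker allocates its own model objects on top of the shared data.

```r
library(parallel)

set.seed(42)
toy_data <- matrix(rpois(1000 * 20, lambda = 1), nrow = 1000)  # toy stand-in
topics   <- c(2, 3, 4)
numfolds <- 5

# Hypothetical stand-in for validate_top(): returns one number per k.
toy_score <- function(k, in_data, out_data) {
  abs(mean(in_data) - mean(out_data)) + 1 / k
}

# Forking is not available on Windows, so fall back to a single worker there.
n_workers <- if (.Platform$OS.type == "windows") 1L else 2L

splitfolds   <- sample(seq_len(numfolds), nrow(toy_data), replace = TRUE)
perplexities <- matrix(NA_real_, nrow = numfolds, ncol = length(topics))

for (i in seq_len(numfolds)) {
  in_data  <- toy_data[splitfolds != i, , drop = FALSE]
  out_data <- toy_data[splitfolds == i, , drop = FALSE]
  perplexities[i, ] <- unlist(mclapply(topics, toy_score, in_data, out_data,
                                       mc.cores = n_workers))
  rm(in_data, out_data)  # drop this fold's subsets before building the next
}
```

My guess is that the rm() call matters little, because the subsets are overwritten on the next iteration anyway, and that the number of workers is the bigger lever, but I have not measured either.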
Riccardo.