Cross validation and memory

memory
parallel


Hello there,

Memory issues here.

I am doing some topic modelling in R (https://en.wikipedia.org/wiki/Topic_model) using the text2vec package (http://text2vec.org/). If you haven’t looked at topic modelling, the only thing you need to know is that it’s a family of algorithms that extract topics from a document-feature matrix by assuming that every document is a mixture of K topics, where each topic is a probability distribution over the features. The point here is that K is a hyperparameter: you need to fix the number of topics in advance, just as you fix the number of clusters in advance in k-means clustering. I want to tune K using cross validation, that is, run 5-fold cross validation for each k in seq(30, 100, by = 10). The error measure computed in the cross validation is perplexity, but this is not important here.

Now, I am noticing that during the computation I sometimes hit the ceiling of my available RAM (64GB). However, as soon as free memory becomes dangerously low (and I start worrying while looking at my watch -n 5 free -m in the terminal…), a large chunk of it seems to get freed, which I surmise is just R’s lazy policy of freeing memory only when needed.
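
For reference, the peak can also be read from inside R instead of from free -m. This is only a minimal sketch: peak_mb is a hypothetical helper name, the "max used" figure reported by gc() is sampled at garbage-collection events and so is approximate, and it only covers the current R process, not any workers forked by mclapply.

```r
# Hypothetical helper: evaluate `expr` and return the approximate peak
# memory (in Mb) that this R process used while doing so.
peak_mb <- function(expr) {
  invisible(gc(reset = TRUE))  # reset the "max used" counters
  force(expr)                  # evaluate the lazily passed expression
  g <- gc()
  sum(g[, ncol(g)])            # last column of gc(): max used, in Mb
}

# Two numeric vectors of 1e7 doubles each (roughly 76 Mb apiece).
peak <- peak_mb({
  x <- rnorm(1e7)
  y <- x * 2
  NULL
})
print(peak)
```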

Still, I was wondering if any of you have general suggestions on how to keep memory under control, or on being efficient with memory in this context, since I am worried I might run out of memory if I’m not careful.

The first problem I have is that (I have been a bit sloppy I guess…) in my code I am reading a large number of documents, turning them into a document-feature matrix using the quanteda package (https://cran.r-project.org/web/packages/quanteda/index.html) and I repeatedly keep modifying the matrix to clean it of various terms, using code like

matrix <- do_something(matrix)

Now, I believe I can free up some memory by using rm() + gc() to get rid of large objects I don’t need. Do you people use this combination? I looked on Stack Overflow and there seem to be different views on whether it makes sense to call gc() or not. Still, even if I manually remove these large objects, I don’t think there’s a way around the pattern

matrix <- do_something(matrix)

that I am using, since there are several sequential text-cleaning operations that I need to perform. I understand that when doing the above, R actually creates a new copy of the matrix object. Would using gc() free up the old copies of the matrix object?
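
A minimal sketch of how this could be checked, using the "used" figures that gc() reports. The names used_mb and do_something are hypothetical stand-ins, and a dense base matrix stands in for the sparse document-feature matrix, so the numbers are only indicative.

```r
# Memory currently in use by this R process, in Mb (after a collection).
used_mb <- function() sum(gc()[, 2])

# Hypothetical stand-in for one cleaning step: returns a modified copy.
do_something <- function(m) m * 2

baseline <- used_mb()

m <- matrix(rnorm(5e6), ncol = 100)   # roughly 38 Mb of doubles
after_create <- used_mb()

# Reassign repeatedly: each step allocates a new matrix, and the previous
# one becomes unreachable as soon as `m` points at the new result.
for (step in 1:5) {
  m <- do_something(m)
}
after_steps <- used_mb()

rm(m)
after_rm <- used_mb()

round(c(created = after_create - baseline,
        after_steps = after_steps - baseline,
        after_rm = after_rm - baseline), 1)
```

If the old copies lingered, after_steps would be several times created; in this sketch it stays at about one matrix, since the unreachable copies are reclaimed at the next collection whether gc() is called explicitly or triggered by R itself.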

Finally, I wonder if the way I do 5-fold cross validation is memory smart. Here is the code:

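library(text2vec)   # LDA, perplexity
library(quanteda)   # ndoc
library(parallel)   # mclapply, detectCores

# Fit an LDA model with `numtopics` topics on `dataset` and return the model object.
fit_lda_model <- function(numtopics, dataset) {
  lda_model <- LDA$new(n_topics = numtopics,
                       doc_topic_prior = 0.1,
                       topic_word_prior = 0.01)
  # fit_transform() updates lda_model in place; the document-topic matrix
  # it returns is not needed here, so it is discarded.
  lda_model$fit_transform(dataset,
                          n_iter = 2000,
                          convergence_tol = 0.001,
                          n_check_convergence = 25)
  lda_model
}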

validate_top <- function(numtopics,
                         in_data,
                         out_data){
  
  fitted_lda <- fit_lda_model(numtopics = numtopics, dataset = in_data)
  
  perpl <- perplexity(out_data, 
                      topic_word_distribution = fitted_lda$topic_word_distribution, 
                      doc_topic_distribution = fitted_lda$transform(out_data))
  perpl
}


compute_models <- function(topics, numfolds = 5, trainingdata){
  
  splitfolds <- sample(1:numfolds, ndoc(trainingdata), replace = TRUE)
  
  perplexities <- matrix(nrow = numfolds, ncol = length(topics))
  
  for (i in 1:numfolds) {
    in_data <- trainingdata[splitfolds != i, ]
    out_data <- trainingdata[splitfolds == i, ]
    
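    # One forked worker per value in `topics`; extra arguments are passed by name.
    perplexities[i,] <- unlist(mclapply(topics, 
                                        validate_top, 
                                        in_data = in_data, 
                                        out_data = out_data,
                                        mc.cores = detectCores()))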
  }
  
  final_models <- mclapply(topics, 
                           fit_lda_model, 
                           dataset = trainingdata, 
                           mc.cores = detectCores())
  
  list(perplexities, final_models)
}

ntop <- seq(30, 120, by = 10)

fm_12grams_measures <- compute_models(ntop, numfolds = 5, trainingdata =  fm_12grams_train)

The code is divided into three functions. fit_lda_model fits the model given a number of topics k and a dataset. validate_top fits a model for a given k on in_data (using fit_lda_model) and computes the error measure (perplexity) on the held-out data (out_data). compute_models splits the training data into five folds and, for each fold i in the loop, runs validate_top for every value of k in parallel (mclapply); it fills a matrix (perplexities) in which each row is a fold and each column is a value of k. The last block of code in compute_models trains a model for each value of K on the entire training dataset and returns the final models; this is because the perplexity measure is not always useful, and I want to be able to inspect the models trained on the full training data for each value of K.
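
Since mclapply forks one worker per core and each worker fits its own model, peak memory grows with the number of models being fitted at the same time. One variant that could be compared against the code above is to flatten the fold-by-k grid into a single task list and cap the number of workers. This is only a sketch under assumptions: cv_grid and score_fold are hypothetical names, score_fold is a dummy stand-in for validate_top so that the example runs without text2vec, the value of cores is arbitrary, and forking requires a Unix-alike system.

```r
library(parallel)

# Hypothetical stand-in for validate_top(): returns a dummy score so the
# sketch runs without text2vec. It is NOT a real perplexity.
score_fold <- function(numtopics, in_data, out_data) {
  mean(out_data) + numtopics / nrow(in_data)
}

cv_grid <- function(topics, data, numfolds = 5, cores = 2) {
  # Balanced fold labels (the original samples labels with replacement).
  folds <- sample(rep_len(seq_len(numfolds), nrow(data)))
  # One task per (fold, k) pair; `fold` varies fastest.
  tasks <- expand.grid(fold = seq_len(numfolds), k = topics)
  scores <- mclapply(seq_len(nrow(tasks)), function(j) {
    held_out <- folds == tasks$fold[j]
    # The subsets are created inside the worker and vanish when it exits.
    score_fold(tasks$k[j],
               data[!held_out, , drop = FALSE],
               data[held_out, , drop = FALSE])
  }, mc.cores = cores)
  # Column-major fill: rows are folds, columns are values of k.
  matrix(unlist(scores), nrow = numfolds, ncol = length(topics),
         dimnames = list(fold = seq_len(numfolds), k = topics))
}

set.seed(1)
toy <- matrix(rnorm(200), ncol = 4)   # toy data in place of the real matrix
# Forking is unavailable on Windows, where mc.cores must be 1.
n_workers <- if (.Platform$OS.type == "windows") 1L else 2L
res <- cv_grid(c(30, 40, 50), toy, numfolds = 5, cores = n_workers)
print(res)
```

With cores set well below detectCores(), at most that many models exist at once, which trades some wall-clock time for a lower memory peak.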

I wonder if this code is memory efficient. I am particularly worried about the for loop inside compute_models, since I am repeatedly creating new datasets with the

    in_data <- trainingdata[splitfolds != i, ]
    out_data <- trainingdata[splitfolds == i, ]

bit. Should I include gc() at the end of the for loop? I would be grateful to anyone who can point out a better way of doing this, or any flaws in the code.

Riccardo.