Objective evaluation for determining the number of topics in LDA topic modelling?

Coding_corgi · January 6, 2021, 11:51pm

Hi everyone, happy New Year!

I am currently reading the literature on determining the number of topics (k) for topic modelling with LDA. The best article I have found so far is this:

Zhao, W., Chen, J. J., Perkins, R., Liu, Z., Ge, W., Ding, Y., & Zou, W. (2015). A heuristic approach to determine an appropriate number of topics in topic modeling. BMC Bioinformatics, 16(13), S8. https://doi.org/10.1186/1471-2105-16-S13-S8

I wish to reference this in my thesis, but I'm not sure whether R has a function for computing the rate of perplexity change (RPC), the heuristic the paper uses to estimate the number of topics. Does anyone know how to implement this in R? It seems very similar to using eigenvalues to determine the number of factors in exploratory factor analysis.
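For anyone landing here later, here is a minimal base-R sketch of the RPC heuristic as I read it in Zhao et al. (2015): RPC(i) = |(P_i - P_(i-1)) / (t_i - t_(i-1))|, with the suggested k being the first candidate where RPC(i) < RPC(i+1). The function names `rpc` and `pick_k` are made up for illustration, and the perplexity values are invented, not taken from any real fit. In practice they would come from held-out data, for example via `perplexity()` in topicmodels.

```r
# Rate of perplexity change (RPC) between consecutive candidate topic numbers.
# `topics` and `perplexity` are parallel numeric vectors (hypothetical names).
rpc <- function(topics, perplexity) {
  stopifnot(length(topics) == length(perplexity), length(topics) >= 2)
  ord <- order(topics)
  topics <- topics[ord]
  perplexity <- perplexity[ord]
  data.frame(
    topics = topics[-1],                       # RPC is defined from the 2nd candidate on
    rpc    = abs(diff(perplexity) / diff(topics))
  )
}

# First candidate i with RPC(i) < RPC(i + 1); NA if RPC never increases.
pick_k <- function(rpc_tbl) {
  idx <- which(diff(rpc_tbl$rpc) > 0)
  if (length(idx) == 0) NA_real_ else rpc_tbl$topics[idx[1]]
}

# Invented perplexity values, for illustration only.
k <- c(10, 20, 30, 40, 50)
p <- c(3000, 2500, 2300, 2250, 2150)

tbl <- rpc(k, p)
print(tbl)
print(pick_k(tbl))
```

The paper computes perplexity with cross-validation, so a faithful version would average held-out perplexity over folds before passing it to `rpc()`.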

Any help appreciated.

EDIT: Sincere apologies, all. The ldatuning package (which builds on topicmodels) has functionality for selecting the number of topics, but the code takes a really long time to run. Reference code below.

If anyone else has any ideas to add to this topic (no pun intended!) please feel free to comment.

# Load up R packages including a few we only need later:
library(topicmodels)
library(ldatuning)  # provides FindTopicsNumber() and FindTopicsNumber_plot()
library(doParallel)
library(ggplot2)
library(scales)
library(tidyverse)
library(RColorBrewer)
library(wordcloud)
data("AssociatedPress", package="topicmodels")
full_data <- AssociatedPress

system.time({
tunes <- FindTopicsNumber(
   full_data,
   topics = c(1:10 * 10, 120, 140, 160, 180, 0:3 * 50 + 200),
   metrics = c("Griffiths2004", "CaoJuan2009", "Arun2010"),
   method = "Gibbs",
   control = list(seed = 77),
   mc.cores = 4L,
   verbose = TRUE
)
})

FindTopicsNumber_plot(tunes)

technocrat · January 7, 2021, 1:35am

Here's a timed run on a small subset of the AssociatedPress document-term matrix (the first 10 documents, as in the code below): about 80 seconds with 8 cores, which is only about 10 seconds faster than with 4 cores. Either provision some big iron or find a way to prescreen the metrics, I'd guess.

library(doParallel)
#> Loading required package: foreach
#> Loading required package: iterators
#> Loading required package: parallel
library(ldatuning)
library(tictoc)
library(topicmodels)

data("AssociatedPress")

dtm <- AssociatedPress[1:10, ]
topics <- c(1:10 * 10, 120, 140, 160, 180, 0:3 * 50 + 200)
metrics <- c("Griffiths2004", "CaoJuan2009", "Arun2010")
method <- "Gibbs"

# Time one FindTopicsNumber() run and return its results invisibly
run_find <- function(w, x, y, z) {
  tic()
  tunes <- FindTopicsNumber(
    dtm = w,  # named argument; the original `dtm <- w` assigned inside the call
    topics = x,
    metrics = y,
    method = z,
    control = list(seed = 77),
    mc.cores = 8L,
    verbose = TRUE)
  toc()
  invisible(tunes)
}

run_find(dtm,topics,metrics,method)
#> fit models... done.
#> calculate metrics:
#>   Griffiths2004... done.
#>   CaoJuan2009... done.
#>   Arun2010... done.
#> 78.296 sec elapsed

Created on 2021-01-06 by the reprex package (v0.3.0.9001)
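One way to read "prescreen" is a coarse-to-fine search: fit a few widely spaced values of k with a single metric first, then refine only around the best one. This is a rough sketch and not something tested at scale. The names `coarse`, `best_k`, and `fine` are illustrative, and the 100-document subset, the grid spacing, and `iter = 200` are arbitrary choices to keep the demo quick.

```r
library(topicmodels)
library(ldatuning)

data("AssociatedPress", package = "topicmodels")
dtm <- AssociatedPress[1:100, ]  # small subset, assumption for a quick demo

# Pass 1: coarse grid, one metric only (CaoJuan2009 is minimised).
coarse <- FindTopicsNumber(
  dtm,
  topics   = seq(10, 100, by = 30),
  metrics  = "CaoJuan2009",
  method   = "Gibbs",
  control  = list(seed = 77, iter = 200),
  mc.cores = 2L
)
best_k <- coarse$topics[which.min(coarse$CaoJuan2009)]

# Pass 2: finer grid around the coarse optimum, never below k = 2.
fine <- FindTopicsNumber(
  dtm,
  topics   = seq(max(2, best_k - 20), best_k + 20, by = 10),
  metrics  = "CaoJuan2009",
  method   = "Gibbs",
  control  = list(seed = 77, iter = 200),
  mc.cores = 2L
)

FindTopicsNumber_plot(fine)
```

This fits far fewer models than the full grid above, at the cost of possibly missing a good k that falls between coarse grid points.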
