findFreqTerms: return the top n terms per cluster (working with DTM / Corpus)

Hello,

I'm using a DTM (document-term matrix) with kmeans for text clustering / topic discovery. The current output is alright, but I'd like to control how many words are shown per cluster.

This is my current code:

#Reading data into DTM
dtm <- DocumentTermMatrix(test)
dtm <- removeSparseTerms(dtm,.999)
dtm_weighted <- weightTfIdf(dtm)
matrix <- as.matrix(dtm_weighted)
rownames(matrix) <- 1:nrow(matrix)
#Scaling each document (row) to unit Euclidean length
norm_eucl <- function(matrix)
  matrix/apply(matrix, 1, function(x) sum(x^2)^.5)
matrix_norm <- norm_eucl(matrix)
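#Clustering: 4 centers, at most 40 iterations.
#Note: na.omit() drops any all-zero document (its row is NaN after scaling),
#so the cluster labels can end up shorter than the rows of dtm_weighted.
results <- kmeans(na.omit(matrix_norm), 4, 40)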
clusters <- 1:4
for (i in clusters) {
  cat("Cluster ", i, ":", findFreqTerms(dtm_weighted[results[["cluster"]]==i,],lowfreq = 1),"\n")
}

The current output looks something like this:

Cluster 1: w1 w2
Cluster 2: x1
Cluster 3: y1 y2 y3 y4
Cluster 4: z1 z2 z3 z4 z5 z6 z7

I'd like the output to always show a fixed number of words per cluster, say 5, like this:

Cluster 1: w1 w2 w3 w4 w5
Cluster 2: x1 x2 x3 x4 x5
Cluster 3: y1 y2 y3 y4 y5
Cluster 4: z1 z2 z3 z4 z5

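Solved it myself. Instead of findFreqTerms, I sort each cluster's center (the mean normalized TF-IDF weight of every term in that cluster) in decreasing order and take the first n term names: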

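n <- 5  # number of top terms to print per cluster
for (i in clusters) {
  # Rank the terms by their weight in the cluster center and keep the first n
  cat("Cluster ", i, ":", names(sort(results$centers[i,], decreasing=TRUE))[1:n],"\n")
}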
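
For anyone who wants to try the idea without a corpus, here is a minimal sketch in base R on a made-up matrix. The toy data, the term names, and the helper top_terms() are all hypothetical and not part of my real code; the starting centers are passed explicitly only so the toy run is reproducible.

```r
# Toy stand-in for the TF-IDF matrix: 6 documents x 6 terms (made-up values).
m <- matrix(
  c(4, 2, 1, 0, 0, 0,
    3, 2, 1, 0, 0, 0,
    4, 3, 1, 0, 0, 0,
    0, 0, 0, 1, 2, 4,
    0, 0, 0, 1, 2, 3,
    0, 0, 0, 1, 3, 4),
  nrow = 6, byrow = TRUE,
  dimnames = list(NULL, c("apple", "pear", "plum", "bolt", "nut", "gear"))
)

# Scale each document (row) to unit Euclidean length.
m_norm <- m / sqrt(rowSums(m^2))

# Start from documents 1 and 4 so the result does not depend on a random seed.
km <- kmeans(m_norm, centers = m_norm[c(1, 4), ], iter.max = 40)

# Hypothetical helper: the n highest-weighted terms of each cluster center.
top_terms <- function(centers, n = 5) {
  lapply(seq_len(nrow(centers)), function(i) {
    ranked <- sort(centers[i, ], decreasing = TRUE)
    head(names(ranked), n)  # head() returns fewer names rather than NA padding
  })
}

top <- top_terms(km$centers, n = 2)
for (i in seq_along(top)) {
  cat("Cluster ", i, ":", top[[i]], "\n")
}
```

On this toy data the first cluster lists apple and pear and the second lists gear and nut. Using head() rather than [1:n] is a small design choice: if the matrix has fewer than n terms, [1:n] pads the result with NA, while head() just returns what is there.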
