Fitting a Markov chain and text mining

Hello everyone! I’m new here, nice to meet you!
I’m working on a project where I use text mining to generate sets of words from a data set of documents.
I would like to generate sentences (instead of words) based on the correlation between the keywords previously generated by text mining.
My current code is the following:

# INPUT
library(stm)
data <- read.csv2("DATABASE.csv")
# Custom stop words. Assuming remove.csv was saved as UTF-8 with a byte-order mark,
# fileEncoding = "UTF-8-BOM" stops the first column name from being read as "ï..ColumnNameRemoveCSV".
removewords <- read.csv("remove.csv", fileEncoding = "UTF-8-BOM")

# PRE-PROCESSING
processed <- textProcessor(data$ColumnName, metadata = data, customstopwords = as.character(removewords$ColumnNameRemoveCSV), verbose = TRUE)
out <- prepDocuments(processed$documents, processed$vocab, processed$meta, lower.thresh = 5, verbose = TRUE)
docs <- out$documents
vocab <- out$vocab
meta <- out$meta

z <- data[-out$docs.removed, ]
# STM
poliblogPrevFit <- stm(documents = out$documents, vocab = out$vocab, K = 26, max.em.its = 75, data = out$meta, init.type = "Spectral")
write.csv(poliblogPrevFit$theta, file = "Matrix.csv")

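STM returns a matrix (theta) with one row per document and one column per topic, giving the proportion of each document attributed to each topic. I treat these proportions as the correlation between topics and documents.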

# Label topics by listing the top words for all topics. Save as a txt file.
labelTopicsAll <- labelTopics(poliblogPrevFit, c(1:26), n = 10)
sink("Topic-keywords.txt", append = FALSE, split = TRUE)
print(labelTopicsAll)
sink()  # stop diverting output and close the file

I want to take the keywords from every topic and build sentences from combinations of them, drawing on the input text of the documents most strongly associated with each topic (correlation index > 0.6).
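A minimal sketch of the document-selection step I have in mind, in base R. The `theta` matrix here is a toy stand-in for `poliblogPrevFit$theta`, and `texts` is a hypothetical vector holding the original document text in the same row order; neither comes from my real data.

```r
# Toy stand-in for poliblogPrevFit$theta: rows = documents, columns = topics.
theta <- matrix(c(0.70, 0.20, 0.10,
                  0.10, 0.85, 0.05,
                  0.30, 0.30, 0.40,
                  0.65, 0.05, 0.30),
                ncol = 3, byrow = TRUE)

# Hypothetical document texts, in the same order as the rows of theta.
texts <- c("first document text", "second document text",
           "third document text", "fourth document text")

threshold <- 0.6

# For each topic, the row indices of documents whose proportion exceeds the threshold.
docs_by_topic <- lapply(seq_len(ncol(theta)), function(k) which(theta[, k] > threshold))

# The matching texts, one character vector per topic (possibly empty).
texts_by_topic <- lapply(docs_by_topic, function(idx) texts[idx])
```

With the real model, `texts` would have to be restricted to the documents that survived pre-processing so that its rows line up with `theta`.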

I’m trying to use code based on a Markov chain.

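# read data
library(dplyr)
library(stringr)
library(textclean)    # provides replace_contraction()
library(markovchain)
# Text1.txt contains the documents copy-pasted from the previous code.
# The column is named `text` because the steps below refer to it;
# header = FALSE assumes the file has no header line.
tlp <- read.delim("Text1.txt", header = FALSE, col.names = "text", stringsAsFactors = FALSE)

head(tlp, 10)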

tlp_clean <- tlp %>%
  filter(!str_detect(text, "[/:]"),   # remove lines with certain characters
         !str_detect(text, "http"))   # remove lines with certain string

head(tlp_clean)

tlp_clean <- tlp_clean %>%
  mutate(text = tolower(text) %>%                                  # lowercase sentences
           replace_contraction() %>%                               # expand contractions
           str_remove_all(pattern = "[0-9]") %>%                   # remove numbers
           str_remove_all(pattern = "[()]") %>%                    # remove specific punctuation
           str_remove_all(pattern = "--") %>%
           str_replace_all(pattern = " - ", replacement = "-") %>% # replace pattern
           str_replace_all(pattern = "'ve", replacement = " have") %>% # keep a space so "i've" becomes "i have"
           str_remove(pattern = "[.]") %>%                         # remove first matched pattern
           str_remove(pattern = " "))                              # remove first matched space

# glimpse data; first 10 sentences
head(tlp_clean, 10)

# split words from sentences
text_tlp <- tlp_clean %>%
  pull(text) %>%
  strsplit(" ") %>%
  unlist()

text_tlp %>% head(27)

fit_markov <- markovchainFit(text_tlp)

# Print `num` sequences, each starting at `first_word` followed by `n` generated words.
# `first_word` must be a state of the fitted chain, i.e. a word that occurs in text_tlp.
create_me <- function(first_word, num = 5, n = 3) {
  for (i in seq_len(num)) {
    set.seed(i + 5)
    markovchainSequence(n = n,   # generate n additional random words
                        markovchain = fit_markov$estimate,
                        t0 = tolower(first_word), include.t0 = TRUE) %>%
      paste(collapse = " ") %>%  # join generated words with spaces
      print()
  }
}

The code doesn’t work. I need to generate sentences based on the keywords output by the first piece of code I wrote.
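To make the goal concrete, here is a minimal base-R sketch of the idea. It is not the markovchain-based code above, just a plain first-order chain built from word pairs. The toy corpus and the `generate_sentence` helper are made up for illustration; in my case `seed_word` would be one of the topic keywords and `words` would come from the most correlated documents.

```r
# Toy corpus standing in for the cleaned words of the selected documents.
words <- strsplit(tolower("the cat sat on the mat and the cat ran"), " ")[[1]]

# For each word, collect the words observed immediately after it.
# Repeated followers are kept, so sampling reflects observed frequencies.
transitions <- split(words[-1], words[-length(words)])

# Hypothetical helper: start at seed_word and append up to n sampled followers.
generate_sentence <- function(seed_word, n = 5, transitions) {
  out <- seed_word
  current <- seed_word
  for (i in seq_len(n)) {
    followers <- transitions[[current]]
    if (is.null(followers)) break  # word never seen with a successor: stop early
    current <- followers[sample.int(length(followers), 1)]
    out <- c(out, current)
  }
  paste(out, collapse = " ")
}

set.seed(1)
generate_sentence("cat", n = 4, transitions = transitions)
```

A keyword that never appears in the corpus, or only as its last word, is returned on its own, so the keywords would need to be checked against the corpus first.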

Does anyone have any ideas?
