Hello everyone! I’m new here, nice to meet you!
I’m working on a project where I use text mining to generate sets of keywords from a collection of documents.
I would like to generate sentences (instead of single words) based on the correlation between the keywords that the text mining step produced.
My current code is the following:
# INPUT
library(stm)
data <- read.csv2("DATABASE.csv")
# CSV with custom stop words; fileEncoding = "UTF-8-BOM" keeps the byte-order mark (the "ï.." prefix) out of the first column name
removewords <- read.csv("remove.csv", fileEncoding = "UTF-8-BOM")
# PRE-PROCESSING
processed <- textProcessor(data$ColumnName, metadata = data, customstopwords = as.character(removewords$ColumnNameRemoveCSV), verbose = TRUE)
out <- prepDocuments(processed$documents, processed$vocab, processed$meta, lower.thresh = 5, verbose = TRUE)
docs <- out$documents
vocab <- out$vocab
meta <- out$meta
z <- data[-out$docs.removed, ]
# STM
poliblogPrevFit <- stm(documents = out$documents, vocab = out$vocab, K = 26, max.em.its = 75, data = out$meta, init.type = "Spectral")
write.csv(poliblogPrevFit$theta, file = "Matrix.csv")
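stm returns a matrix (theta, saved here as Matrix.csv) with one row per document and one column per topic. Each value is the proportion of that document assigned to that topic, which is what I call the correlation index below.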
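# Label topics by listing the top words for all topics. Save as a txt file.
labelTopicsAll <- labelTopics(poliblogPrevFit, c(1:26), n = 10)
sink("Topic-keywords.txt", append = FALSE, split = TRUE)
print(labelTopicsAll)  # without a print, nothing is written to the file
sink()  # close the sink so later output goes to the console only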
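To reuse the top words in later code rather than only in the text file, they can be pulled out of the labelTopics result, which stores them as matrices (for example `$prob`) with one row per topic. A minimal sketch, using a made-up 2-topic matrix as a stand-in for `labelTopicsAll$prob`; the words and the name `keywords_by_topic` are hypothetical:

```r
# Hypothetical stand-in for labelTopicsAll$prob:
# topics in rows, top words in columns
top_words <- matrix(
  c("tax", "budget", "vote",
    "health", "care", "insurance"),
  nrow = 2, byrow = TRUE
)

# One character vector of keywords per topic
keywords_by_topic <- split(top_words, row(top_words))
names(keywords_by_topic) <- paste0("topic_", seq_len(nrow(top_words)))

print(keywords_by_topic)
```

With the real model, `top_words` would be replaced by `labelTopicsAll$prob` (or `$frex`, `$lift`, `$score`).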
I want to take the keywords from every topic and build sentences from combinations of them, drawing on the text of the documents most strongly associated with each topic (correlation index > 0.6).
I’m trying to use some code based on a Markov chain.
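The document selection described here could look roughly like the sketch below. It uses a toy matrix in place of `poliblogPrevFit$theta` and a hypothetical `texts` vector that is assumed to be aligned row by row with theta (in the real pipeline this might be the text column of `out$meta`):

```r
# Toy stand-in for poliblogPrevFit$theta: 4 documents x 2 topics
theta <- matrix(c(0.90, 0.10,
                  0.30, 0.70,
                  0.50, 0.50,
                  0.65, 0.35),
                ncol = 2, byrow = TRUE)

# Hypothetical document texts, in the same order as the rows of theta
texts <- c("doc a", "doc b", "doc c", "doc d")

threshold <- 0.6

# For each topic, keep the documents whose topic proportion exceeds the threshold
docs_by_topic <- lapply(seq_len(ncol(theta)), function(k) {
  texts[theta[, k] > threshold]
})
names(docs_by_topic) <- paste0("topic_", seq_len(ncol(theta)))

print(docs_by_topic)
```

Each element of `docs_by_topic` could then serve as the input text for that topic's sentence generator, instead of copying documents into a file by hand.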
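library(dplyr)
library(stringr)
library(textclean)    # replace_contraction()
library(markovchain)
# read data
# Text1.txt contains the documents copied from the output of the previous code.
# Assumption: one document per line; the column is named "text" because the code below expects it.
tlp <- data.frame(text = readLines("Text1.txt"), stringsAsFactors = FALSE)
head(tlp, 10)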
tlp_clean <- tlp %>%
  filter(!str_detect(text, "[/:]"),   # remove lines with certain characters
         !str_detect(text, "http"))   # remove lines with certain string
head(tlp_clean)
tlp_clean <- tlp_clean %>%
  mutate(text = tolower(text) %>%                                    # lowercase sentences
           replace_contraction() %>%                                 # expand contractions
           str_remove_all(pattern = "[0-9]") %>%                     # remove numbers
           str_remove_all(pattern = "[()]") %>%                      # remove specific punctuation
           str_remove_all(pattern = "--") %>%
           str_replace_all(pattern = " - ", replacement = "-") %>%   # replace pattern
           str_replace_all(pattern = "'ve", replacement = " have") %>%
           str_remove(pattern = "[.]") %>%                           # remove first matched pattern
           str_squish())                                             # trim and collapse whitespace (instead of deleting the first space)
# glimpse data; first 10 sentences
head(tlp_clean, 10)
# split sentences into words
text_tlp <- tlp_clean %>%
  pull(text) %>%
  strsplit(" ") %>%
  unlist()
text_tlp %>% head(27)
fit_markov <- markovchainFit(text_tlp)
# first_word must be a word that occurs in text_tlp (a state of the fitted chain)
create_me <- function(first_word, num = 5, n = 3) {
  for (i in 1:num) {
    set.seed(i + 5)
    markovchainSequence(n = n,                            # generate n additional random words
                        markovchain = fit_markov$estimate,
                        t0 = tolower(first_word), include.t0 = TRUE) %>%
      paste(collapse = " ") %>%                           # join generated words with spaces
      print()
  }
}
The code doesn’t work. I need to generate sentences based on the keywords output by the first piece of code.
Does anyone have any ideas?
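For reference, the idea of starting each generated sentence from a topic keyword can be sketched in base R without any package. This is only an illustration on a toy token vector standing in for `text_tlp`; the data and the names `transitions` and `generate_sentence` are hypothetical, and it is a plain first-order chain rather than the markovchain package's fitted model:

```r
# Toy token stream standing in for text_tlp (hypothetical data)
tokens <- c("the", "budget", "vote", "passed",
            "the", "budget", "vote", "failed")

# First-order transitions: for each word, the words observed right after it
transitions <- split(tokens[-1], tokens[-length(tokens)])

# Generate up to n additional words, starting from a seed keyword
generate_sentence <- function(seed, n = 3) {
  if (!seed %in% names(transitions)) {
    stop("seed word is never followed by another word in the text: ", seed)
  }
  words <- seed
  for (i in seq_len(n)) {
    nxt <- transitions[[words[length(words)]]]
    if (is.null(nxt)) break  # dead end: the last word has no observed successor
    words <- c(words, nxt[sample.int(length(nxt), 1)])
  }
  paste(words, collapse = " ")
}

set.seed(1)
sentence <- generate_sentence("budget", n = 3)
print(sentence)
```

Calling `generate_sentence()` once per topic keyword, on the tokens of that topic's most strongly associated documents, would give one candidate sentence per keyword. Keywords that were stemmed by textProcessor may not appear verbatim in the raw text, so the seed check matters.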