Extract word frequencies from PDF files

Muath · June 5, 2020, 3:39pm

Hi, I would like some code that helps me get the frequency of specific words in PDF files.
For example, words related to management, such as "board", "risk", etc.

I have got this far, but the output doesn't show the specific words I need.

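library(pdftools)  # provides pdf_text()
library(tm)

# Read every PDF in the working directory; each list element holds one file's pages
files <- list.files(pattern = "pdf$")
files
opinions <- lapply(files, pdf_text)
length(opinions)
lapply(opinions, length)
opinions
corp <- VCorpus(VectorSource(opinions))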
# Build a term-document matrix, keeping terms that appear in at least 10 documents
opinions.tdm <- TermDocumentMatrix(corp,
                                   control =
                                     list(removePunctuation = TRUE,
                                          stopwords = TRUE,
                                          tolower = TRUE,
                                          stemming = TRUE,
                                          removeNumbers = TRUE,
                                          bounds = list(global = c(10, Inf))))
inspect(opinions.tdm[1:50,])
# Remove punctuation using Unicode character properties, then rebuild the matrix
corp <- tm_map(corp, removePunctuation, ucp = TRUE)
opinions.tdm <- TermDocumentMatrix(corp,
                                   control =
                                     list(stopwords = TRUE,
                                          tolower = TRUE,
                                          stemming = TRUE,
                                          removeNumbers = TRUE,
                                          bounds = list(global = c(10, Inf))))
inspect(opinions.tdm[1:10,])
# Terms that occur at least 100 times across all documents
ft <- findFreqTerms(opinions.tdm,
                    lowfreq = 100,
                    highfreq = Inf)
ft

# Counts of those terms per document, then totals sorted from most to least frequent
ft.tdm <- as.matrix(opinions.tdm[ft,])
ft.tdm
sort(apply(ft.tdm, 1, sum), decreasing = TRUE)
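
The code above returns every frequent term rather than a chosen set of words. One possible way to restrict the counts to specific words is the `dictionary` control option of `TermDocumentMatrix` in tm. The following is only a minimal sketch under some assumptions: the two sample texts in `docs` are made up and stand in for the output of `pdf_text`, the `keywords` vector is a hypothetical list to be replaced with your own words, and stemming is switched off so the keywords match exactly as written.

```r
library(tm)

# Stand-in for the text read from the PDFs: one string per document.
# These sample texts are made up for illustration only.
docs <- c("The board reviews risk. Risk management reports to the board.",
          "Management sets strategy; the board approves it.")

corp <- VCorpus(VectorSource(docs))

# Words of interest (hypothetical list; replace with your own).
keywords <- c("board", "risk", "management")

# `dictionary` restricts the matrix to the listed terms only.
# Stemming is left off here so the keywords match as written.
kw.tdm <- TermDocumentMatrix(corp,
                             control = list(removePunctuation = TRUE,
                                            tolower = TRUE,
                                            dictionary = keywords))

# Counts per keyword and document, then totals across all documents.
kw.counts <- as.matrix(kw.tdm)
kw.counts
kw.totals <- sort(rowSums(kw.counts), decreasing = TRUE)
kw.totals
```

If `stemming = TRUE` is kept, as in the original code, the keywords would presumably need to be stemmed as well (for example with `stemDocument(keywords)`), because the matrix then stores stems rather than the full words.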
