I'm currently using R as part of a research project and need to run a "word count" over a list of terms contained in a set of PDF files.
As a new R user, I spent about a week going through YouTube and Google tutorials and tried many different pieces of code. I thought I had finally cracked it, but some inconsistencies remain: the code has trouble picking up two-word terms (those with a space in them, e.g. "computer virus") in some of the PDF files. I assume this has something to do with the cleaning/tokenising part of the code.
Can anyone please help? The code I'm currently using is as follows:
# Loaded from various tutorials; only pdftools and tm are used below
library(pdftools)
library(tm)
library(dplyr)
library(tidytext)
library(readr)
library(ggplot2)
library(quanteda)
# All PDF files in the working directory
files <- list.files(pattern = "\\.pdf$")
files

# Raw text of each PDF (one character vector per file, one element per page)
pdf_texts <- lapply(files, pdf_text)
lapply(pdf_texts, length)  # number of pages per document

# Build a corpus directly from the PDF files
pdfdatabase <- Corpus(URISource(files), readerControl = list(reader = readPDF))
pdfdatabase
# Cleaning: strip punctuation (ucp = TRUE also removes Unicode punctuation)
pdfdatabase <- tm_map(pdfdatabase, removePunctuation, ucp = TRUE)
# Term-document matrix; note that stem = TRUE means the rows are stemmed
# terms (e.g. "encrypt", "hack"), not the original words
all.tdm <- TermDocumentMatrix(pdfdatabase,
                              control = list(stopwords = TRUE,
                                             tolower = TRUE,
                                             stem = TRUE,
                                             removeNumbers = TRUE,
                                             bounds = list(global = c(1, Inf))))
# Look up the counts per document; the multi-word terms below
# ("computer virus", "information security", etc.) are the ones that fail
inspect(all.tdm["encryption", ])
inspect(all.tdm["computer virus", ])
inspect(all.tdm["information security", ])
inspect(all.tdm["computer security", ])
inspect(all.tdm["hacking", ])
inspect(all.tdm["hacker", ])
inspect(all.tdm["denial of service", ])
inspect(all.tdm["cyber-attack", ])
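To show the kind of count I'm after, here is a small base-R sketch that searches raw text directly instead of going through the term-document matrix (`count_phrase` is just a helper name I made up, and the sentence is toy text rather than my actual PDFs):

```r
# Count how often a phrase occurs in a piece of text, case-insensitively.
# count_phrase is a made-up helper for this sketch, not part of tm/pdftools.
count_phrase <- function(text, phrase) {
  hits <- gregexpr(tolower(phrase), tolower(text), fixed = TRUE)[[1]]
  if (hits[1] == -1) 0L else length(hits)  # gregexpr gives -1 when no match
}

count_phrase("A computer virus is not the only computer virus risk.",
             "computer virus")  # finds both occurrences
```

This gives the right answer on plain text, so what I can't figure out is how to get the same per-document counts for two-word terms out of the pipeline above.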