I would like to analyse PDF files in RStudio with the quanteda package.
I tried several ways to convert PDF files to the .rda format, but the procedures I followed were quite intricate and did not work. Could you tell me how such a conversion can be performed?
For PDFs that contain actual text (i.e. not scans) I have had success with {pdftools}. For scans that need OCR I was told that {tesseract} is a good choice, but I have not tried it personally.
This is a sample of my workflow when using the package:
library(pdftools)
library(stringr)
asdf <- pdf_text("path-to-yer-document.pdf") # read the file in as a character vector, one element per page
res <- paste(asdf, collapse = "") # paste individual pages together into one string
res <- str_replace_all(res, "\n", " ") # replace newlines with spaces
res <- str_replace_all(res, "\\s+", " ") # replace multiple spaces with a single one
print(res) # look what the cat has brought in!
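To get from there to quanteda and an .rda file, something along these lines might work. This is only a sketch: the document name "my_pdf", the object name corp and the file name my_corpus.rda are placeholders I made up, and the short string stands in for the res produced above.

```r
library(quanteda)

# Stand-in for the cleaned text from the workflow above; in practice use `res` as is
res <- "This is the cleaned text of a PDF document."

# corpus() accepts a character vector; element names become document names
# ("my_pdf" is a placeholder name)
names(res) <- "my_pdf"
corp <- corpus(res)

# save() writes R objects to an .rda file (path and file name are illustrative)
rda_path <- file.path(tempdir(), "my_corpus.rda")
save(corp, file = rda_path)

# load() restores the object under the name it was saved with, here `corp`
load(rda_path)
toks <- tokens(corp) # tokenise for further analysis in quanteda
```

With several PDFs you could build one string per file the same way, collect them in a single named character vector and pass that to corpus() in one go.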