Import/convert PDF files in quanteda

Dear All,

I would like to analyse PDF files in RStudio with the quanteda package.

I tried several ways of converting PDF files to the .rda format, but the procedures I followed were quite intricate and did not work. Could you tell me how such a conversion can be performed?

Thank you for your help.

Best regards,

For PDFs that contain actual text (i.e. not scans) I have had success with {pdftools}. For scans and OCR I was told that {tesseract} is a good choice, but I have not tried it personally.

This is a sample of my workflow when using the package:

library(pdftools)
library(stringr)

pages <- pdf_text("path-to-yer-document.pdf") # read the file in as a character vector, one element per page
res <- paste(pages, collapse = "") # paste individual pages together

res <- str_replace_all(res, "\n", " ") # replace newlines with spaces
res <- str_replace_all(res, "\\s+", " ") # replace multiple spaces with a single one

print(res) # look what the cat has brought in!
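Since the question was about quanteda, here is a minimal sketch of how the cleaned text could be handed over to it. This is not something I ran on your files: the sample string below is a made-up stand-in for `res`, and the object names (`corp`, `toks`, `dfmat`) are just my own choices.

```r
library(quanteda)

# Stand-in for the cleaned text; in practice this would be `res` from pdf_text()
res <- "Text mining is exciting stuff. Text mining with quanteda starts from a corpus."

corp <- corpus(res)                          # one document per element of the character vector
toks <- tokens(corp, remove_punct = TRUE)    # tokenise and drop punctuation
toks <- tokens_remove(toks, stopwords("en")) # drop English stopwords
dfmat <- dfm(toks)                           # document-feature matrix (lowercased by default)

topfeatures(dfmat, 5)                        # most frequent features
```

As far as I can tell there is no need to convert anything to .rda first, because `corpus()` accepts a plain character vector. If you want to keep the result between sessions, `save(corp, file = "my-corpus.rda")` should do (the file name is only an example).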

It works! Thank you.

Glad to be of service! Text mining is exciting stuff...

And if my answer solved your issue, would you mind marking it as solved? I would get brownie points in the membership league :)
