I am using the tesseract library in R to convert PDF files into text, as shown here: Using the Tesseract OCR engine in R
library(pdftools)
library(tesseract)
pngfile <- pdftools::pdf_convert('myfile_1.pdf', dpi = 600)
text <- tesseract::ocr(pngfile)
cat(text)
The above code works perfectly. Now I am trying to import a large number of PDF files in bulk and convert them into text. So far, I have only figured out how to do this manually:
#import and convert 1st file
pngfile_1 <- pdftools::pdf_convert('myfile_1.pdf', dpi = 600)
text_1 <- tesseract::ocr(pngfile_1)
#import and convert 2nd file (note: the files do not have the same naming convention)
pngfile_2 <- pdftools::pdf_convert('second_file.pdf', dpi = 600)
text_2 <- tesseract::ocr(pngfile_2)
etc
I copied and pasted the above code 50 times (changing the index each time, i.e. pngfile_i, text_i) and was able to accomplish what I wanted. However, I am looking for a more automatic way to import and convert all the PDF files.
At the moment, all my pdf files are in the following folder:
"C:/Users/me/Documents/mypdfs"
I found the following code which can be used to "mass import" pdf files into R:
library(dplyr)
library(data.table)
library(purrr) # map_df() comes from purrr
tbl_fread <-
list.files(pattern = "*.pdf") %>%
map_df(~fread(.))
But I am not sure how to make this code import all the PDFs from the correct directory ("C:/Users/me/Documents/mypdfs"). I also don't know how to make R name each imported PDF "pdf_1", "pdf_2", etc.
If all the PDF files were imported correctly, I could then write a loop to run the desired commands, e.g.:
# "n" would be the total number of pdf files
for (i in 1:n)
{
pngfile_i <- pdftools::pdf_convert('myfile_i.pdf', dpi = 600)
text_i <- tesseract::ocr(pngfile_i)
}
Can someone please show me how to do this?
Thanks
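A minimal sketch of one possible approach, not a definitive implementation. It assumes every PDF of interest sits directly in the folder named above and that pdftools and tesseract are installed. The helper name ocr_pdf_folder is hypothetical (it is not part of either package). Since fread() reads delimited text rather than PDFs, the sketch uses list.files() only to collect the file paths and then applies the same pdf_convert() and ocr() calls from the manual version to each path.

```r
# Hypothetical helper (not part of pdftools or tesseract):
# OCR every PDF found directly inside `pdf_dir`.
ocr_pdf_folder <- function(pdf_dir, dpi = 600) {
  # full.names = TRUE returns paths that include pdf_dir, so the current
  # working directory does not matter. `pattern` is a regular expression,
  # not a glob, hence "\\.pdf$" rather than "*.pdf".
  pdf_files <- list.files(pdf_dir, pattern = "\\.pdf$",
                          full.names = TRUE, ignore.case = TRUE)

  texts <- lapply(pdf_files, function(pdf) {
    # pdf_convert() writes one PNG per page and returns the PNG file names.
    pngs <- pdftools::pdf_convert(pdf, format = "png", dpi = dpi)
    # ocr() returns one string per image; join the pages of this PDF.
    paste(tesseract::ocr(pngs), collapse = "\n")
  })

  # Name the results pdf_1, pdf_2, ... as described in the question.
  names(texts) <- sprintf("pdf_%d", seq_along(texts))
  texts
}
```

Under these assumptions, texts <- ocr_pdf_folder("C:/Users/me/Documents/mypdfs") would return a named list, and cat(texts[["pdf_1"]]) would print the text of the first file. A list is used instead of separate text_1, text_2, ... variables because it can be indexed inside a loop. Note that pdf_convert() writes its PNG files to the current working directory by default.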