Converting PDFs to Text

swaheera · July 31, 2021, 4:30am

I am using the "tesseract" library in R to convert PDF files into text, as shown here: Using the Tesseract OCR engine in R

library(pdftools)
library(tesseract)

pngfile <- pdftools::pdf_convert('myfile_1.pdf', dpi = 600)
text <- tesseract::ocr(pngfile)
cat(text)

The above code works perfectly. Now I am trying to import a large number of PDF files and convert them all into text. So far, I have only figured out how to do this manually:

# import and convert 1st file
pngfile_1 <- pdftools::pdf_convert('myfile_1.pdf', dpi = 600)
text_1 <- tesseract::ocr(pngfile_1)

# import and convert 2nd file (note: the files do not have the same naming convention)
pngfile_2 <- pdftools::pdf_convert('second_file.pdf', dpi = 600)
text_2 <- tesseract::ocr(pngfile_2)

etc

I copied and pasted the above code 50 times (changing the "index" each time, i.e. pngfile_i, text_i) and was able to accomplish what I wanted. However, I am looking for a more "automatic" way to import and convert all the PDF files.

At the moment, all my pdf files are in the following folder:

"C:/Users/me/Documents/mypdfs"

I found the following code which can be used to "mass import" pdf files into R:

library(dplyr)
library(data.table)


tbl_fread <- 
    list.files(pattern = "*.pdf") %>% 
    map_df(~fread(.))

But I am not sure how to instruct this code to import all PDFs from the correct directory ("C:/Users/me/Documents/mypdfs"). I also don't know how to instruct R to "rename" each imported PDF as "pdf_1, pdf_2, etc."

If all the pdf files were correctly imported and created, I could then write a "loop" and execute the desired commands, e.g.

# "n" would be the total number of pdf files 

for (i in 1:n)
{
pngfile_i <- pdftools::pdf_convert('myfile_i.pdf', dpi = 600)
text_i <- tesseract::ocr(pngfile_i)
}

Can someone please show me how to do this?

Thanks

andresrcs · August 1, 2021, 4:35pm

This would be the general code pattern:

library(tesseract)
library(pdftools)
library(tidyverse)

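# List every PDF in the folder; "$" anchors the match to the end of the name
list.files(path = "path/to/your/files",
           pattern = "\\.pdf$",
           full.names = TRUE) %>% 
    set_names() %>%                      # name each element after its own path
    map_dfr(.f = ~ {
        pdf_convert(.x, dpi = 600) %>%   # render each page to a PNG
            ocr() %>%                    # OCR the PNG(s), one string per page
            as_tibble()
    }, .id = "file_name")                # keep the source path as a column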

#> Converting page 1 to pdf_1_1.png... done!
#> Converting page 1 to pdf_2_1.png... done!
#> # A tibble: 2 x 2
#>   file_name   value                    
#>   <chr>       <chr>                    
#> 1 ./pdf_1.pdf "This is a pdf\n"        
#> 2 ./pdf_2.pdf "This is a another pdf\n"
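
For anyone who prefers the explicit loop sketched in the question, roughly the same thing can be done with a plain for loop that collects the text in a named list instead of creating text_1, text_2, and so on. This is only a minimal sketch: the helper name ocr_folder is made up for illustration, and it assumes the intermediate PNG files are not needed afterwards and that the pages of a multi-page PDF should be joined into one string.

```r
# Hypothetical helper (not from the thread): OCR every PDF in a folder and
# return the text as a named list, one element per file.
ocr_folder <- function(pdf_dir, dpi = 600) {
  # Full paths of all files whose names end in ".pdf" (case-insensitive)
  pdf_files <- list.files(path = pdf_dir, pattern = "\\.pdf$",
                          full.names = TRUE, ignore.case = TRUE)

  # Pre-allocate one slot per file, named after the file itself
  texts <- vector("list", length(pdf_files))
  names(texts) <- basename(pdf_files)

  for (i in seq_along(pdf_files)) {
    # pdf_convert() writes one PNG per page and returns their file names
    png_files <- pdftools::pdf_convert(pdf_files[i], dpi = dpi)
    # ocr() returns one string per page; join the pages into one string
    texts[[i]] <- paste(tesseract::ocr(png_files), collapse = "\n")
    # Delete the intermediate PNGs (assumption: they are not needed later)
    file.remove(png_files)
  }
  texts
}

# Example call, using the folder and a file name from the question:
# texts <- ocr_folder("C:/Users/me/Documents/mypdfs")
# cat(texts[["second_file.pdf"]])
```

Because the list is named after the files, there is no need to rename anything to pdf_1, pdf_2, etc.; each result can be looked up by its original file name or by position, e.g. texts[[1]].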

swaheera · August 2, 2021, 4:27am

Thank you for your answer! I figured out an alternate way to solve this problem - would you like to see it?

Thanks again!
