Text data information Extraction - OCR in R

sraperia · February 4, 2020, 5:26pm

Hi All,

I need to read 100 PDF documents, extract the text information from each one, and export it to Excel. Each PDF contains various text fields from which I need to build a data table. Below is part of one PDF showing the information I need to extract.

I am doing my job in the company(Employee Id : 12345678)
Name : XXXXX YYYYY
Date of Birth : 12/12/2001
Place : AAAAAAAA
Address: 111, BLOCK 1,
XYZ LOCALITY
BANGKOK
Email id: xyz@yahoo.in

I have to create a column for each field and extract the corresponding values from all the PDFs into Excel.
I am trying to use tesseract and pdf_convert.

yan_lyesin · February 4, 2020, 6:49pm

Hi sraperia,
Some PDFs already contain the text you see in the document as an embedded text layer; you can read and parse those with the excellent pdftools package, available on CRAN. If your PDFs were scanned and OCR was never performed, OCR will be required.
The best way to determine which type of PDF you have is to try selecting text with the text selection tool in Acrobat. If you can select exactly the text you see on screen, you can use pdftools.
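A rough sketch of how the two cases could be handled in one function: try the embedded text first, and fall back to OCR only when nothing comes back. The helper names `has_text_layer` and `read_pdf_text` are made up for illustration, the "no text on any page means scanned" rule is only a heuristic, and `pdf_ocr_text()` needs the tesseract package installed.

```r
# Heuristic (an assumption, not a guarantee): a PDF has a text layer
# if at least one page yields non-whitespace text.
has_text_layer <- function(pages) {
  any(nzchar(trimws(pages)))
}

# Hypothetical helper: return the text of one PDF as a single string.
read_pdf_text <- function(path) {
  # pdf_text() returns one character string per page
  pages <- pdftools::pdf_text(path)
  if (!has_text_layer(pages)) {
    # Scanned PDF: render each page and run tesseract on it.
    # 300 dpi is an arbitrary choice; higher is slower but often more accurate.
    pages <- pdftools::pdf_ocr_text(path, dpi = 300)
  }
  paste(pages, collapse = "\n")
}
```

Calling `read_pdf_text("some_file.pdf")` (file name hypothetical) should then give plain text either way, though OCR output will usually be noisier than an embedded text layer.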
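Once the text is available as a string, the fields from the sample in the question could be pulled out with regular expressions. This is only a sketch in base R: `grab` and `parse_record` are hypothetical helpers, the patterns assume every PDF follows the layout in the sample (for example, that the address ends right before "Email id"), and OCR output may need looser patterns.

```r
# Sample text mirroring the layout shown in the question.
txt <- paste(
  "I am doing my job in the company(Employee Id : 12345678)",
  "Name : XXXXX YYYYY",
  "Date of Birth : 12/12/2001",
  "Place : AAAAAAAA",
  "Address: 111, BLOCK 1,",
  "XYZ LOCALITY",
  "BANGKOK",
  "Email id: xyz@yahoo.in",
  sep = "\n"
)

# Return the first capture group of `pattern` in `text`, or NA if no match.
grab <- function(text, pattern) {
  m <- regmatches(text, regexec(pattern, text, perl = TRUE))[[1]]
  if (length(m) >= 2) trimws(m[2]) else NA_character_
}

# Turn the text of one document into a one-row data frame.
parse_record <- function(text) {
  # (?s) lets "." span line breaks, since the address covers several lines
  address <- grab(text, "(?s)Address\\s*:\\s*(.*?)\\s*Email id")
  data.frame(
    employee_id = grab(text, "Employee Id\\s*:\\s*(\\d+)"),
    name        = grab(text, "Name\\s*:\\s*([^\\n]+)"),
    dob         = grab(text, "Date of Birth\\s*:\\s*([0-9/]+)"),
    place       = grab(text, "Place\\s*:\\s*([^\\n]+)"),
    address     = gsub("\\s*\\n\\s*", " ", address, perl = TRUE),
    email       = grab(text, "Email id\\s*:\\s*(\\S+)"),
    stringsAsFactors = FALSE
  )
}

rec <- parse_record(txt)

# For many files (folder and output names are hypothetical):
# files   <- list.files("pdfs", pattern = "\\.pdf$", full.names = TRUE)
# texts   <- lapply(files, function(f) paste(pdftools::pdf_text(f), collapse = "\n"))
# records <- do.call(rbind, lapply(texts, parse_record))
# writexl::write_xlsx(records, "records.xlsx")
```

Fields that fail to match come back as `NA`, which makes it easy to spot the PDFs whose layout differs from the sample before exporting.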

system · February 25, 2020, 6:49pm

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.