PDF to CSV conversion

I currently have 18 PDFs (each PDF has 5 to 6 pages).
PDF: each one is a blood report.

Task: I need to read all the PDFs and pull out the required data, e.g. patient name, test name, test reference range.

I need to get all the data from every page of every PDF and convert it to a CSV.

I am not supposed to use extract areas, since that is not a good approach.


Hopefully your PDFs were not generated as images. If they were, screen scraping / OCR is really the only way. Otherwise, the text data is encoded in the PDF - it is just not delimited and can be a pain to parse.

However, R has very good text parsers! The article linked below gives an overview. I have mostly used pdftools with readr, but the tm package looks promising too. Hopefully it helps!

https://medium.com/@CharlesBordet/how-to-extract-and-clean-data-from-pdf-files-in-r-da11964e252e
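
To make the pdftools route a bit more concrete, here is a minimal sketch of how it could go. It is not a drop-in solution: I have not seen your reports, so the two line layouts it matches ("Patient Name : ..." and "test, result, unit, low - high") are assumptions, and the names parse_report and convert_reports are just made up for this example. pdftools::pdf_text() returns one string per page, so every page of every PDF is covered without selecting areas by hand.

```r
# Sketch only: the line layouts matched below are ASSUMPTIONS, not taken from
# the real reports. Inspect the output of pdftools::pdf_text() on one file
# first and adjust the patterns to match.

# Parse the text of one report (one string per page, as pdf_text() returns it)
# into a data frame with one row per recognised test line.
parse_report <- function(pages) {
  lines <- trimws(unlist(strsplit(pages, "\n", fixed = TRUE)))

  # Assumed layout: "Patient Name : Jane Doe"
  name_pat <- "^Patient Name\\s*:\\s*(.+)$"
  name_line <- grep(name_pat, lines, value = TRUE, perl = TRUE)
  patient <- if (length(name_line) > 0) {
    sub(name_pat, "\\1", name_line[1], perl = TRUE)
  } else {
    NA_character_
  }

  # Assumed layout: "<test name> <result> <unit> <low> - <high>"
  test_pat <- "^(.+?)\\s+([0-9.]+)\\s+(\\S+)\\s+([0-9.]+\\s*-\\s*[0-9.]+)$"
  m <- regmatches(lines, regexec(test_pat, lines, perl = TRUE))
  m <- m[lengths(m) == 5]  # keep only lines where the whole pattern matched

  if (length(m) == 0) {
    return(data.frame(patient = character(), test = character(),
                      result = character(), unit = character(),
                      reference_range = character(),
                      stringsAsFactors = FALSE))
  }

  # Column 1 is the full match; columns 2 to 5 are the capture groups.
  fields <- unname(do.call(rbind, m))
  data.frame(patient = patient,
             test = fields[, 2],
             result = fields[, 3],  # kept as text; convert once the format is known
             unit = fields[, 4],
             reference_range = fields[, 5],
             stringsAsFactors = FALSE)
}

# Run the parser over every PDF in a folder and write a single CSV.
# Needs the pdftools package installed.
convert_reports <- function(pdf_dir, out_csv) {
  files <- list.files(pdf_dir, pattern = "\\.pdf$", full.names = TRUE,
                      ignore.case = TRUE)
  rows <- lapply(files, function(f) {
    report <- parse_report(pdftools::pdf_text(f))
    report$source_file <- rep(basename(f), nrow(report))
    report
  })
  write.csv(do.call(rbind, rows), out_csv, row.names = FALSE)
}
```

With the PDFs in a folder called, say, reports, calling convert_reports("reports", "blood_reports.csv") would write one row per matched test line. Lines that do not fit the assumed pattern are silently skipped, so it is worth comparing the row count against a report or two by eye.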
