Hello, I am very new to R and I am looking at how to convert a PDF to Excel. Is there any example code I can try?
This is a tougher task than it might seem, since the PDF format is very complicated and its text can't always be extracted with the same spatial relationships we perceive on the page. For instance, copy-pasting from a PDF table often yields garbage.
Here's a blog post walking through one way:
If you've converted the data to an image (e.g. using ImageMagick), you could then perform OCR with Tesseract:
Thank you. I have managed to convert the PDF into images and output a CSV. How would I go about formatting this CSV, e.g. splitting the space-separated values into cells?
Here is my code so far:
library(tesseract)
library(pdftools)

# Render the PDF pages to TIFF images
img_file <- pdftools::pdf_convert("filepath/test.pdf", format = "tiff", dpi = 400)

# Extract text from the images with OCR
text <- ocr(img_file)

# Write the raw OCR text to a file (not yet split into columns)
writeLines(text, "filepath/mydata.csv")
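One possible way to get from the raw OCR text to cells is to split each line on runs of whitespace and write the result with write.csv(). The following is only a rough sketch in base R: the sample `text` string, its column names and the `out_file` path are made up for illustration and stand in for whatever ocr() returns for your document. It also assumes that the first line is a header and that no cell contains a space, which may well not hold for a real table.

```r
# Hypothetical OCR output standing in for the `text` returned by ocr();
# the column names and values are invented for this example.
text <- "Name  Qty   Price\nApple   3   1.20\nPear  10   0.85\n"

# Split the OCR text (one string per page) into lines and drop blank ones
lines <- trimws(unlist(strsplit(text, "\n", fixed = TRUE)))
lines <- lines[nzchar(lines)]

# Split each line into cells on runs of whitespace
cells <- strsplit(lines, "\\s+")

# Pad short rows with NA so every row has the same number of cells
n_cols <- max(lengths(cells))
cells <- lapply(cells, function(x) c(x, rep(NA_character_, n_cols - length(x))))

# Build a data frame, assuming the first line holds the column headers
mat <- do.call(rbind, cells)
df <- as.data.frame(mat[-1, , drop = FALSE], stringsAsFactors = FALSE)
names(df) <- mat[1, ]

# Write a genuinely comma-separated file (a temp file here; use your own path)
out_file <- tempfile(fileext = ".csv")
write.csv(df, out_file, row.names = FALSE)
```

All columns come out as character, so you may want to convert numeric ones afterwards (for example with as.numeric()). If your cells contain spaces, such as multi-word names, this simple split will break them apart and you would need a different rule, e.g. splitting only on two or more spaces.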