Get bounding boxes for symbols (not words) in tesseract R?

marius37 · February 17, 2019, 10:49am

I am using the tesseract engine to detect text in images (scanned PDFs). One important task is locating each word inside a document, and the function ocr_data (from tesseract) does just that: it outputs the words it finds along with their coordinates.
Is there a way to produce the same output for symbols, that is, for every letter it has detected?
For example, for the word hello, ocr_data produces the following output: hello, 0.98, 10,20,60,30. I would like to produce the following output instead: h,0.98,5,20,15,30; e,0.98,6,20,16,30 etc. In Python, the tesseract engine has a method called GetUTF8Text that outputs just that.
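For reference, the Python behaviour mentioned above could look roughly like the sketch below. It assumes the tesserocr wrapper (not the R package) and its result iterator at symbol level. The names symbol_boxes and format_row are hypothetical helpers made up for illustration, and the division by 100 assumes tesserocr reports confidence as a percentage.

```python
from typing import Iterator, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def symbol_boxes(image_path: str) -> Iterator[Tuple[str, float, Box]]:
    """Yield (symbol, confidence, box) for each character Tesseract recognizes."""
    # Imported lazily so format_row below can be used without tesserocr installed.
    from tesserocr import PyTessBaseAPI, RIL, iterate_level

    with PyTessBaseAPI() as api:
        api.SetImageFile(image_path)
        api.Recognize()
        level = RIL.SYMBOL  # iterate per character instead of per word
        for r in iterate_level(api.GetIterator(), level):
            box = r.BoundingBox(level)
            if box is None:  # skip entries that carry no bounding box
                continue
            yield r.GetUTF8Text(level), r.Confidence(level), box


def format_row(symbol: str, confidence: float, box: Box) -> str:
    """Format one result in the style of the rows shown in the question."""
    left, top, right, bottom = box
    # Assumption: confidence arrives on a 0-100 scale, so rescale it to 0-1.
    return f"{symbol},{confidence / 100:.2f},{left},{top},{right},{bottom}"
```

With a hypothetical file scan.png, the rows could then be printed with `for row in symbol_boxes("scan.png"): print(format_row(*row))`.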
Thank you.

maelle · February 26, 2019, 9:04am

@marius37! Your question sounds like a feature request for tesseract, so you might have more luck asking about this in the tesseract issue tracker.

Besides, tesseract is an rOpenSci package, so when you have questions about its use, you should ask them in the rOpenSci forum rather than here.