Load language in tesseract library in shinyapps.io

I want to load an additional language for the tesseract package on shinyapps.io.

It works perfectly on my computer, but when I deploy the application to shinyapps.io I get the error:

Please make sure the TESSDATA_PREFIX environment variable is set to the parent directory of your "tessdata" directory.
Failed loading language 'ron'

Is there any way to fix this from R code without using the command line? I do not have access to the command line.

My code is this:

# Reinstall the package if the installed major version is not 4
if (packageVersion("tesseract")$major != 4) {
  install.packages("tesseract")
}
library(tesseract)

# Download the Romanian ("ron") training data if it is not already available
if (!"ron" %in% tesseract_info()$available) {
  tesseract_download(lang = "ron")
}

define_tesseract_engine <- tesseract(language = "ron")

Have you tried calling Sys.setenv(TESSDATA_PREFIX = "/tmp") or perhaps Sys.setenv(TESSDATA_PREFIX = ".")?
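
A minimal sketch of another variant to try: rather than relying only on the environment variable, download the training data into a folder the app can write to and pass that folder explicitly through the `datapath` argument, which both `tesseract_download()` and `tesseract()` accept. The helper name `make_engine` and the use of `tempdir()` are assumptions for illustration, and whether shinyapps.io permits the download at runtime is not confirmed here.

```r
library(tesseract)

# Hypothetical helper (not part of the tesseract package): fetches the
# training data for `lang` into a writable folder and builds an engine from it.
make_engine <- function(lang, datapath = file.path(tempdir(), "tessdata")) {
  # tesseract_download() expects the destination folder to exist already
  dir.create(datapath, recursive = TRUE, showWarnings = FALSE)

  # Download only if the traineddata file is not already in the folder
  trained <- file.path(datapath, paste0(lang, ".traineddata"))
  if (!file.exists(trained)) {
    tesseract_download(lang, datapath = datapath)
  }

  # Tell the engine explicitly where the training data lives
  tesseract(language = lang, datapath = datapath)
}
```

It would be called once at app start-up, for example `ron_engine <- make_engine("ron")`, and the result passed as the `engine` argument of `ocr()`.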

Currently only English training data is present when using tesseract on shinyapps.io. If the above does not allow additional languages to be downloaded, then the install script could be altered to swap tesseract-ocr-eng for tesseract-ocr-all.

Thank you, Josh, for your quick reply.

Unfortunately the first 2 solutions do not work. I am still struggling with your third suggestion, that is altering the install script.

I have the same issue. The first 2 solutions do not work for me either. @marius37 did you figure out a solution?

On shinyapps.io there is a mechanism for installing additional system packages that is not available to rstudio.cloud.

Hello Enric,
I have not figured out a solution, because I have discovered that tesseract version 4.0 is not really controllable, even though it has hundreds of control parameters you can define.
I needed the Romanian language, which uses a Latin alphabet, and after multiple tries I found that the results produced by tesseract 4.0 are the same (at least for my needs) no matter which language I choose.

What bothers me the most is that tesseract 4.0 is unstable, meaning I get slightly different outputs when the same image is processed multiple times. Another annoying thing is that the R version of tesseract 4.0 does not produce bbox dimensions for individual characters (only for words). I need them because the word dimensions produced (with HOCR) are not accurate, and this could definitely be improved with symbol dimensions (especially when trying to read standardised documents).
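
For reference, a minimal sketch of how the word-level boxes can be read in R: `ocr_data()` returns one row per word with `word`, `confidence` and a `bbox` string of the form "x1,y1,x2,y2". The helpers `parse_bbox` and `word_boxes` are hypothetical names used only for this illustration, and the sketch does not provide character-level boxes.

```r
# Hypothetical helper: split "x1,y1,x2,y2" strings into numeric columns
parse_bbox <- function(bbox) {
  coords <- matrix(
    as.numeric(unlist(strsplit(bbox, ",", fixed = TRUE))),
    ncol = 4, byrow = TRUE
  )
  colnames(coords) <- c("x1", "y1", "x2", "y2")
  as.data.frame(coords)
}

# Hypothetical wrapper: run OCR on an image and return words with numeric boxes
word_boxes <- function(image, engine) {
  words <- tesseract::ocr_data(image, engine = engine)
  cbind(words[c("word", "confidence")], parse_bbox(words$bbox))
}
```

With an engine already created, `word_boxes(image_path, engine)` gives a data frame whose box widths and heights can be computed as `x2 - x1` and `y2 - y1`.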
