I want to load a language for the tesseract package on shinyapps.io.
It works perfectly on my computer, but when I deploy the application to shinyapps.io I get the error:
Please make sure the TESSDATA_PREFIX environment variable is set to the parent directory of your "tessdata" directory. Failed loading language 'ron'
Is there any way to do that from R code without using the command line? I do not have access to the command line on shinyapps.io.
My code is this:
if (packageVersion("tesseract") < "4.0") {install.packages("tesseract")} # reinstall the package if the installed version is older than 4.0
library(tesseract)
if (!"ron" %in% tesseract_info()$available) {
tesseract_download(lang = "ron") } # this checks for the "ron" language and downloads it if it is not found
Have you tried calling Sys.setenv(TESSDATA_PREFIX = "/tmp") or perhaps Sys.setenv(TESSDATA_PREFIX = ".")?
Currently only English training data is present when using tesseract on shinyapps.io. If the above does not allow downloading additional languages, then the install script could be altered to swap tesseract-ocr-eng for tesseract-ocr-all.
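Besides the environment variable, the tesseract R package also accepts a datapath argument in both tesseract_download() and tesseract(), so the language file can be kept in a directory the app is allowed to write to. The following is only a sketch: using tempdir() as the writable location is an assumption, and it has not been verified on shinyapps.io.

```r
# Writable directory for the language data.
# tempdir() is an assumption; any directory the app can write to should do.
tessdata_dir <- file.path(tempdir(), "tessdata")
dir.create(tessdata_dir, showWarnings = FALSE, recursive = TRUE)

if (requireNamespace("tesseract", quietly = TRUE)) {
  # Download the Romanian data only if the file is not already there.
  if (!file.exists(file.path(tessdata_dir, "ron.traineddata"))) {
    tesseract::tesseract_download("ron", datapath = tessdata_dir)
  }

  # Build an engine that reads its language data from that directory.
  ron <- tesseract::tesseract(language = "ron", datapath = tessdata_dir)

  # "scan.png" is a placeholder file name, not a file from this thread.
  # text <- tesseract::ocr("scan.png", engine = ron)
}
```

Since the temporary directory is not guaranteed to persist between sessions, the download may run again each time the app starts.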
Hello Enric
I have not figured out a solution, because I have discovered that tesseract version 4.0 is not really controllable, even though it has hundreds of control parameters you can define.
I needed the Romanian language, which uses a Latin alphabet, and after multiple tries I found that the results produced by tesseract 4.0 are the same (at least for my needs) no matter which language I choose.
What bothers me the most is that tesseract 4.0 is unstable, meaning I get slightly different outputs when the same image is processed multiple times. Another annoying thing is that the R version of tesseract 4.0 does not produce bounding box (bbox) dimensions for individual characters, only for words. I need them because the word dimensions produced (with HOCR) are not accurate, and this could definitely be improved with symbol dimensions (especially when trying to read standardised documents).