Extracting Invoice Text From Image (PDF,JPEG,PNG)

Hello all,
I am trying to read a PNG or JPEG image and extract the text from it. The images are invoices, and I want to write the extracted information to an Excel sheet. I reused the code published by this provider:
ExtractTable - API to convert image to excel, extract tables from PDF. Here is the complete code:

## Load required R packages (must be installed first)
install.packages(c("magrittr", "jsonlite", "httr"))
require(magrittr)
require(jsonlite)
require(httr)


# Main Functions

## Parse Server Response
parseResponse <- function(server_resp) {return(fromJSON(content(server_resp, "text", encoding="UTF-8")))}


## Function to Check credits usage
check_credits <- function(api_key) {
  validate_endpoint = 'https://validator.extracttable.com'
  return(content(GET(url = validate_endpoint, add_headers(`x-api-key` = api_key)), as = 'parsed', type = 'application/json'))
}

## Function to Retrieve the result by JobId
retrieve_result <- function(api_key, job_id) {
  retrieve_endpoint = "https://getresult.extracttable.com"
  return(
    GET(
      url = paste0(retrieve_endpoint, "/?JobId=", job_id),
      add_headers(`x-api-key` = api_key)
    )
  )
}


## Function to trigger a file for extraction
proces_file <- function(api_key, filepath) {
  trigger_endpoint = "https://trigger.extracttable.com"
  return (
    POST(
      url = trigger_endpoint,
      add_headers(`Content-Type`="multipart/form-data", `x-api-key` = api_key),
      body = list(input = upload_file(filepath))
    )
  )
}


## Function to extract all tables from the input file
ExtractTable <- function(filepath, api_key) {
  server_response <- proces_file(api_key, filepath)
  parsed_resp = parseResponse(server_response)
  
  
  # Wait for a maximum of 5 minutes to finish the trigger job
  # Retries every 20 seconds
  max_wait_time = 5*60
  retry_interval = 20
  while (parsed_resp$JobStatus == 'Processing' && max_wait_time > 0) {
    max_wait_time = max_wait_time - retry_interval
    print(paste0("Job is still in progress. Let's wait for ", retry_interval, " seconds"))
    Sys.sleep(retry_interval)
    server_response <- retrieve_result(api_key, job_id=parsed_resp$JobId)
    parsed_resp = parseResponse(server_response)
  }
  
  ### Parse the response for tables
  et_tables <- content(server_response, as = 'parsed', type = 'application/json')
  
  all_tables <- list()
  
  if (tolower(parsed_resp$JobStatus) != "success") {
    print("The processing was NOT SUCCESSFUL. Below is the complete response from the server:")
    print(parsed_resp)
    return(all_tables)
  }
  
  ### Convert the extracted tabular JSON data as a dataframe for future use
  ### Each data frame represents one table
  for (i in seq_along(et_tables$Tables)) {
    all_tables[[i]] <- sapply(et_tables$Tables[[i]]$TableJson, unlist) %>% t() %>% as.data.frame()
  }
  
  return(all_tables)
  
} #end of function



# Usage

## Initialize valid API key received from https://extracttable.com
api_key = ""

# Validate or check credits of the API key
credits <- check_credits(api_key = api_key)$usage


input_location = "E:/OCR Test/Test Bill.jpeg"
Excel_location = "E:/OCR Test/"

# Trigger the job for processing and get results as an array of dataframes
# Each data frame represents one table
results <- ExtractTable(api_key = api_key, filepath = input_location)
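library(xlsx)  # provides write.xlsx2 (must be installed first)
# Excel_location already ends with "/", so paste0 avoids the stray space that paste() would insert
output_file <- paste0(Excel_location, "data_all.xlsx")
for (i in seq_along(results)) {
  # The first table creates the workbook; each later table is appended as a new sheet
  write.xlsx2(results[[i]], output_file, row.names = FALSE, sheetName = paste0("Sheet", i), append = i > 1)
}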
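
To make the conversion step inside ExtractTable easier to follow, here is a small offline sketch of what the sapply / unlist / t chain does. The shape of mock_table_json (rows keyed by index, each holding cells keyed by column index) is my assumption about what the API returns in TableJson, and the values are made up:

```r
# Mock of one TableJson entry; the shape and the values are assumptions for illustration
mock_table_json <- list(
  `0` = list(`0` = "Item", `1` = "Price"),
  `1` = list(`0` = "Pen",  `1` = "2.50")
)

# unlist() flattens each row into a character vector, sapply() binds those
# vectors as matrix columns, and t() turns them back into table rows
table_matrix <- t(sapply(mock_table_json, unlist))
table_df <- as.data.frame(table_matrix, stringsAsFactors = FALSE)

print(table_df)
```

Each element of the list returned by ExtractTable should be a data frame of this kind, with one row per table row.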

I have four questions in this regard:

Q.1 We have a web page for uploading the image file to be extracted:
https://forms.pabbly.com/form/share/6BdC-483294

How can I embed Shiny in this page so that it receives the uploaded image and passes its path to the variable input_location in the R code? Also, how can I offer the resulting Excel file for download to the user of the page?
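
What I have in mind is roughly the following. This is only a minimal sketch, assuming the ExtractTable function and api_key from the script above are already defined in the session. The ids "invoice" and "download_xlsx" are names I made up. Whether such an app can be embedded in the Pabbly form page is something I have not verified.

```r
library(shiny)

ui <- fluidPage(
  # Upload control; the accepted extensions are an assumption based on my inputs
  fileInput("invoice", "Upload invoice image",
            accept = c(".png", ".jpg", ".jpeg", ".pdf")),
  downloadButton("download_xlsx", "Download Excel")
)

server <- function(input, output, session) {
  output$download_xlsx <- downloadHandler(
    filename = function() "data_all.xlsx",
    content = function(file) {
      req(input$invoice)
      # input$invoice$datapath is the temporary path of the uploaded file,
      # so it takes the place of input_location in the script above
      results <- ExtractTable(filepath = input$invoice$datapath, api_key = api_key)
      for (i in seq_along(results)) {
        xlsx::write.xlsx2(results[[i]], file, row.names = FALSE,
                          sheetName = paste0("Sheet", i), append = i > 1)
      }
    }
  )
}

# shinyApp(ui, server)  # uncomment to launch the app
```

If no table is extracted, nothing is written to the file and the download fails, so a real app would need to handle that case.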

Q.2 Our customers need the Excel output in a specific template format. How can I arrange the output in the same format as the one in the screenshot below?
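
For context, the kind of solution I imagine is writing each extracted table into a prepared template workbook at a fixed position. Below is a minimal sketch using the openxlsx package. The helper name fill_template is hypothetical, and template_path, start_row and start_col are placeholders rather than our real layout:

```r
library(openxlsx)

# Hypothetical helper: copy one extracted table into a template workbook
fill_template <- function(table_df, template_path, output_path,
                          sheet = 1, start_row = 5, start_col = 2) {
  wb <- loadWorkbook(template_path)
  # colNames = FALSE because the template is assumed to already contain the headers
  writeData(wb, sheet = sheet, x = table_df,
            startRow = start_row, startCol = start_col, colNames = FALSE)
  saveWorkbook(wb, output_path, overwrite = TRUE)
  invisible(output_path)
}

# Example call with placeholder paths:
# fill_template(results[[1]], "E:/OCR Test/template.xlsx", "E:/OCR Test/filled.xlsx")
```

This only places a table at a fixed cell. Mapping individual invoice fields to specific template cells would need extra logic that depends on the template.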

Q.3 For the Arabic language I have the following code:

install.packages("tesseract")
library(tidyverse)
library(tesseract)
tesseract_info()

knitr::include_graphics("E:/OCR Test/Invoice.PNG")

textt3 <- tesseract::ocr(image = "E:/OCR Test/Invoice2.jpeg",
                         engine = tesseract("ara"))
cat(textt3)

How can I combine this Arabic extraction code with the original code posted above?
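
One way I imagine combining the two is to wrap the Arabic OCR in a function that returns a data frame, so that it can be written with the same Excel loop as the API results. This is a minimal sketch, assuming the Arabic training data may not be installed yet. ocr_arabic_lines is a hypothetical helper name, and the output is one line of text per row rather than a real table:

```r
library(tesseract)

# Hypothetical helper: OCR an image in Arabic and return one text line per row
ocr_arabic_lines <- function(image_path) {
  # Download the Arabic training data once if it is missing
  if (!"ara" %in% tesseract_info()$available) {
    tesseract_download("ara")
  }
  text <- ocr(image_path, engine = tesseract("ara"))
  # Split the recognized text into lines and drop the empty ones
  lines <- trimws(unlist(strsplit(text, "\n", fixed = TRUE)))
  data.frame(line = lines[nzchar(lines)], stringsAsFactors = FALSE)
}

# The result can then go through the same Excel loop as the API output:
# results <- list(ocr_arabic_lines("E:/OCR Test/Invoice2.jpeg"))
```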

Q.4 Any suggestions if I don't want to use the provider's service (API token)
and would rather do the work myself to get the same results without an API token?
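
The only approach I can think of without the API is local OCR plus my own table reconstruction. tesseract::ocr_data() returns one row per recognized word, with a bbox column of the form "x1,y1,x2,y2". The sketch below groups such words into text rows by their vertical position. The function name group_words_into_rows, the tolerance of 10 pixels, and the sample words are all my own assumptions, and it does not detect columns:

```r
# Hypothetical helper: group OCR words into text rows using their bounding boxes
group_words_into_rows <- function(words, tol = 10) {
  # Split "x1,y1,x2,y2" into numeric coordinates
  coords <- do.call(rbind, lapply(strsplit(words$bbox, ",", fixed = TRUE), as.numeric))
  words$x1 <- coords[, 1]
  words$y1 <- coords[, 2]
  words <- words[order(words$y1, words$x1), ]
  # Start a new row whenever the top edge jumps by more than tol pixels
  row_id <- cumsum(c(TRUE, diff(words$y1) > tol))
  rows <- split(words, row_id)
  # Within each row, order the words left to right and join them
  vapply(rows, function(r) paste(r$word[order(r$x1)], collapse = " "),
         character(1), USE.NAMES = FALSE)
}

# Made-up words in the format returned by tesseract::ocr_data()
sample_words <- data.frame(
  word = c("Price", "Item", "2.50", "Pen"),
  bbox = c("200,12,260,30", "20,10,70,30", "200,51,250,70", "20,50,60,70"),
  stringsAsFactors = FALSE
)
print(group_words_into_rows(sample_words))
# prints "Item Price" "Pen 2.50"
```

For Arabic invoices, the left-to-right ordering within a row would presumably have to be reversed.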
