Extracting data from PDF files

Hello,

I am trying to efficiently export data that is locked inside PDF files. I know there are a few ways of doing this, and I am starting off by learning how to use the tabulizer package.

As an example, here is the PDF that I am using:

Skechers Annual Report

During my initial exploration, I tried extracting the table on page 16:

library(tabulizer)

url <- "https://d1io3yog0oux5.cloudfront.net/_2747cc8535dcb78c0cc79115c3968b21/skx/db/437/3006/annual_report/2017_ANNUAL_REPORT.pdf"

test_pg_16 <- extract_tables(url, pages = 16, guess = TRUE, output = "data.frame")
> test_pg_16
[[1]]
                                                                                         X X2016 X.1 X2017 X.2 X2017.1  X.3
1                                                                          Domestic stores    NA  NA    NA  NA      NA     
2  Concept ...............................................................................    NA 117    NA   5      NA  (5)
3     Factory Outlet .....................................................................    NA 163    NA   7      NA  —
4         Warehouse Outlet ...............................................................    NA 133    NA  29      NA  —
5         Domestic stores total ..........................................................    NA 413    NA  41      NA  (5)
6                                                                     International stores    NA  NA    NA  NA      NA     
7  Concept ...............................................................................    NA 101    NA  19      NA  —
8     Factory Outlet .....................................................................    NA  51    NA  16      NA  —
9         Warehouse Outlet ...............................................................    NA   5    NA   4      NA  —
10        International stores total .....................................................    NA 157    NA  39      NA  —

The output is obviously quite messy, but I think I can clean most of it up with some data wrangling in dplyr.
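For example, a rough first pass might look something like this (just a sketch based on the output above; it assumes a recent dplyr for select(where(...)), and the auto-generated column names may differ between runs):

library(dplyr)
library(stringr)

pg_16 <- test_pg_16[[1]]

pg_16_clean <- pg_16 %>%
  # drop the all-NA filler columns that extract_tables() produced
  select(where(~ !all(is.na(.x)))) %>%
  # strip the trailing dot leaders and whitespace from the row labels
  mutate(X = str_trim(str_remove(X, "\\s*\\.+\\s*$")))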

My key question: what are my options if the data is embedded inside the PDF as an image?

For example, the data on page 89 of this annual report appears to be stored as an image, and neither extraction function returns anything:

test_pg_89 <- extract_tables(url, pages = 89, guess = TRUE, output = "data.frame")
# this comes up empty

pg_89_text_only <- extract_text(url, pages = 89)
# this also comes up empty

I realize that this may be a fairly complex problem. I am guessing one would have to extract the images first and then read the text on them via some sort of optical character recognition process. The project I am working on involves a lot of PDF files, and a large number of them store the data as images embedded inside the PDF.
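I am guessing the first step would be something like this (an untested sketch; I am assuming pdftools can render a single page to a PNG, and the dpi value is a guess on my part):

library(pdftools)

# render page 89 to a high-resolution PNG so an OCR engine can read it
pg_89_png <- pdf_convert(url, pages = 89, dpi = 300,
                         filenames = "pg_89.png")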

Any thoughts or pointers to resources would be very helpful. Thanks in advance.

I know this only helps with a small part of your task, but Tesseract works well for OCR. More info here, with an example pulling text from an image: https://ropensci.org/technotes/2018/11/06/tesseract-40/
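For your page 89, a minimal sketch might look like this (assuming you have already rendered the page to a PNG, e.g. the pg_89.png from your snippet above):

library(tesseract)

# run the English-language OCR engine over the rendered page image
eng <- tesseract("eng")
pg_89_text <- ocr("pg_89.png", engine = eng)
cat(pg_89_text)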


Hi @pete

Thank you for this suggestion! I did not know there was a tesseract R package. I will look into this now.