Need to extract tables from a PDF using R

Hi,
I need to extract a side-by-side PDF table (as shown in the image) and save it as a CSV. Could anybody help me figure out how to do this? I am using the following code, but have been unable to get it to work.

library(tabulizer)

# Extract every table the package's heuristics can find in the report
report <- extract_tables(
  "http://visionofhumanity.org/app/uploads/2017/11/Global-Terrorism-Index-2017.pdf",
  encoding = "UTF-8"
  )

# Tables 25 and 26 of the returned list
call_data <- report[[25]]

# Promote the first row to column names, then drop it
colnames(call_data) <- as.character(unlist(call_data[1, ]))
call_data <- call_data[-1, ]

write.csv(call_data, "C:\\my_project\\output1.csv", row.names = FALSE)

call_data2 <- report[[26]]
colnames(call_data2) <- as.character(unlist(call_data2[1, ]))
call_data2 <- call_data2[-1, ]

write.csv(call_data2, "C:\\my_project\\output2.csv", row.names = FALSE)

I constantly do table extraction from PDF files. I do it using the plain data.table package and have successfully extracted many complicated tables :slight_smile: ... of course the code ends up a little longer, but customisation is key for me.
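
For illustration, here's a minimal sketch of that style of approach (two assumptions: the pdftools package supplies the raw text layer, which isn't named above, and the report has been saved locally under a hypothetical file name):

library(pdftools)    # assumed: reads the raw text layer of the PDF
library(data.table)

# Read the text of page 106 and split it into individual lines
page  <- pdf_text("Global-Terrorism-Index-2017.pdf")[106]
lines <- trimws(strsplit(page, "\n")[[1]])

# Keep only lines that start with a rank number, then pull fields out
# with regexes -- this per-table string work is the "customisation"
dt <- data.table(line = lines)[grepl("^\\d+\\s", line)]
dt[, rank := as.integer(sub("^(\\d+)\\s.*$", "\\1", line))]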

Hi @Sanjmeh,
Would you be willing to help me here? I need to extract the tables on pages 106 and 107 of the following PDF and push them into a CSV: http://visionofhumanity.org/app/uploads/2017/11/Global-Terrorism-Index-2017.pdf

Any help would be appreciated. Thanks


Here's some code that I believe does what you're looking for. Often when using tabulizer, I find I have to manually define the areas of the tables I want to extract. A great way to do this is with the locate_areas() function, which lets you interactively drag a box around a table and gives you the coordinates to plug into the area argument of extract_tables().
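
For example, a minimal sketch of that interactive step (assuming the report has been saved locally under this hypothetical file name; the coordinates it returns are what get plugged into the area list below):

library(tabulizer)

# Opens an interactive viewer: drag a box around each table, and the
# function returns a list of c(top, left, bottom, right) coordinates
areas <- locate_areas(
  "Global-Terrorism-Index-2017.pdf",
  pages = c(106, 107)
  )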

library(tidyverse)
library(tabulizer)

gti_table <- extract_tables(
  "http://visionofhumanity.org/app/uploads/2017/11/Global-Terrorism-Index-2017.pdf",
  output = "data.frame",
  pages = c(106, 106, 107, 107), # include pages twice to extract two tables per page
  area = list(
    c(182, 38, 787, 287),
    c(182, 298, 787, 543),
    c(78, 48, 781, 298),
    c(78, 308, 643, 558)
    ),
  guess = FALSE
  )

gti_table_clean <- reduce(gti_table, bind_rows) %>% # bind elements of list to 1 df
  as_tibble() %>%
  filter(!(X %in% c("GTI RANK", ""))) %>% # remove rownames and empty rows
  rename(
    gti_rank = X,
    country = X.1,
    gti_score = X.2,
    change_score = CHANGE.IN
    ) %>%
  mutate_at(vars(gti_rank, gti_score, change_score), as.numeric) %>% # convert to numeric 
  print()
#> # A tibble: 163 x 4
#>    gti_rank country     gti_score change_score
#>       <dbl> <chr>           <dbl>        <dbl>
#>  1        1 Iraq            10          -0.04 
#>  2        2 Afghanistan      9.44        0.004
#>  3        3 Nigeria          9.01        0.305
#>  4        4 Syria            8.62       -0.033
#>  5        5 Pakistan         8.4         0.214
#>  6        6 Yemen            7.88        0.198
#>  7        7 Somalia          7.65       -0.106
#>  8        8 India            7.53       -0.049
#>  9        9 Turkey           7.52       -0.777
#> 10       10 Libya            7.26        0.027
#> # … with 153 more rows

Created on 2018-10-27 by the reprex package (v0.2.1)
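
To finish off the original ask of saving the result as a CSV, the cleaned table can then be written out, e.g. (the file name here is just an example):

write_csv(gti_table_clean, "gti_2017.csv")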


Hi @mfherman,

Thanks a ton! I now understand how it was done. Great explanation. One quick question: can we automate this process? When I run the code, I need to write the data to a CSV; the next time I run it, I need to truncate the earlier data in the CSV and push in the new data.

Glad it helped!

I'm not completely clear on what you would like to do, but if you have an existing csv file, you can append new rows to the existing file like this:

write_csv(my_df, "path/my.csv", append = TRUE)
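
Note that with append = TRUE, write_csv() does not write the column names again (its col_names argument defaults to !append). And if the goal is instead to truncate the file and replace its contents on each run, the plain default call already does that, since it overwrites any existing file:

write_csv(my_df, "path/my.csv")  # overwrites (truncates) an existing file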

If this isn't what you are looking to do, could you explain a little more your desired workflow?


Hi @mfherman !
I am trying to run this piece of code, similar to what I did for the PDF above. However, I am facing an issue and getting the following error. Any idea what the problem could be?

gpi_table1 <- extract_tables(
  "http://visionofhumanity.org/app/uploads/2018/06/Global-Peace-Index-2018-2.pdf",
  output = "data.frame",
  pages = c(10, 10, 10, 11, 11, 11), # each page listed three times: three tables per page
  area = list(
    c(496, 38, 786, 169),
    c(496, 212, 786, 341),
    c(496, 380, 786, 508),
    c(496, 392, 738, 521),
    c(496, 225, 788, 353),
    c(496, 50, 786, 180)
    ),
  guess = FALSE
  )
gpi_table_clean1 <- reduce(gpi_table1, bind_rows)

ERROR:

Error in utils::download.file(path, tmp, quiet = quiet, mode = "wb") : 
  cannot open URL 'http://visionofhumanity.org/app/uploads/2018/06/Global-Peace-Index-2018-2.pdf'
In addition: Warning message:
In utils::download.file(path, tmp, quiet = quiet, mode = "wb") :
  InternetOpenUrl failed: 'The operation timed out'
> gpi_table_clean1 <- reduce(gpi_table1, bind_rows)
Error in is_empty(x) : object 'gpi_table1' not found

Any help here would be much appreciated.
Thanks.

Running your code works fine for me, but it seems like your local machine is having trouble downloading the PDF. Are you sure you had a working internet connection when you tried that code? (The second error, object 'gpi_table1' not found, just follows from the first: the download failed, so gpi_table1 was never created.) Alternatively, you could download the PDF to your computer and then use that local version as the data source rather than the web version.
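
If you go the local-file route, here's a minimal sketch (the local file name is just an example):

# Download once in binary mode so the PDF isn't corrupted on Windows
download.file(
  "http://visionofhumanity.org/app/uploads/2018/06/Global-Peace-Index-2018-2.pdf",
  destfile = "gpi-2018.pdf",
  mode = "wb"
  )

Then pass "gpi-2018.pdf" to extract_tables() in place of the URL.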


Hmm :thinking: restarting RStudio and the web browser worked for me. Strange! :grinning:

Thanks for the suggestion :slight_smile:

