Read pdf document in R

Dobrokhotov1989 · June 28, 2021, 9:18am

I'm trying to find a way to analyze the text of pdf documents in R. Ideally, I want to get an R object with the document content where the text flow would not be interrupted by headers/footnotes/page numbers, etc.

I've found the pdftools::pdf_text() function, which converts pdf documents into character vectors. The problem is that it "ruins" the text because it does not take the document layout into account (see the example below; scroll to the right to see the problem).

Can anyone suggest another tool to read pdfs in R?

suppressMessages(library(pdftools))
suppressMessages(library(tidyverse))


test_pdf <- pdftools::pdf_text(pdf = "https://www.molbiolcell.org/doi/pdf/10.1091/mbc.E20-09-0582")

by_row_pdf <- stringr::str_split(test_pdf, pattern = "\n")

## In this part of the document there are notes to the right of the main text
## These notes "break" the main text
by_row_pdf[[1]][19:22]
#> [1] "ABSTRACT Brush border microvilli enable functions that are critical for epithelial homeosta-                                  Monitoring Editor"       
#> [2] "sis, including solute uptake and host defense. However, the mechanisms that regulate the                                      William Bement"          
#> [3] "                                                                                                                              University of Wisconsin,"
#> [4] "assembly and morphology of these protrusions are poorly understood. The parallel actin"

## This part of the document has text arranged in 2 columns,
## but the conversion did not take that into account
head(by_row_pdf[[2]])
#> [1] "little is known about how it contributes to the apical domain struc-             A previous proteomic study by our laboratory revealed that brush"   
#> [2] "ture, microvillar organization, or brush border function. Interestingly,     border fractions isolated from mouse small intestine contain all three" 
#> [3] "knockout (KO) mouse models lacking major brush border structural             NM2 paralogues, with NM2C exhibiting high-level abundance (Mc-"         
#> [4] "components, such as PACSIN-2, plastin-1, or multiple actin-bun-              Connell et al., 2011). Although NM2C remains the most poorly un-"       
#> [5] "dling proteins (villin, espin, and plastin-1), exhibit significant pertur-   derstood paralogue with regard to biophysical properties and physi-"    
#> [6] "bations to the terminal web (Grimm-Gunter et al., 2009; Revenu               ological function, previous work established that this isoform exhibits"

DavoWW · June 28, 2021, 10:30am

Hi @Dobrokhotov1989,
According to this SO link:

https://stackoverflow.com/questions/42541849/extract-text-from-two-column-pdf-with-r

you can read/process multi-column PDF files using tabulizer::extract_text(file).

Dobrokhotov1989 · June 29, 2021, 3:13am

Thank you for the suggestion. tabulizer::extract_text(file) does the job much better, but it is still not perfect.
It handles 2 columns well, but when a section of the text starts on one page and continues on the next, it still cannot tell the main text from the notes, so it puts notes between the lines of the main text. I don't know the "inner structure" of PDFs. Maybe there is no embedded information that would allow separating the main text from auxiliary information.

So, any other suggestions are welcome.

suppressMessages(library(tabulizer))
suppressMessages(library(tidyverse))

test <- tabulizer::extract_text(file = "https://www.molbiolcell.org/doi/pdf/10.1091/mbc.E20-09-0582")

test_spl <- stringr::str_split(test, "\n")

## Below, elements 1, 2, and 7 are main text; in the extracted output
## they are separated by ~30 elements of footnotes, sidebar notes, etc.
test_spl[[1]][c(45:48, 74:76)]
#> [1] "and Tilney, 1975; Hull and Staehelin, 1979). While the terminal web \r"
#> [2] "was first described several decades ago in ultrastructural studies, \r"
#> [3] "Monitoring Editor\r"                                                   
#> [4] "William Bement\r"                                                      
#> [5] "Vanderbilt University Medical Center, Nashville, TN 37232\r"           
#> [6] "2804 | C. R. Chinowsky et al. Molecular Biology of the Cell\r"         
#> [7] "little is known about how it contributes to the apical domain struc-\r"
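
If the interleaved lines follow recognizable patterns, one workaround is to filter them out after extraction. The following is a minimal sketch under that assumption: drop_boilerplate_lines() is a hypothetical helper, and the patterns are hand-written for the lines shown in the output above, so they would need adjusting for any other document.

```r
# Hypothetical helper: trim trailing whitespace (including "\r") and drop
# lines matching any of the supplied regular expressions.
drop_boilerplate_lines <- function(lines, patterns) {
  lines <- trimws(lines, which = "right")
  # TRUE for every line matched by at least one pattern.
  is_noise <- Reduce(`|`, lapply(patterns, grepl, x = lines),
                     rep(FALSE, length(lines)))
  lines[!is_noise]
}

# A few lines copied from the extract_text() output above.
lines <- c(
  "was first described several decades ago in ultrastructural studies, \r",
  "Monitoring Editor\r",
  "William Bement\r",
  "2804 | C. R. Chinowsky et al. Molecular Biology of the Cell\r",
  "little is known about how it contributes to the apical domain struc-\r"
)

# Patterns written by hand for this paper (assumption, not a general rule).
patterns <- c(
  "^Monitoring Editor$",
  "^William Bement$",
  "^[0-9]+ \\| .* Molecular Biology of the Cell$"
)

cleaned <- drop_boilerplate_lines(lines, patterns)
print(cleaned)
```

This is brittle by design: it only removes what the patterns describe, so it suits a batch of papers from one journal better than arbitrary PDFs.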
