I'm trying to find a way to analyze the text of pdf documents in R. Ideally, I want to get an R object with the document content where the text flow would not be interrupted by headers/footnotes/page numbers, etc.
I've found the pdftools::pdf_text() function, which converts PDF documents into character vectors. The problem is that it "ruins" the text because it does not take the document layout into account (see the example below; scroll to the right to see the problem).
Can anyone suggest another tool for reading PDFs in R that preserves the reading order?
suppressMessages(library(pdftools))
suppressMessages(library(tidyverse))
test_pdf <- pdftools::pdf_text(pdf = "https://www.molbiolcell.org/doi/pdf/10.1091/mbc.E20-09-0582")
by_row_pdf <- stringr::str_split(test_pdf, pattern = "\n")
## In this part of the document there were notes to the right of the main text
## These notes "break" the main text
by_row_pdf[[1]][19:22]
#> [1] "ABSTRACT Brush border microvilli enable functions that are critical for epithelial homeosta- Monitoring Editor"
#> [2] "sis, including solute uptake and host defense. However, the mechanisms that regulate the William Bement"
#> [3] " University of Wisconsin,"
#> [4] "assembly and morphology of these protrusions are poorly understood. The parallel actin"
## This part of the document has text arranged in 2 columns,
## but the conversion did not take that into account
head(by_row_pdf[[2]])
#> [1] "little is known about how it contributes to the apical domain struc- A previous proteomic study by our laboratory revealed that brush"
#> [2] "ture, microvillar organization, or brush border function. Interestingly, border fractions isolated from mouse small intestine contain all three"
#> [3] "knockout (KO) mouse models lacking major brush border structural NM2 paralogues, with NM2C exhibiting high-level abundance (Mc-"
#> [4] "components, such as PACSIN-2, plastin-1, or multiple actin-bun- Connell et al., 2011). Although NM2C remains the most poorly un-"
#> [5] "dling proteins (villin, espin, and plastin-1), exhibit significant pertur- derstood paralogue with regard to biophysical properties and physi-"
#> [6] "bations to the terminal web (Grimm-Gunter et al., 2009; Revenu ological function, previous work established that this isoform exhibits"
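For reference, one direction that might be worth testing is pdftools::pdf_data(), which returns one data frame per page with a row per word and its position (columns x, y, text, among others). With positions available, the reading order of a two-column page could in principle be rebuilt by hand. The sketch below is only an illustration under strong assumptions: `reflow_two_columns()` and `split_x` are hypothetical names, the gutter position has to be chosen by eye for each document, and words on the same line are assumed to share exactly the same `y` value, which may not hold for every PDF. It runs on a small made-up data frame rather than on the real document.

```r
# Hypothetical helper: rebuild reading order from word boxes shaped like the
# per-page data frames returned by pdftools::pdf_data() (columns x, y, text).
# `split_x` is an assumed horizontal position of the gutter between columns.
reflow_two_columns <- function(words, split_x) {
  # Assign each word to the left (1) or right (2) column by its x position
  words$column <- ifelse(words$x < split_x, 1L, 2L)
  # Sort by column, then top to bottom, then left to right
  words <- words[order(words$column, words$y, words$x), ]
  # Words sharing a column and a y value are treated as one line (assumption)
  line_id <- paste(words$column, words$y)
  line_id <- factor(line_id, levels = unique(line_id))
  lines <- tapply(words$text, line_id, paste, collapse = " ")
  unname(as.character(lines))
}

# Made-up word positions imitating two lines of a two-column page
toy_words <- data.frame(
  x    = c(10, 40, 300, 340, 10, 40, 300, 340),
  y    = c(100, 100, 100, 100, 112, 112, 112, 112),
  text = c("little", "is", "A", "previous",
           "known", "about", "proteomic", "study"),
  stringsAsFactors = FALSE
)

reflowed <- reflow_two_columns(toy_words, split_x = 200)
print(reflowed)

# On a real file the input would be something like (not run here):
# words <- pdftools::pdf_data("paper.pdf")[[2]]
```

This does not solve the side-note problem on the first page by itself; the same idea would need an extra filter on `x` (or on font size, if available) to drop words that fall in the margin.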