I'm trying to find a way to analyze the text of pdf documents in R. Ideally, I want to get an R object with the document content where the text flow would not be interrupted by headers/footnotes/page numbers, etc.
I know of the pdftools::pdf_text() function, which converts pdf documents into character vectors. The problem is that it "ruins" the text because it does not take the document layout into account (see the example below; scroll to the right to see the problem).
Can anyone suggest another tool to read pdfs in R?
suppressMessages(library(pdftools))
suppressMessages(library(tidyverse))

test_pdf <- pdftools::pdf_text(pdf = "https://www.molbiolcell.org/doi/pdf/10.1091/mbc.E20-09-0582")
by_row_pdf <- stringr::str_split(test_pdf, pattern = "\n")

## In this part of the document there were notes to the right of the main text
## These notes "break" the main text
by_row_pdf[[1]][19:22]
#> [1] "ABSTRACT Brush border microvilli enable functions that are critical for epithelial homeosta- Monitoring Editor"
#> [2] "sis, including solute uptake and host defense. However, the mechanisms that regulate the William Bement"
#> [3] " University of Wisconsin,"
#> [4] "assembly and morphology of these protrusions are poorly understood. The parallel actin"

## This part of the document has text arranged in 2 columns,
## but this was not taken into account during the conversion
head(by_row_pdf[[2]])
#> [1] "little is known about how it contributes to the apical domain struc- A previous proteomic study by our laboratory revealed that brush"
#> [2] "ture, microvillar organization, or brush border function. Interestingly, border fractions isolated from mouse small intestine contain all three"
#> [3] "knockout (KO) mouse models lacking major brush border structural NM2 paralogues, with NM2C exhibiting high-level abundance (Mc-"
#> [4] "components, such as PACSIN-2, plastin-1, or multiple actin-bun- Connell et al., 2011). Although NM2C remains the most poorly un-"
#> [5] "dling proteins (villin, espin, and plastin-1), exhibit significant pertur- derstood paralogue with regard to biophysical properties and physi-"
#> [6] "bations to the terminal web (Grimm-Gunter et al., 2009; Revenu ological function, previous work established that this isoform exhibits"
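
To make "taking the layout into account" concrete, here is a minimal sketch of the kind of result I am after. It assumes word-level boxes shaped like what pdftools::pdf_data() returns (one row per word, with x, y and text columns). The helper read_two_columns() and its split_x argument are hypothetical names made up for illustration, and the toy data frame only stands in for a real page.

```r
# Hypothetical helper: rebuild reading order for a two-column page from
# word-level boxes shaped like one page of pdftools::pdf_data() output
# (one row per word, with columns x, y and text).
read_two_columns <- function(words, split_x) {
  # Assign each word to the left or right column by its x position.
  side <- ifelse(words$x < split_x, "left", "right")

  column_text <- function(w) {
    # Sort top to bottom, then left to right within a line.
    w <- w[order(w$y, w$x), ]
    # Assumption: words sharing the same y coordinate form one line.
    lines <- tapply(w$text, w$y, paste, collapse = " ")
    unname(as.character(lines))
  }

  # Read the whole left column first, then the whole right column.
  c(column_text(words[side == "left", ]),
    column_text(words[side == "right", ]))
}

# Toy input standing in for one page of pdf_data() output.
words <- data.frame(
  x = c(10, 40, 300, 330, 10, 300),
  y = c(100, 100, 100, 100, 112, 112),
  text = c("little", "is", "A", "previous", "known", "proteomic"),
  stringsAsFactors = FALSE
)

read_two_columns(words, split_x = 200)
```

On a real document the input would be one element of the list returned by pdftools::pdf_data(), and split_x would have to be chosen per page (for example, around half the page width). Real pages will likely also need a tolerance when grouping words into lines, plus an x threshold to drop marginal notes like the "Monitoring Editor" block, so this is only a starting point and not a full solution.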