How to extract the page number from a .pdf by a string and split it?

Hi, guys!

I have a .pdf with 120 certificates. Each page is one certificate, and the only difference between pages is the name of the participant.

I also have a .csv with each participant's name and e-mail (later I will also try to send the certificates by e-mail with R).

How can I split out each certificate (page) and save it as a new .pdf named after the participant?

I saw functions like pdf_subset from library(pdftools), but how can I identify the page number by some text?

# extract some pages
pdf_subset('https://cran.r-project.org/doc/manuals/r-release/R-intro.pdf',
  pages = 1:3, output = "subset.pdf")

Example of the .pdf (sorry, it is in Portuguese):
https://drive.google.com/file/d/1iwgW6kMT7C9Xee5SM65vz-D8B26bpavz/view?usp=sharing

In the .csv I have the columns name and email:

name,email
Prof. Dr. Thiado Souza,thiado@gmail.com
Prof. Dr. Marcelo José,marcelo@uol.com
Ricado Augusto,ricado@terra.com
Carlos José,carlosj@hotmail.com

Splitting PDFs really, really, REALLY, REALLY isn't an R kind of thing.

If I absolutely had to do it and I wanted to involve R, I personally would lean on some external software, specifically Ghostscript.
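As a rough sketch of what leaning on Ghostscript from R might look like: the helper `gs_split_args()` and the file name `certificates.pdf` are made up for illustration, and the code assumes a Ghostscript binary named `gs` is on the PATH (on Windows it is usually `gswin64c`). With the `pdfwrite` device, a `%03d` in the output file name should give one output file per page.

```r
# Hypothetical helper: build the Ghostscript arguments that write every
# page of `input` to its own PDF. "%03d" becomes the page number.
gs_split_args <- function(input, output_pattern = "page_%03d.pdf") {
  c("-dBATCH", "-dNOPAUSE", "-dSAFER", "-sDEVICE=pdfwrite",
    paste0("-sOutputFile=", output_pattern),
    input)
}

# "certificates.pdf" is a placeholder file name
args <- gs_split_args("certificates.pdf")

# Only call Ghostscript if the binary and the input file are both present
if (nzchar(Sys.which("gs")) && file.exists("certificates.pdf")) {
  system2("gs", shQuote(args))
}
```

This only splits the file; finding which page belongs to which participant still needs the page text.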

I saw you just updated your question... Yes, you could try to use pdftools, and it might be simpler than learning something new specifically for manipulating PDFs, but it won't be as powerful.

As to your question though... You will need to install the tesseract package in addition to pdftools. After you split the pdf into smaller pdfs, run the function pdf_ocr_text() on each of them to get the text on the page as a character vector. Then you can use standard string manipulation functions to find and extract the page number, which should be the final printable substring on the page.
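The string-matching step could be sketched like this. It assumes the text of each page is already in a character vector with one element per page, which is the shape that `pdf_text()` and `pdf_ocr_text()` return. The helper `find_pages()` and the sample page strings are made up for illustration; they are not taken from the real certificates.

```r
# Hypothetical helper: for each name, find the page whose text contains it.
# `page_text` is a character vector with one element per page.
find_pages <- function(page_text, names) {
  # Collapse runs of whitespace so names broken across lines still match
  squish <- function(x) trimws(gsub("\\s+", " ", x))
  page_text <- squish(page_text)
  vapply(squish(names), function(nm) {
    hit <- which(grepl(nm, page_text, fixed = TRUE))
    # NA when no page, or more than one page, contains the name
    if (length(hit) == 1L) hit else NA_integer_
  }, integer(1), USE.NAMES = FALSE)
}

# Made-up page text standing in for the output of pdf_text()
page_text <- c("Certificamos que Ricado Augusto participou ...",
               "Certificamos que Prof. Dr. Thiado\nSouza participou ...")

pages <- find_pages(page_text,
                    c("Prof. Dr. Thiado Souza", "Ricado Augusto", "Carlos José"))
pages
```

With the real file, something like `page_text <- pdftools::pdf_text("certificates.pdf")` would supply the input (the file name is a placeholder), and each non-NA page number can then be passed to `pdf_subset()` as `pages`. Names that match no page, or more than one, come back as NA so they can be checked by hand.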


Sad to hear it :( I don't want to do it manually, and I only know R.

I will take a look at Ghostscript.

Thanks for the suggestion!

Honestly, you can do it in R if you want, but some tools are better than others for certain jobs. R just wouldn't be my first choice of tool to break up and rename a LOT of PDFs...

I wouldn't be able to help you with the regex for extracting the participant names without actually having access to the PDF, nor do I even know how possible it would be. But, breaking up a pdf into individual pages could be done by...

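# install.packages("pdftools")  # run once if the package is not installed
library(pdftools)
#> Using poppler version 0.73.0

# Download a sample PDF to work with
download.file("https://cran.r-project.org/doc/manuals/r-release/R-intro.pdf",
              destfile = "r-intro.pdf", mode = "wb")

# Write each requested page of `pdf` to its own file, e.g. "r-intro 1.pdf"
get_pages <- function(pdf, pages = seq_len(pdf_length(pdf))) {
  # Extract one page into "<file name without extension> <page>.pdf"
  get_one_page <- function(pdf, page) {
    pdf_subset(pdf,
               pages = page,
               output = paste0(tools::file_path_sans_ext(pdf),
                               " ",
                               page,
                               ".pdf"))
  }
  # Vectorize over `page` so the helper runs once per page number
  get_the_pages <- Vectorize(get_one_page,
                             vectorize.args = "page")
  get_the_pages(pdf, pages)
}

pdf <- "r-intro.pdf"
get_pages(pdf)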

Created on 2020-09-01 by the reprex package (v0.3.0)

Extracting the names will be a bit more difficult, and it is hard to help with remotely, but once you can do that, renaming the files will be easy.
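A minimal sketch of that renaming step, assuming the names have already been matched to the split files in page order. The helper `safe_filename()` is made up for illustration, and the demo uses empty placeholder files in a temporary directory rather than real certificates.

```r
# Hypothetical helper: turn a participant name into a safe file name
safe_filename <- function(name, ext = ".pdf") {
  # Replace every run of characters other than letters, digits, or "-" with "_"
  stem <- gsub("[^[:alnum:]-]+", "_", name)
  # Drop leading/trailing underscores left behind by punctuation
  stem <- gsub("^_+|_+$", "", stem)
  paste0(stem, ext)
}

# Demo with empty placeholder files in a temporary directory
dir <- tempfile("certs")
dir.create(dir)
old <- file.path(dir, c("certificates 1.pdf", "certificates 2.pdf"))
file.create(old)

# Names listed in the same order as the pages they belong to
participants <- c("Prof. Dr. Thiado Souza", "Ricado Augusto")
new <- file.path(dir, safe_filename(participants))
file.rename(old, new)
basename(new)
```

How accented letters are treated by `[:alnum:]` depends on the locale, so it is worth checking the generated names before renaming the real files.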


I will try it later! Thanks!

I added an example of the .pdf and .csv files to the question.

It worked fine! thanks!
