Cleaning PDF text into tidy format

I have PDF text that I need to convert into "tidy" format, but I'm unsure how to read it in without compromising the information I need. For example:

# install pacman package if you require it
if (!require("pacman")) install.packages("pacman")

# p_load installs and loads packages

pacman::p_load(tidyverse, pdftools, tabulizer)

pdf_txt_raw <- pdf_text("https://www.statcan.gc.ca/eng/statistical-programs/document/5027_D1_V10-eng.pdf") %>% read_lines()

pdf_txt_raw

Using read_lines() seems to cause a problem: whenever an entry in the "legal name" column wraps onto two lines, the result no longer fits the tidy format I'm looking for. For example, the Loblaw Inc line [4] should be easy to clean up, because its operating names are separated by commas and all sit on the same line as the legal name, giving me a clean category.

But the very first legal name category is wrong due to a line break in the PDF - i.e., "Buy-Low Foods Limited Partnership" should be the legal name, and the operating names within that category should be "AG Foods, Buy-Low Foods, Buy & Save Foods, Fine Foods, G&H Shop N' Save, Nesters Market".

Any tips on how to clean this properly and get the tidy format I'm looking for?

Found this article. Did you already try pdf_txt_raw[1:5], or whatever subset extracts the right rows?
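In case it helps, here is a rough sketch of one way to stitch wrapped rows back together before splitting the operating names. It runs on a small made-up sample rather than the real PDF, so treat it as a starting point only. The assumptions are mine, not something I checked against the file: the two columns are separated by two or more spaces, a wrapped row continues on the next line, and a blank line separates one record from the next. The names `lines`, `collapse_cells`, and `tidy` are just illustrative, and the "Store A" / "Store B" entries are invented.

```r
# Hypothetical sample standing in for pdf_txt_raw; the real spacing and
# wrapping in the PDF may differ, so adjust the rules below to match.
lines <- c(
  "Buy-Low Foods Limited    AG Foods, Buy-Low Foods, Buy & Save Foods,",
  "Partnership              Fine Foods, G&H Shop N' Save, Nesters Market",
  "",
  "Loblaw Inc               Loblaws, Store A, Store B"
)

# Assumption: a run of 2+ spaces marks the gap between the two columns.
has_gap <- grepl("\\s{2,}", lines, perl = TRUE)
legal <- trimws(sub("\\s{2,}.*$", "", lines, perl = TRUE))
ops <- ifelse(has_gap,
              trimws(sub("^.*?\\s{2,}", "", lines, perl = TRUE)),
              "")

# Assumption: a blank line starts a new record, so consecutive
# non-blank lines belong to the same (possibly wrapped) row.
blank <- trimws(lines) == ""
record <- cumsum(blank)[!blank]

# Paste the non-empty fragments of each record back into one cell.
collapse_cells <- function(x) paste(x[nzchar(x)], collapse = " ")
legal_by <- vapply(split(legal[!blank], record), collapse_cells, character(1))
ops_by <- vapply(split(ops[!blank], record), collapse_cells, character(1))

# One row per operating name, with the legal name repeated alongside.
op_list <- strsplit(unname(ops_by), ",\\s*")
tidy <- data.frame(
  legal_name = rep(unname(legal_by), lengths(op_list)),
  operating_name = unlist(op_list),
  stringsAsFactors = FALSE
)

print(tidy)
```

If the real text has no blank lines between records, the grouping step would need a different rule for spotting continuation lines, for example checking whether the previous operating-names cell ends with a comma.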
