Digitize tables from PDF: blank cells and two-line cells

I am digitizing some tables extracted from a PDF. The tables all have the same structure, but there are two main issues I can't get past.

First, entries inside a cell are sometimes written over two lines. This can happen in any column (e.g. Adresse (Village/Avenue) or Nom du site de vote).
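To make the first issue concrete, here is a minimal sketch on made-up data. It assumes that the second line of a cell comes out of the extraction as an extra row whose Numero cell is empty, which may not match what `extract_tables()` actually returns for every page. The helper `merge_continuations()` and the toy values are hypothetical.

```r
# Toy data (made up): row 2 is the spill-over line of row 1.
# Assumption: a continuation line has an empty 'Numero' cell.
pages <- data.frame(
  Numero = c("1", "", "2"),
  NomSV  = c("SITE A", "ANNEXE", "SITE B"),
  Adress = c("AV. DE LA", "PAIX", "AV. DEUX"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: fold each continuation row into the row above it
merge_continuations <- function(df, id_col = "Numero") {
  # Each non-empty id starts a new logical record
  group <- cumsum(df[[id_col]] != "")
  pieces <- split(df, group)
  merged <- lapply(pieces, function(g) {
    # Paste the non-empty fragments of every column, top to bottom
    as.data.frame(
      lapply(g, function(col) paste(col[col != ""], collapse = " ")),
      stringsAsFactors = FALSE
    )
  })
  out <- do.call(rbind, merged)
  rownames(out) <- NULL
  out
}

merged <- merge_continuations(pages)
```

On the toy data this gives two rows, with "SITE A ANNEXE" and "AV. DE LA PAIX" rejoined in the first one. It only works if the identifier column is never itself split over two lines.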

Second, columns are sometimes empty. This mostly happens with the column Groupement/Quartier. As a result, the extracted data do not land in the same columns consistently from one page to the next.
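A rough sketch of one way to realign such a page, again on made-up data. It assumes that a page whose Groupement/Quartier column is entirely blank comes back with one column fewer, and that the position of the missing column is known in advance. Neither assumption has been checked against the real PDF, and `insert_blank_col()` is a hypothetical helper.

```r
# Hypothetical helper: insert an empty column named 'name' at position 'pos'
insert_blank_col <- function(df, pos, name) {
  blank <- setNames(
    data.frame(rep("", nrow(df)), stringsAsFactors = FALSE),
    name
  )
  left  <- df[, seq_len(pos - 1), drop = FALSE]
  right <- df[, seq(pos, length.out = ncol(df) - pos + 1), drop = FALSE]
  cbind(left, blank, right)
}

# Toy page (made up) where the 4th column was dropped by the extraction
page <- data.frame(
  V1 = c("1", "2"),
  V2 = c("SITE A", "SITE B"),
  V3 = c("SECTEUR X", "SECTEUR Y"),
  V4 = c("AV. UN", "AV. DEUX"),
  stringsAsFactors = FALSE
)

padded <- insert_blank_col(page, pos = 4, name = "Group_Quart")
```

After padding, the address values sit in the fifth column, as they would on a page where every column is filled.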

I am interested in digitizing the main table. Here is the original data: https://www.ceni.cd/assets/bundles/documents/cadre-legal/cadre-legal_1545812547.pdf

My strategy has been to split the original PDF of 1800+ pages into single-page PDFs, digitize each page separately into a list, and then, based on the number of columns each page yields, treat the problems separately.
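The "treat by number of columns" step can be sketched as follows, using small made-up data frames in place of the real per-page results. The names `my_pages`, `n_cols` and `by_width` are placeholders.

```r
# Toy stand-ins for the per-page extraction results (made up)
my_pages <- list(
  data.frame(a = 1:2, b = 3:4, c = 5:6),
  data.frame(a = 1:2, b = 3:4),
  data.frame(a = 7:8, b = 9:10, c = 11:12)
)

# Number of columns found on each page
n_cols <- vapply(my_pages, ncol, integer(1))

# Group the pages by column count; list names are the counts as strings
by_width <- split(my_pages, n_cols)

# Pages with the same width can be cleaned with one function and stacked
three_col <- do.call(rbind, by_width[["3"]])
```

Each group can then get its own cleaning function before everything is bound back together.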

The code below works fine when each cell is filled and contains only one line. However, when a column starts with blanks or when entries span two lines, the values don't end up in the right rows and columns.

Does anyone have an idea for how to solve these issues?

Thank you!

```r
library(dplyr)
library(tabulizer)

headers <- c('Numero', 'NomSV', 'Sect_Chef_Com', 'Group_Quart', 'Adress_Vill_Aven',
             'Nbre_CV', 'Nbre_BVD', 'CodeSV', 'Plage')

my_list <- list()

# Loop through the pages and digitize each one separately
for (i in 1:10) {
  # Location of the single-page CENI PDF file
  location <- paste0("my_path/page", i, ".pdf")

  # extract_tables() returns a list with one element per detected table
  out <- extract_tables(location)

  # Stack the tables, then keep rows 3 onward (drops the first two rows)
  final <- do.call(rbind, out)
  final <- as.data.frame(final[3:nrow(final), ])

  my_list[[i]] <- final
}

####### Data frames with 10 columns are easy to deal with
list10var <- my_list[sapply(my_list, ncol) == 10]

# Align all columns and remove empty rows.
# select_if(~ !all(is.na(.))) is used instead of na.omit() to remove column V5.
my_function <- function(df) {
  df %>%
    mutate(V1 = lag(V1, n = 2)) %>%
    mutate(V2 = lag(V2)) %>%
    mutate(V3 = lag(V3)) %>%
    mutate(V4 = lag(V4)) %>%
    mutate(V6 = lag(V6)) %>%
    filter(V1 != '') %>%
    select_if(~ !all(is.na(.)))
}

# Apply the function to the list of data frames
list10var <- lapply(list10var, my_function)

final10var <- do.call("rbind", list10var)
```