Keep in mind that scraping data from a PDF is usually tedious, and the techniques that work for one PDF rarely carry over unchanged to another. The code below extracts the data in Table 9 from page 23 of the article.
# Load packages ----
library(here)
library(purrr)
library(pdftools)
library(stringr)
library(tesseract)
# Extract content of page 23 ----
raw_data <- pdftools::pdf_ocr_text(pdf = here("data/jojosouza.pdf"), pages = 23)
# Keep content related to Table 9 ----
split_data <- str_split(string = raw_data, pattern = "\\n")
start_i <- str_which(string = unlist(split_data), pattern = "Table 9")
end_i <- str_which(string = unlist(split_data), pattern = "Table 10")
raw_table9_data <- unlist(split_data)[(start_i + 1):(end_i - 1)]
# Write a function to clean the data ----
transform_to_row <- function(raw) {
  # The classifier name is what remains once the numeric tokens are removed
  model <- str_remove_all(
    string = raw,
    pattern = "\\s\\d+[\\d\\.]*"
  )
  # The numeric tokens become a one-row data frame of values
  values <- str_extract_all(
    string = raw,
    pattern = "\\s\\d+[\\d\\.]*"
  ) %>%
    unlist() %>%
    str_squish() %>%
    as.numeric() %>%
    t() %>%
    as.data.frame()
  data.frame(x = model, values)
}
# Apply the function and set column names ----
final <- map_dfr(raw_table9_data[-1], transform_to_row) %>%
  setNames(c("Classifiers", "TOPSIS", "GRA", "VIKOR", "PROMETHEE II", "ELECTRE III"))
# Final data ----
final
               Classifiers TOPSIS    GRA  VIKOR PROMETHEE II ELECTRE III
 1               Bayes net 0.8005 0.9476 0.5309       0.8687      0.8635
 2             Naive Bayes 0.7336 0.6903 0.5322       0.6970      0.7686
 3 Incremental Naive Bayes 0.7336 0.6903 0.5322       0.6162      0.7705
 4                     IB1 0.6549 0.5069 0.4339       0.3232      0.4628
 5             AdaBoost M1 0.9633 0.9533 0.8330       0.8788      0.9300
 6              HyperPipes 0.0000 0.0000 0.0000       0.1717      0.0964
 7                     VFI 0.3610 0.4120 0.1752       0.3636      0.4367
 8        Conjunctive rule 0.0308 0.0195 0.0152       0.2121      0.1232
 9          Decision table 0.8307 0.8956 0.4873       0.8384      0.9110
10                    OneR 0.4106 0.3484 0.2194       0.3535      0.4020
11                    PART 1.0000 0.9888 1.0000       0.9394      1.0000
12                   ZeroR 0.0289 0.0177 0.0130       0.0000      0.0000
13          Decision stump 0.1678 0.1257 0.0839       0.0808      0.1360
14                    C4.5 0.8569 0.9914 0.5070       0.9293      0.9261
15            Grafted C4.5 0.8672 1.0000 0.5109       1.0000      0.9442
16             Random tree 0.5830 0.4283 0.3034       0.4040      0.4912
17                REP tree 0.7838 0.7665 0.4097       0.5960      0.7890
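To see what the pattern inside `transform_to_row()` does to a single line, here is a minimal sketch on two rows copied from the result above. The names `rows`, `value_pattern`, `models`, and `values` are mine and only for illustration, and I am assuming the OCR text looks like these strings; real OCR output may contain extra whitespace or misread characters.

```r
library(stringr)

# Two rows copied from the final table; real OCR output may differ slightly
rows <- c(
  "Incremental Naive Bayes 0.7336 0.6903 0.5322 0.6162 0.7705",
  "Grafted C4.5 0.8672 1.0000 0.5109 1.0000 0.9442"
)

# Same pattern as in transform_to_row(): whitespace, then digits and dots
value_pattern <- "\\s\\d+[\\d\\.]*"

# Dropping the matches leaves the classifier name; "C4.5" survives because
# its digits are not preceded by whitespace
models <- str_remove_all(string = rows, pattern = value_pattern)

# Extracting the matches gives the scores for each row
values <- lapply(
  str_extract_all(string = rows, pattern = value_pattern),
  function(x) as.numeric(str_squish(x))
)

print(models)
print(values)
```

The leading `\\s` is what keeps digits inside a name (C4.5, IB1, M1) out of the values. A classifier whose name contained a standalone number would break this, which is one reason the approach will probably need adjusting for other tables.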