Hi,
Many, many thanks for the help! It works just fine.
I know it's an awful data frame. I asked the organization to provide me with the data (at least in xlsx format), but they did not follow up on the request, their reasoning being that all the data is already available online (even though it is very difficult to work with in that form).
With the following lines of code, I could remove the problematic lines:
dta_med_1996 <- dta_med_1996[-1:-3]
dta_med_1996 <- dta_med_1996[-57]
As your solution is better, I will rely on the tabulizer package.
One small change: to get the headers right, I added the variable name "Spécialités" as the first element of the vector of cantons:
cantons <- c("Spécialités", "ZH", "BE", "LU", "UR", "SZ", "OW", "NW", "GL", "ZG", "FR", "SO", "BS", "BL", "SH", "AR", "AI", "SG", "GR", "AG", "TG", "TI", "VD", "VS", "NE", "GE", "JU", "Total")
Then I drop the first 2 rows (as you mentioned, they should be treated separately). Here is the whole code to extract the data from the PDF:
library(pdftools)
library(tidyverse)
library(tabulizer)

cantons <- c("Spécialités", "ZH", "BE", "LU", "UR", "SZ", "OW", "NW", "GL", "ZG", "FR", "SO", "BS", "BL", "SH", "AR",
             "AI", "SG", "GR", "AG", "TG", "TI", "VD", "VS", "NE", "GE", "JU", "Total")

med_1996 <- "https://www.fmh.ch/files/pdf5/stat1996.pdf"

dta_med_1996 <- as_tibble(extract_tables(med_1996, pages = 9, encoding = "UTF-8")[[1]]) %>%
  mutate(across(where(is.character), ~na_if(., "")))

dta_med_1996 <- dta_med_1996 %>%
  `names<-`(cantons) %>%  # insert new headers
  slice(3:58)             # drop the first 2 rows
This gives the following output:
Spécialités ZH BE LU UR SZ OW NW GL ZG FR SO BS BL SH AR AI SG GR AG TG TI VD VS NE GE JU Total
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 Allgemeinmedizin / Médecine générale 479 417 145 15 45 11 14 16 33 58 98 68 104 36 24 3 163 94 174 91 103 173 97 40 85 12 2598
2 spez. Arbeitsmedizin / spéc. médecine du travail 3 3 3 NA NA NA NA NA NA NA 1 5 3 NA NA NA NA NA 2 1 1 6 NA NA NA NA 28
3 Total Allgemeinmedizin / total médecine générale 482 420 148 15 45 11 14 16 33 58 99 73 107 36 24 3 163 94 176 92 104 179 97 40 85 12 2626
4 Anästhesiologie / Anesthésiologie 101 91 20 2 3 1 2 2 9 21 7 24 19 2 4 NA 23 12 33 3 21 64 15 13 36 5 533
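Since every column comes out as <chr>, a possible next step is to convert the counts to integers. Below is a minimal sketch of the same steps on a made-up matrix, so it can be run without downloading the PDF. The matrix values, the shortened `headers` vector and the object names `raw` and `toy` are invented for illustration and are not taken from the FMH file. Unlike the code above, the headers are set on the matrix before converting it to a tibble, which is just a convenience for the toy example.

```r
library(dplyr)
library(tibble)

# Made-up stand-in for extract_tables(...)[[1]]: a character matrix with
# two rows of header debris followed by the actual data rows.
raw <- matrix(
  c("",            "",    "",
    "",            "ZH",  "BE",
    "Specialty A", "479", "417",
    "Specialty B", "3",   ""),
  ncol = 3, byrow = TRUE
)

# Hypothetical, shortened header vector (the real one has 28 entries).
headers <- c("Specialty", "ZH", "BE")
colnames(raw) <- headers

toy <- as_tibble(raw) %>%
  mutate(across(where(is.character), ~ na_if(.x, ""))) %>%  # empty cells become NA
  slice(3:4) %>%                                            # drop the first 2 rows
  mutate(across(-1, as.integer))                            # all but the first column to integer

print(toy)
```

On the real data, the last `mutate()` line could probably be appended to the pipeline as is, assuming all cells outside the first column contain only digits or NA. I have not checked this against the full table.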
Once again, many thanks for the help and the solution!
Best regards,
SL