Hi there @floris_e
You made a pretty great attempt at a function for a newcomer to R.
You don't need all the RSelenium, JSON, XML and EXIF tools that have been suggested above.
The site is perfectly scrapeable with rvest and friends. (rvest imports xml2 - you don't need to load this separately).
I've amended your function to work with the polite package, which encourages responsible web scraping. This solution assumes (I think, correctly) that a web page exists for every date from 2015-01-01 to 2020-09-24.
It's paste0 that creates a vector of URLs by combining the base URL with the series of dates (rewritten as dd-mm-yyyy, which is the format the archive URLs use). Then the purrr::map function passes the URLs one by one to the scrape function (and combines the results into a single data frame using the map_dfr variant).
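As a rough, offline illustration of those two steps, here is a minimal sketch. It builds the same URLs with base R's seq() and format() instead of the regex replacement used in my code, and fake_scrape is a hypothetical stand-in for the real scrape function so that nothing is actually requested from the site.

```r
library(purrr)
library(dplyr)

url_root <- "https://www.ongelukvandaag.nl/archief/"

# One date per day, formatted as dd-mm-yyyy to match the archive URLs
dates <- seq(as.Date("2015-01-01"), as.Date("2020-09-24"), by = "day")
urls <- paste0(url_root, format(dates, "%d-%m-%Y"))

length(urls) # 2094 URLs, one per day
urls[1]      # "https://www.ongelukvandaag.nl/archief/01-01-2015"

# Hypothetical stand-in for scrape_page(): returns one row per URL, no network
fake_scrape <- function(url) {
  dplyr::tibble(url = url, n_chars = nchar(url))
}

# map_dfr calls the function once per URL and row-binds the results
demo <- purrr::map_dfr(urls[1:3], fake_scrape)
demo
```

Swapping fake_scrape for the real function is all that changes in the full pipeline.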
See my comments on the code below.
library(dplyr)
library(lubridate)
library(polite) # https://dmi3kno.github.io/polite/
library(purrr)
library(rvest)
library(stringr)
url_root <- "https://www.ongelukvandaag.nl/archief/"
session <- polite::bow(
  url = url_root,
  delay = 5) # 5s delay for responsible scraping

scrape_page <- function(url) {
  page_text <- polite::nod(session, url) %>%
    # xml2::read_html is built in to polite::scrape I think, so not needed here
    polite::scrape()

  headings <- page_text %>%
    rvest::html_nodes("h2") %>%
    rvest::html_text()

  dates <- page_text %>%
    rvest::html_nodes(".text-muted") %>%
    rvest::html_text() %>%
    stringr::str_extract("[0-9]{2}-[0-9]{2}-[0-9]{4}") # just extract the date

  dplyr::tibble(headings = headings, dates = dates)
}

as_date(ymd("2015-01-01"):ymd("2020-09-24")) %>%
  stringr::str_replace_all(pattern = "([0-9]{4})-([0-9]{2})-([0-9]{2})", replacement = "\\3-\\2-\\1") %>%
  paste0(url_root, .) %>%
  sample(3) %>% # just use a sample for testing - remove this line for full scrape
  purrr::map_dfr(scrape_page) %>% # combine results to a single tibble
  dplyr::mutate(dates = lubridate::dmy(dates)) # convert to a valid date
#> # A tibble: 35 x 2
#>    headings                                                           dates
#>    <chr>                                                              <date>
#>  1 Fietser gewond na aanrijding bij viertonde Stadskanaal.            2016-11-19
#>  2 Automobilist gewond bij eenzijdig ongeval op de Hopeseweg.         2016-11-19
#>  3 Gewonde en veel schade na ongeval in Elspeet.                      2016-11-19
#>  4 Motorrijder komt om het leven na botsing met auto.                 2016-11-19
#>  5 Automobilist zwaargewond na ongeluk op Nieuwendijk in Someren-Hei~ 2016-11-19
#>  6 Ook machinist verhoord over ongeluk Winsum.                        2016-11-19
#>  7 Treinongeluk: 18 gewonden, waarvan drie zwaargewond.               2016-11-19
#>  8 Motorrijder gewond na eenzijdig ongeval Rijswijk.                  2016-11-19
#>  9 Auto total loss na botsing verkeerspaal Ettensebaan Breda.         2016-11-19
#> 10 Spoor Winsum nog dagen dicht na treinongeval.                      2016-11-19
#> # ... with 25 more rows
Created on 2020-09-28 by the reprex package (v0.3.0)
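If you want to see what the date handling in scrape_page does without hitting the site, here is a small offline sketch. The two input strings are made up for illustration (I'm assuming the .text-muted text contains a dd-mm-yyyy date somewhere in it, as the regex in my code does); the functions are the same stringr and lubridate calls used above.

```r
library(stringr)
library(lubridate)

# Hypothetical examples of text that might sit in a ".text-muted" element
raw_text <- c("Geplaatst op 19-11-2016", "no date in this one")

# Keep only the dd-mm-yyyy part; strings without a match become NA
extracted <- stringr::str_extract(raw_text, "[0-9]{2}-[0-9]{2}-[0-9]{4}")
extracted

# Day-month-year strings to proper Date values; NA stays NA
parsed <- lubridate::dmy(extracted)
parsed
```

One thing to watch: tibble() needs headings and dates to be the same length (or length 1), so a page where the number of h2 elements and .text-muted elements differ would make scrape_page fail for that URL.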
Please note that with the 5s wait between scrapes, the full scrape of 2,094 pages will take at least 10,470 seconds (just under three hours) to complete. The estimate comes from:
library(lubridate)
as.period(length(ymd("2015-01-01"):ymd("2020-09-24")) * seconds(5), unit = "hours")
You can just use read_html() and not use polite if you need it all to happen more quickly.
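For what that might look like, here is a minimal sketch of a read_html() version. The name scrape_page_direct and the pause argument are my own inventions for illustration, not part of any package; the selectors are the same as in the polite version. Because read_html() also accepts a literal HTML string, you can try the function on a made-up snippet before pointing it at real URLs.

```r
library(rvest)
library(stringr)
library(dplyr)

# Hypothetical non-polite variant; `pause` (seconds) is optional and defaults to 0
scrape_page_direct <- function(url, pause = 0) {
  Sys.sleep(pause) # set pause > 0 to keep some delay between requests
  page <- rvest::read_html(url)

  headings <- page %>%
    rvest::html_nodes("h2") %>%
    rvest::html_text()

  dates <- page %>%
    rvest::html_nodes(".text-muted") %>%
    rvest::html_text() %>%
    stringr::str_extract("[0-9]{2}-[0-9]{2}-[0-9]{4}")

  dplyr::tibble(headings = headings, dates = dates)
}

# Offline check with a made-up HTML snippet instead of a URL
test_html <- "<html><body><h2>Example heading</h2><p class='text-muted'>Geplaatst op 19-11-2016</p></body></html>"
check <- scrape_page_direct(test_html)
check
```

You would then use purrr::map_dfr(urls, scrape_page_direct) in place of the polite version, bearing in mind that this drops the rate limiting and robots.txt checks that polite does for you.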