I need to scrape 5 times 365 pages, but don't know how

floris_e · September 24, 2020, 3:56pm

hi guys,

For a school project I have to scrape a website, which in itself isn't a problem. But for it to count as Big Data, I want to scrape the whole archive (the past 5 years). The only thing that changes in the URL is the date at the end, but I don't know how to write a script that changes only that date.

The website I'm using is this: Archief met gebeurde Ongevallen gefilterd op datum (archive of past accidents, filtered by date).

The dates I need run from 01-01-2015 until 24-09-2020. I have already figured out the first part of the code and I'm able to scrape 1 page. I'm a beginner at using R and would like to know if anyone could help me. The code is shown below. Thanks in advance!

library(tidyverse)
library(rvest)

# Scrape all pages of site

get_one_page <- function(url){

  #scrape all elements
  html <- read_html(url)

  #get all descriptions on this page
  all_title <- html %>%
    html_nodes("h2") %>%
    html_text()

  #get all airing dates on this page
  date <- html %>%
    html_nodes(".text-muted") %>%
    html_text()

  # Combine into a tibble
  return(tibble(date = date, title = all_title))
}

#scrape multiple pages
scrape_write_table <- function(url){

  # unfinished: the date still has to be appended to the url here
  list_of_pages <- str_c(url, )
  list_of_pages
}

#Get element data from one page
url<- "https://www.ongelukvandaag.nl/archief/21-09-20"
ongelukken <- get_one_page(url)

vvalin · September 24, 2020, 7:10pm

Hi @floris_e, do you still need help?

Try this code. I had similar issues in the past when trying to scrape real estate pages. The idea is that the first loop retrieves all the links for each day, and the second loop then scrapes the data you need from each link.

I ran the code and it works for me. Advice: be patient, because you're scraping a large amount of data. Hope this helps:

library(rvest)
library(dplyr)
library(xml)
library(stringr)
library(jsonlite)
library(xml12)
library(purrr)
library(tidyr)
library(reshape)
library(XML)
library(robotstxt)
library(Rcrawler)
library(RSelenium)
library(ps)
library(devtools)
library(exifr)
library(Publish)

# Create an url object
url <- "https://www.ongelukvandaag.nl/archief/%d"

# Verify the web can be scraped
paths_allowed(paths = c(url))

# Obtain the links for every day from 2015 to 2020
map_df(2015:2020, function(i){
  page <- read_html(sprintf(url, i))

  data.frame(Links = html_attr(html_nodes(page, ".archief a"), "href"))
}) -> Links

# Fix the urls retrieved
Links$Links <- paste("https://www.ongelukvandaag.nl/", Links$Links, sep = "")

# Scrape what you want from each link:
d <- map(Links$Links, function(x) {

  Z <- read_html(x)

  Date <- Z %>% html_nodes(".text-muted") %>% html_text(trim = TRUE) # Last update
  All_title <- Z %>% html_nodes("h2") %>% html_text(trim = TRUE) # Title

  return(tibble(All_title, Date))

})

floris_e · September 28, 2020, 9:06am

Hi vvalin,

Thanks for helping me. But the code doesn't work for me because my RStudio can't find some of the packages. Do you know how I can fix this?

library(rvest)

library(dplyr)
library(xml)
Error in library(xml) : there is no package called ‘xml’
library(stringr)
library(jsonlite)
library(xml12)
Error in library(xml12) : there is no package called ‘xml12’
library(purrr)
library(tidyr)
library(reshape)
library(XML)
library(robotstxt)
library(Rcrawler)
library(RSelenium)
library(ps)
library(devtools)
library(exifr)
library(Publish)

#Verify the web can be scrapped
paths_allowed(paths = c(url))
www.ongelukvandaag.nl

[1] TRUE

#Obtain the links for every day from 2015 to 2020
map_df(2015:2020, function(i){

page<-read_html(sprintf(url,i))
data.frame(Links = html_attr(html_nodes(page, ".archief a"),"href"))
}) -> Links
Error in open.connection(x, "rb") : HTTP error 400. >

#Fix the urls retrieved
Links$Links<-paste("https://www.ongelukvandaag.nl/",Links$Links,sep = "")
Error in paste("https://www.ongelukvandaag.nl/", Links$Links, sep = "") :
object 'Links' not found

#Scrap what you want from each link:
d<- map(Links$Links, function(x) {
Z <- read_html(x)
Date <- Z %>% html_nodes(".text-muted") %>% html_text(trim = TRUE) # Last update
All_title <- Z %>% html_nodes("h2") %>% html_text(trim = TRUE) # Title
return(tibble(All_title,Date))
})
Error in map(Links$Links, function(x) { : object 'Links' not found

floris_e · September 28, 2020, 9:09am

Hi vvalin,

Which version of RStudio did you use for this code?

francisbarton · September 28, 2020, 4:11pm

Hi there @floris_e

You made a pretty great attempt at a function for a newcomer to R.
You don't need all the RSelenium, JSON, XML and EXIF tools that have been suggested above.

The site is perfectly scrapeable with rvest and friends. (rvest imports xml2 - you don't need to load this separately).

I've amended your function to work with the polite package that encourages responsible web scraping. This solution assumes (I think, correctly) that a web page exists for every date from 2015-01-01 to 2020-09-24.
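
If that assumption turns out to be wrong for some date, a single missing page would stop the whole run. Here is a minimal sketch of one way to guard against that, assuming purrr::possibly() is wrapped around the scraping function. Note that fake_scrape and the three example URLs are made-up stand-ins so the example runs offline; they are not part of the real scraper.

```r
library(purrr)
library(dplyr)

# Hypothetical stand-in for scrape_page(): errors for one "missing" date
fake_scrape <- function(url) {
  if (grepl("02-01-2015", url, fixed = TRUE)) stop("HTTP error 404")
  tibble(headings = paste("item from", url), dates = sub(".*/", "", url))
}

# possibly() returns the `otherwise` value instead of raising the error
empty_result <- tibble(headings = character(), dates = character())
safe_scrape <- possibly(fake_scrape, otherwise = empty_result)

urls <- paste0("https://www.ongelukvandaag.nl/archief/",
               c("01-01-2015", "02-01-2015", "03-01-2015"))

# The failing URL contributes zero rows; the other two are kept
results <- bind_rows(map(urls, safe_scrape))
results
```

In a real run the same wrapper could go around scrape_page, so a failed date simply adds no rows instead of aborting the loop.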

It's paste0 that creates a vector of URLs by combining the base URL with the series of dates. Then the purrr::map function passes the URLs one by one to the scrape function (and combines the results into a single data frame using the map_dfr variant).
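
As a small standalone illustration of that URL-building step, here is a sketch in base R only. It uses seq() and format() rather than the lubridate and regex approach in my code further down, so treat it as an alternative way to get the same vector, assuming the site expects dates as dd-mm-yyyy.

```r
url_root <- "https://www.ongelukvandaag.nl/archief/"

# One Date per day in the range, both ends included
dates <- seq(as.Date("2015-01-01"), as.Date("2020-09-24"), by = "day")

# format() turns each Date into the dd-mm-yyyy text used at the end of the URL;
# paste0() is vectorised, so this yields one URL per date
urls <- paste0(url_root, format(dates, "%d-%m-%Y"))

length(urls)
head(urls, 3)
```

Each element of urls can then be handed to the scraping function one at a time.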

See my comments on the code below.

library(dplyr)
library(lubridate)
library(polite) # https://dmi3kno.github.io/polite/
library(purrr)
library(rvest)
library(stringr)

url_root <- "https://www.ongelukvandaag.nl/archief/"

session <- polite::bow(
  url = url_root,
  delay = 5)      # 5s delay for responsible scraping


scrape_page <- function(url) {
  page_text <- polite::nod(session, url) %>% 
    # xml2::read_html is built in to polite::scrape I think, so not needed here
    polite::scrape()
  
  headings <- page_text %>% 
    rvest::html_nodes("h2") %>% 
    rvest::html_text()
    
  dates <- page_text %>% 
    rvest::html_nodes(".text-muted") %>%
    rvest::html_text() %>% 
    stringr::str_extract("[0-9]{2}-[0-9]{2}-[0-9]{4}") # just extract the date
  
  dplyr::tibble(headings = headings, dates = dates)
}

as_date(ymd("2015-01-01"):ymd("2020-09-24")) %>% 
  stringr::str_replace_all(pattern = "([0-9]{4})-([0-9]{2})-([0-9]{2})", replacement = "\\3-\\2-\\1") %>% 
  paste0(url_root, .) %>% 
  sample(3) %>%  #  just use a sample for testing - remove this line for full scrape
  purrr::map_dfr(scrape_page) %>% # combine results to a single tibble
  dplyr::mutate(dates = lubridate::dmy(dates)) # convert to a valid date
#> # A tibble: 35 x 2
#>    headings                                                           dates     
#>    <chr>                                                              <date>    
#>  1 Fietser gewond na aanrijding bij viertonde Stadskanaal.            2016-11-19
#>  2 Automobilist gewond bij eenzijdig ongeval op de Hopeseweg.         2016-11-19
#>  3 Gewonde en veel schade na ongeval in Elspeet.                      2016-11-19
#>  4 Motorrijder komt om het leven na botsing met auto.                 2016-11-19
#>  5 Automobilist zwaargewond na ongeluk op Nieuwendijk in Someren-Hei~ 2016-11-19
#>  6 Ook machinist verhoord over ongeluk Winsum.                        2016-11-19
#>  7 Treinongeluk: 18 gewonden, waarvan drie zwaargewond.               2016-11-19
#>  8 Motorrijder gewond na eenzijdig ongeval Rijswijk.                  2016-11-19
#>  9 Auto total loss na botsing verkeerspaal Ettensebaan Breda.         2016-11-19
#> 10 Spoor Winsum nog dagen dicht na treinongeval.                      2016-11-19
#> # ... with 25 more rows

Created on 2020-09-28 by the reprex package (v0.3.0)

Please note that with the 5s wait between scrapes, the process will take at least

library(lubridate)
as.period(length(ymd("2015-01-01"):ymd("2020-09-24")) * seconds(5), unit = "hours")

to complete (2,094 pages at 5 seconds each is 10,470 seconds, or roughly 2 hours 54 minutes).
You can just use read_html() and not use polite if you need it all to happen more quickly.
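
A rough sketch of what that could look like, under a few assumptions: parse_page and scrape_page_direct are names I made up, the 1-second pause is an arbitrary choice, and the HTML string at the bottom is invented purely so the parsing step can be checked without touching the site.

```r
library(rvest)
library(tibble)

# Pull headings and the ".text-muted" text out of an already-parsed page
parse_page <- function(page) {
  headings <- html_text(html_nodes(page, "h2"), trim = TRUE)
  dates <- html_text(html_nodes(page, ".text-muted"), trim = TRUE)
  tibble(headings = headings, dates = dates)
}

# Fetch one URL directly with read_html(), pausing briefly between requests
scrape_page_direct <- function(url, pause = 1) {
  Sys.sleep(pause)
  parse_page(xml2::read_html(url))
}

# Offline check on a made-up page (not the site's real markup)
demo_page <- xml2::read_html(
  '<html><body><h2>Voorbeeld ongeval.</h2>
   <p class="text-muted">19-11-2016</p></body></html>'
)
demo_result <- parse_page(demo_page)
demo_result
```

Even without polite, keeping some pause in the loop is kinder to the server than firing off 2,094 requests back to back.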

vvalin · September 28, 2020, 9:35pm

Hello @francisbarton, thank you very much for your post. As you can see, I am not very experienced with R programming. I tried to help @floris_e with some code I built for some of my web scraping projects. That's why the code has some redundant packages.

I love the idea of doing friendly web scraping. Actually, I did not know about the polite package, but I will use it for subsequent projects. It was also very useful to learn that read_html is already included in polite::scrape.

Again, thank you very much!

Regards

francisbarton · September 28, 2020, 9:48pm

@vvalin it's ok glad to help. I have not yet used RSelenium successfully. I know it's needed for some scraping where the page is generated dynamically.

vvalin · September 28, 2020, 10:07pm

Hey @floris_e! Actually, I do not know exactly why you are getting that error. My RStudio version is 1.2.1335 and my R version is 4.0.2 (2020-06-22). I have checked your code, and you're not assigning a url object, but I'm not sure that's the real problem.

Regarding the packages not found, I just realized xml and xml12 don't exist!! My bad! Instead you can type XML and xml2. If the error persists, let me know.
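
One way to catch this kind of typo before running a long script is to check which packages are actually installed. This is just a sketch: find_missing is a helper name I made up, and the package list is only an example.

```r
# Return the names in `pkgs` that are not installed.
# requireNamespace() gives FALSE instead of an error for unknown packages.
find_missing <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  unname(pkgs[!installed])
}

# "stats" ships with R; "xml12" is the misspelled name from my first post
missing_pkgs <- find_missing(c("stats", "xml12"))
missing_pkgs

# Anything listed here is either misspelled or still needs installing, e.g.
# install.packages("xml2")
```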

In any case, I suggest following @francisbarton's code. I ran it on my computer and it works perfectly. Moreover, you will be doing responsible web scraping, which I find quite interesting too.

Hope you get the data you want!

vvalin · September 28, 2020, 10:13pm

To be honest, I tried to use RSelenium once and could not get anything out of it. Yes, it is aimed at scraping websites built with JavaScript. Maybe one day I will figure out how to use it correctly.

floris_e · September 29, 2020, 9:53am

Hi @francisbarton! Thank you very much for helping me with this. I only have two more questions :) Do I need to change anything in the code? And how can I get the scraped data into a viewable dataset? Again, thanks very much!

francisbarton · September 29, 2020, 12:15pm

The code is a reprex which means that you can run it as it is, and you should get the same result I got. It's a great way of sharing code on forums like this - because the example is self-contained, it's portable from my machine to yours.

As you'll see if you read the comments I wrote on the code:

I have extracted just the date from the ".text-muted" (last updated) sections of the pages - but you might not want this?
I have also inserted the line sample(3) in the code so it just samples three of the dates at random. Obviously you'll want to remove this line when you want to get the text from all 2,094 dates. But make sure the function is working as you want first, before scraping the whole dataset.
The end result is a single tibble (data frame) - you can see this if you look at the code I posted. If you want to export this object to a file you can use readr::write_csv for example.
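
To keep the result around you first need to assign the pipeline's output to a name; the code as posted only prints it. Here is a minimal sketch of the export step, where the two-row ongelukken tibble is a hand-typed stand-in for the real scrape result and the file name is only an example:

```r
library(readr)
library(tibble)

# Stand-in for the scraped data, e.g. ongelukken <- <the pipeline above>
ongelukken <- tibble(
  headings = c("Fietser gewond na aanrijding bij viertonde Stadskanaal.",
               "Motorrijder gewond na eenzijdig ongeval Rijswijk."),
  dates = as.Date(c("2016-11-19", "2016-11-19"))
)

# Write to a CSV file (a temporary folder here; use your own path in practice)
out_file <- file.path(tempdir(), "ongelukken.csv")
write_csv(ongelukken, out_file)

# Read it back to confirm the file holds what you expect
check <- read_csv(out_file)
check
```

In RStudio, View(ongelukken) opens the same tibble in the spreadsheet-style data viewer.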

You can also add your own custom user string to the initial polite::bow call, using the user_agent parameter. This could include your name and email address. See ?polite::bow and this article.

floris_e · September 29, 2020, 1:32pm

hi Francis,

I've finally figured out how i can use the data. Thanks again haha! You're a legend.
