I was wondering whether someone could help me understand what's going on with my attempt to scrape a few pages from a site. Although I get an error (HTTP 404) and wrap my function with the `possibly` adverb inside `map`, I do not get the `NULL` defined in the `otherwise` argument, but some other data, and I have no idea where this data comes from.
``` r
library(rvest)
#> Loading required package: xml2
library(tidyverse)
#> Warning: package 'ggplot2' was built under R version 3.5.3
#> Warning: package 'tibble' was built under R version 3.5.3
#> Warning: package 'tidyr' was built under R version 3.5.3
#> Warning: package 'readr' was built under R version 3.5.2
#> Warning: package 'purrr' was built under R version 3.5.3
#> Warning: package 'dplyr' was built under R version 3.5.2
#> Warning: package 'forcats' was built under R version 3.5.2
library(glue)
#> Warning: package 'glue' was built under R version 3.5.3
#>
#> Attaching package: 'glue'
#> The following object is masked from 'package:dplyr':
#>
#> collapse
# here I create the link grid; I know that category 358 has fewer than 5 pages
seq_categories <- c(350, 358, 366)
seq_pages <- seq(1, 5, 1)
df_seq <- expand.grid(seq_pages=seq_pages, seq_categories=seq_categories)
df_links <- df_seq %>%
mutate(link=glue("http://www.ohr.int/?cat={seq_categories}&paged={seq_pages}"))
# define the scraping function
scr_bonn <- function(link) {
  pb$tick()$print()
  print(link)
  site <- read_html(link)
  if (!is.na(site)) {
    date.publish <- site %>%
      html_nodes(".date-publish") %>%
      html_text() %>%
      enframe(name = NULL, value = "date.publish")
    decision.name <- site %>%
      html_nodes(".name") %>%
      html_text() %>%
      enframe(name = NULL, value = "decision.name")
    # print(decision.name)
    bind_cols(date.publish = date.publish, decision.name = decision.name)
  }
}
pb <- progress_estimated(nrow(df_links))
# map the function over all links
df_scrap_results <- df_links$link %>%
  set_names() %>%
  map_dfr(possibly(scr_bonn, otherwise = NULL, quiet = FALSE), .id = "link_to_page")
#> http://www.ohr.int/?cat=350&paged=1
#> http://www.ohr.int/?cat=350&paged=2
#> http://www.ohr.int/?cat=350&paged=3
#> http://www.ohr.int/?cat=350&paged=4
#> http://www.ohr.int/?cat=350&paged=5
#> http://www.ohr.int/?cat=358&paged=1
#> http://www.ohr.int/?cat=358&paged=2
#> http://www.ohr.int/?cat=358&paged=3
#> Error: HTTP error 404.
#> http://www.ohr.int/?cat=358&paged=4
#> Error: HTTP error 404.
#> http://www.ohr.int/?cat=358&paged=5
#> Error: HTTP error 404.
#> http://www.ohr.int/?cat=366&paged=1
#> http://www.ohr.int/?cat=366&paged=2
#> http://www.ohr.int/?cat=366&paged=3
#> http://www.ohr.int/?cat=366&paged=4
#> http://www.ohr.int/?cat=366&paged=5
# extract page and category numbers from the mapped link
df_scrap_results <- df_scrap_results %>%
  mutate(page = str_extract(link_to_page, "(?<=paged\\=)[:digit:]+") %>% as.numeric()) %>%
  mutate(category = str_extract(link_to_page, "(?<=cat\\=)[:digit:]+") %>% as.numeric())
# results for 'non-existent' links; category 358 has results for pages which threw an error
wrong <- df_scrap_results %>%
  filter(page > 2 & category == "358")
wrong
#> # A tibble: 12 x 5
#> link_to_page date.publish decision.name page category
#> <chr> <chr> <chr> <dbl> <dbl>
#> 1 http://www.ohr.in~ 01/05/2011 Order Suspending the App~ 4 358
#> 2 http://www.ohr.in~ 09/12/2009 OHR Inventory Team Estab~ 4 358
#> 3 http://www.ohr.in~ 06/25/2008 Decision Amending the La~ 4 358
#> 4 http://www.ohr.in~ 06/25/2008 Decision Amending the La~ 4 358
#> 5 http://www.ohr.in~ 06/25/2008 Decision Amending the La~ 4 358
#> 6 http://www.ohr.in~ 12/19/2007 Decision Amending the La~ 4 358
#> 7 http://www.ohr.in~ 12/19/2007 Decision Amending the La~ 4 358
#> 8 http://www.ohr.in~ 12/19/2007 Decision Amending the La~ 4 358
#> 9 http://www.ohr.in~ 09/28/2007 Decision Amending the La~ 4 358
#> 10 http://www.ohr.in~ 09/28/2007 Decision Amending the La~ 4 358
#> 11 http://www.ohr.in~ 09/28/2007 Decision Amending the La~ 4 358
#> 12 http://www.ohr.in~ 09/14/2007 Decision Withdrawing the~ 4 358
#> Warning messages:
#> 1: In .Internal(parent.frame(n)) :
#>   closing unused connection 5 (http://www.ohr.int/?cat=358&paged=5)
#> 2: In .Internal(parent.frame(n)) :
#>   closing unused connection 4 (http://www.ohr.int/?cat=358&paged=4)
#> 3: In .Internal(parent.frame(n)) :
#>   closing unused connection 3 (http://www.ohr.int/?cat=358&paged=3)
```

Created on 2019-05-07 by the reprex package (v0.2.1)
When I look at the results, I see that they contain data for pages which do not actually exist. I know that category 358 has only 2 pages; nevertheless my iteration went up to the end of the created vector (5) and scraped data whose origin I haven't been able to figure out. Instead of skipping the excess page numbers, it retrieved some data.
My understanding of the `possibly` adverb was that when the wrapped function throws an error, it returns the `otherwise` value and the iteration moves on to the next element. In this case the function does throw an error, but instead of the `NULL` from `otherwise` I get some data.
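To illustrate my mental model with a minimal sketch (a toy `risky` function standing in for my actual scraper):

``` r
library(purrr)

# a toy function that errors on non-numeric input
risky <- function(x) data.frame(value = log(x))

# possibly() should swallow the error and return `otherwise` instead
safe <- possibly(risky, otherwise = NULL)

safe(100)  # a one-row data frame, as usual
safe("a")  # would throw an error; returns NULL instead

# in map_dfr(), NULL results contribute no rows when row-binding,
# so only the successful inputs appear in the output
res <- map_dfr(list(1, "a", 100), safe)
nrow(res)  # 2
```

That is exactly the behaviour I see for pages 3 to 5 of the other categories, just not for category 358.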
Any idea what's going on?