How to skip non-existent pages when scraping?

I was wondering whether someone could help me understand what's going on with my attempt to scrape a few pages from a site. Although I get an error message (404) and wrap my function in the possibly adverb inside the map call, I do not get the NULL defined in the 'otherwise' option, but some other data. And I have no idea where this data comes from.

``` r
library(rvest)
#> Loading required package: xml2
library(tidyverse)
#> Warning: package 'ggplot2' was built under R version 3.5.3
#> Warning: package 'tibble' was built under R version 3.5.3
#> Warning: package 'tidyr' was built under R version 3.5.3
#> Warning: package 'readr' was built under R version 3.5.2
#> Warning: package 'purrr' was built under R version 3.5.3
#> Warning: package 'dplyr' was built under R version 3.5.2
#> Warning: package 'forcats' was built under R version 3.5.2
library(glue)
#> Warning: package 'glue' was built under R version 3.5.3
#> 
#> Attaching package: 'glue'
#> The following object is masked from 'package:dplyr':
#> 
#>     collapse


#here I create the vector; I know that category 358 has less than 5 pages;
seq_categories <- c(350, 358, 366)
seq_pages <- seq(1, 5, 1)
df_seq <- expand.grid(seq_pages=seq_pages, seq_categories=seq_categories)
df_links <- df_seq %>%
  mutate(link=glue("http://www.ohr.int/?cat={seq_categories}&paged={seq_pages}"))

#define function
scr_bonn <- function(link) {
  
  pb$tick()$print()
  
  print(link)
  site <- read_html(link)
  if(!is.na(site)) { 
  
  date.publish <- site %>% 
    html_nodes(".date-publish") %>%
    html_text() %>% 
    enframe(name=NULL, value="date.publish")
  
  decision.name <- site %>%
    html_nodes(".name") %>%
    html_text %>% 
    enframe(name=NULL, value="decision.name")
 # print(decision.name)
  
  bind_cols(date.publish=date.publish, decision.name=decision.name)

  }
}                

pb <- progress_estimated(nrow(df_links))

#map function
df_scrap_results <- df_links$link %>% 
  set_names() %>% 
  map_dfr(., possibly(scr_bonn, otherwise=NULL, quiet=FALSE), .id="link_to_page")
#> http://www.ohr.int/?cat=350&paged=1
#> http://www.ohr.int/?cat=350&paged=2
#> http://www.ohr.int/?cat=350&paged=3
#> http://www.ohr.int/?cat=350&paged=4
#> http://www.ohr.int/?cat=350&paged=5
#> http://www.ohr.int/?cat=358&paged=1
#> http://www.ohr.int/?cat=358&paged=2
#> http://www.ohr.int/?cat=358&paged=3
#> Error: HTTP error 404.
#> http://www.ohr.int/?cat=358&paged=4
#> Error: HTTP error 404.
#> http://www.ohr.int/?cat=358&paged=5
#> Error: HTTP error 404.
#> http://www.ohr.int/?cat=366&paged=1
#> http://www.ohr.int/?cat=366&paged=2
#> http://www.ohr.int/?cat=366&paged=3
#> http://www.ohr.int/?cat=366&paged=4
#> http://www.ohr.int/?cat=366&paged=5

#extract pages and categories from mapped link
df_scrap_results <- df_scrap_results %>% 
  mutate(page=str_extract(link_to_page, "(?<=paged\\=)[:digit:]+") %>% as.numeric) %>% 
  mutate(category=str_extract(link_to_page, "(?<=cat\\=)[:digit:]+") %>% as.numeric) 

#results for non-existent links; category 358 has results for links/pages which throw an error
wrong <- df_scrap_results %>% 
  filter(page>2 & category=="358")
wrong
#> # A tibble: 12 x 5
#>    link_to_page       date.publish decision.name              page category
#>    <chr>              <chr>        <chr>                     <dbl>    <dbl>
#>  1 http://www.ohr.in~ 01/05/2011   Order Suspending the App~     4      358
#>  2 http://www.ohr.in~ 09/12/2009   OHR Inventory Team Estab~     4      358
#>  3 http://www.ohr.in~ 06/25/2008   Decision Amending the La~     4      358
#>  4 http://www.ohr.in~ 06/25/2008   Decision Amending the La~     4      358
#>  5 http://www.ohr.in~ 06/25/2008   Decision Amending the La~     4      358
#>  6 http://www.ohr.in~ 12/19/2007   Decision Amending the La~     4      358
#>  7 http://www.ohr.in~ 12/19/2007   Decision Amending the La~     4      358
#>  8 http://www.ohr.in~ 12/19/2007   Decision Amending the La~     4      358
#>  9 http://www.ohr.in~ 09/28/2007   Decision Amending the La~     4      358
#> 10 http://www.ohr.in~ 09/28/2007   Decision Amending the La~     4      358
#> 11 http://www.ohr.in~ 09/28/2007   Decision Amending the La~     4      358
#> 12 http://www.ohr.in~ 09/14/2007   Decision Withdrawing the~     4      358
Warning messages:
1: In .Internal(parent.frame(n)) :
  closing unused connection 5 (http://www.ohr.int/?cat=358&paged=5)
2: In .Internal(parent.frame(n)) :
  closing unused connection 4 (http://www.ohr.int/?cat=358&paged=4)
3: In .Internal(parent.frame(n)) :
  closing unused connection 3 (http://www.ohr.int/?cat=358&paged=3)
```

Created on 2019-05-07 by the reprex package (v0.2.1)

When I look into the results, I see that they contain data for pages which do not actually exist. I know that category 358 has only 2 pages; nevertheless, my iteration ran to the end of the page vector (5) and returned data whose origin I haven't been able to figure out. Instead of skipping the surplus page numbers, it retrieved some data.

My understanding of the possibly adverb was that if the wrapped function throws an error, the iteration returns the value given in 'otherwise' and the map moves on to the next element. In this case the function throws an error, but instead of the NULL from 'otherwise' I get some data.
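To make that expectation concrete, here is a minimal sketch that needs no network access. `fake_scrape()` is a made-up stand-in for the real scraper (it is not part of my code above) and simply fails for page numbers above 2, roughly the way read_html() fails on a 404.

```r
library(purrr)

# Hypothetical stand-in for the scraper: errors for pages above 2,
# similar to read_html() hitting a 404.
fake_scrape <- function(page) {
  if (page > 2) stop("HTTP error 404.")
  data.frame(page = page, value = paste("row from page", page))
}

# possibly() returns a wrapped function that yields `otherwise` on error.
safe_scrape <- possibly(fake_scrape, otherwise = NULL)

res <- map(1:4, safe_scrape)

# The failed iterations are still in the list, as NULL elements.
length(res)
is_failed <- map_lgl(res, is.null)
is_failed
```

With plain map() the list keeps one element per input, so the failed pages can be identified by position.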

Any idea what's going on?

Running this as a reprex, I'm not getting any links beyond paged=2 for cat=358 in df_scrap_results (I deleted the top part of the reprex, since it's the same as yours).

``` r
df_scrap_results %>%
  filter(stringr::str_detect(link_to_page, "358")) %>%
  select(link_to_page)
#> # A tibble: 18 x 1
#>    link_to_page                       
#>    <chr>                              
#>  1 http://www.ohr.int/?cat=358&paged=1
#>  2 http://www.ohr.int/?cat=358&paged=1
#>  3 http://www.ohr.int/?cat=358&paged=1
#>  4 http://www.ohr.int/?cat=358&paged=1
#>  5 http://www.ohr.int/?cat=358&paged=1
#>  6 http://www.ohr.int/?cat=358&paged=1
#>  7 http://www.ohr.int/?cat=358&paged=1
#>  8 http://www.ohr.int/?cat=358&paged=1
#>  9 http://www.ohr.int/?cat=358&paged=1
#> 10 http://www.ohr.int/?cat=358&paged=1
#> 11 http://www.ohr.int/?cat=358&paged=1
#> 12 http://www.ohr.int/?cat=358&paged=1
#> 13 http://www.ohr.int/?cat=358&paged=2
#> 14 http://www.ohr.int/?cat=358&paged=2
#> 15 http://www.ohr.int/?cat=358&paged=2
#> 16 http://www.ohr.int/?cat=358&paged=2
#> 17 http://www.ohr.int/?cat=358&paged=2
#> 18 http://www.ohr.int/?cat=358&paged=2
```

Created on 2019-05-07 by the reprex package (v0.2.1)

Many thanks! I am looking for this emoji which scratches its head.
No idea what's going on. Will try on another machine later and update.

I noticed, however, that if I use map instead of map_dfr I do not encounter the problem.

``` r
df_scrap_results_list <- df_links$link %>% 
  set_names() %>% 
  map(., possibly(scr_bonn, otherwise=NULL, quiet=FALSE))

df_scrap_results <- df_scrap_results_list %>% 
  map_df(., bind_rows, .id="link_to_page")
```
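A variation on this two-step idea, sketched on toy data so it runs without the website: record which links failed, drop the NULL elements explicitly with compact(), and only then bind. `fake_scrape()` and the link names are invented for illustration. Whether the NULL elements are really what confuses the binding step is an assumption on my part, not something I have confirmed.

```r
library(purrr)
library(dplyr)

# Hypothetical scraper: fails for any link containing "missing".
fake_scrape <- function(link) {
  if (grepl("missing", link)) stop("HTTP error 404.")
  tibble(decision.name = paste("decision on", link))
}

links <- c("page-a", "page-missing", "page-b")

# Step 1: map only, so every link keeps its own list element (NULL on error).
results <- links %>%
  set_names() %>%
  map(possibly(fake_scrape, otherwise = NULL))

# Keep a record of the links that failed.
failed <- names(results)[map_lgl(results, is.null)]

# Step 2: drop the NULL elements explicitly, then bind with the names as id.
df <- results %>%
  compact() %>%
  bind_rows(.id = "link_to_page")

failed
df
```

This way the binding step never sees a NULL, and the failed links are kept in `failed` instead of disappearing silently.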

I ran the code on a different computer and the problem remains.

As far as I can tell, the problem originates from using map_df(r) to map the function. If I use map and subsequently map_df(., bind_rows), there is no problem.

It seems that the row-binding which is, as far as I understand, built into map_df(r) binds some data even if the mapped function returned NULL. Evidently there is no drawback to using map + map_df(., bind_rows), but I would be lying if I said I fully understood why map_df produces this wrong result. In any case, it seems quite dangerous behavior to me.

Below is a reprex which hopefully makes the problem clearer.
Many thanks again.

``` r
library(rvest)
#> Warning: package 'rvest' was built under R version 3.5.3
#> Loading required package: xml2
library(tidyverse)
#> Warning: package 'ggplot2' was built under R version 3.5.3
#> Warning: package 'tibble' was built under R version 3.5.3
#> Warning: package 'tidyr' was built under R version 3.5.3
#> Warning: package 'readr' was built under R version 3.5.2
#> Warning: package 'purrr' was built under R version 3.5.3
#> Warning: package 'dplyr' was built under R version 3.5.2
#> Warning: package 'forcats' was built under R version 3.5.2
library(glue)
#> Warning: package 'glue' was built under R version 3.5.3
#> 
#> Attaching package: 'glue'
#> The following object is masked from 'package:dplyr':
#> 
#>     collapse


#here I create the vector; I know that category 358 has less than 5 pages;
seq_categories <- c(350, 358, 366)
seq_pages <- seq(1, 5, 1)
df_seq <- expand.grid(seq_pages=seq_pages, seq_categories=seq_categories)
df_links <- df_seq %>%
  mutate(link=glue("http://www.ohr.int/?cat={seq_categories}&paged={seq_pages}"))

#define function
scr_bonn <- function(link_input) {
  
  pb$tick()$print()
  
  print(link_input)
  
  site <- read_html(link_input)
  if(!is.na(site)) { 
  
  date.publish <- site %>% 
    html_nodes(".date-publish") %>%
    html_text() %>% 
    enframe(name=NULL, value="date.publish")
  
  decision.name <- site %>%
    html_nodes(".name") %>%
    html_text %>% 
    enframe(name=NULL, value="decision.name")

    
  bind_cols(date.publish=date.publish, decision.name=decision.name)

  }
}                



# CORRECT RESULTS WITH MAP + MAP_DF ---------------------------------------
pb <- progress_estimated(nrow(df_links))

#map function
list_correct<- df_links$link %>% 
  set_names() %>% 
  map(., possibly(scr_bonn, otherwise=NULL, quiet=FALSE))
#> http://www.ohr.int/?cat=350&paged=1
#> http://www.ohr.int/?cat=350&paged=2
#> http://www.ohr.int/?cat=350&paged=3
#> http://www.ohr.int/?cat=350&paged=4
#> http://www.ohr.int/?cat=350&paged=5
#> http://www.ohr.int/?cat=358&paged=1
#> http://www.ohr.int/?cat=358&paged=2
#> http://www.ohr.int/?cat=358&paged=3
#> Error: HTTP error 404.
#> http://www.ohr.int/?cat=358&paged=4
#> Error: HTTP error 404.
#> http://www.ohr.int/?cat=358&paged=5
#> Error: HTTP error 404.
#> http://www.ohr.int/?cat=366&paged=1
#> http://www.ohr.int/?cat=366&paged=2
#> http://www.ohr.int/?cat=366&paged=3
#> http://www.ohr.int/?cat=366&paged=4
#> http://www.ohr.int/?cat=366&paged=5
  #map_df(., possibly(scr_bonn, otherwise=NULL, quiet=FALSE), .id="link_to_page")

#bind tibbles in lists with bind_rows
df_correct <- list_correct %>% 
   map_df(., bind_rows, .id="link_to_page") %>% 
   mutate(page=str_extract(link_to_page, "(?<=paged\\=)[:digit:]+") %>% as.numeric) %>% 
   mutate(category=str_extract(link_to_page, "(?<=cat\\=)[:digit:]+") %>% as.numeric) 

df_correct %>% 
  filter(page>2 & category=="358")
#> # A tibble: 0 x 5
#> # ... with 5 variables: link_to_page <chr>, date.publish <chr>,
#> #   decision.name <chr>, page <dbl>, category <dbl>


# WRONG RESULTS WHEN USING MAP_DF IMMEDIATELY -----------------------------
pb <- progress_estimated(nrow(df_links))

df_false <- df_links$link %>% 
  set_names() %>% 
  map_df(., possibly(scr_bonn, otherwise=NULL, quiet=FALSE), .id="link_to_page") %>% 
  mutate(page=str_extract(link_to_page, "(?<=paged\\=)[:digit:]+") %>% as.numeric) %>% 
  mutate(category=str_extract(link_to_page, "(?<=cat\\=)[:digit:]+") %>% as.numeric) 
#> http://www.ohr.int/?cat=350&paged=1
#> http://www.ohr.int/?cat=350&paged=2
#> http://www.ohr.int/?cat=350&paged=3
#> http://www.ohr.int/?cat=350&paged=4
#> http://www.ohr.int/?cat=350&paged=5
#> http://www.ohr.int/?cat=358&paged=1
#> http://www.ohr.int/?cat=358&paged=2
#> http://www.ohr.int/?cat=358&paged=3
#> Error: HTTP error 404.
#> http://www.ohr.int/?cat=358&paged=4
#> Error: HTTP error 404.
#> http://www.ohr.int/?cat=358&paged=5
#> Error: HTTP error 404.
#> http://www.ohr.int/?cat=366&paged=1
#> http://www.ohr.int/?cat=366&paged=2
#> http://www.ohr.int/?cat=366&paged=3
#> http://www.ohr.int/?cat=366&paged=4
#> http://www.ohr.int/?cat=366&paged=5

#reports results for links which actually have no data to extract
#http://www.ohr.int/?cat=358&paged=4
wrong <- df_false %>% 
  filter(page>2 & category=="358")
wrong
#> # A tibble: 12 x 5
#>    link_to_page       date.publish decision.name              page category
#>    <chr>              <chr>        <chr>                     <dbl>    <dbl>
#>  1 http://www.ohr.in~ 01/05/2011   Order Suspending the App~     4      358
#>  2 http://www.ohr.in~ 09/12/2009   OHR Inventory Team Estab~     4      358
#>  3 http://www.ohr.in~ 06/25/2008   Decision Amending the La~     4      358
#>  4 http://www.ohr.in~ 06/25/2008   Decision Amending the La~     4      358
#>  5 http://www.ohr.in~ 06/25/2008   Decision Amending the La~     4      358
#>  6 http://www.ohr.in~ 12/19/2007   Decision Amending the La~     4      358
#>  7 http://www.ohr.in~ 12/19/2007   Decision Amending the La~     4      358
#>  8 http://www.ohr.in~ 12/19/2007   Decision Amending the La~     4      358
#>  9 http://www.ohr.in~ 09/28/2007   Decision Amending the La~     4      358
#> 10 http://www.ohr.in~ 09/28/2007   Decision Amending the La~     4      358
#> 11 http://www.ohr.in~ 09/28/2007   Decision Amending the La~     4      358
#> 12 http://www.ohr.in~ 09/14/2007   Decision Withdrawing the~     4      358

#shows that data wrongly assigned to links were actually retrieved from subsequent(!) links in the vector/loop
wrong_in_correct_results <- df_correct %>% 
  filter(decision.name %in% wrong$decision.name) 

wrong_in_correct_results%>% 
  select(link_to_page, category, page)
#> # A tibble: 18 x 3
#>    link_to_page                        category  page
#>    <chr>                                  <dbl> <dbl>
#>  1 http://www.ohr.int/?cat=366&paged=1      366     1
#>  2 http://www.ohr.int/?cat=366&paged=1      366     1
#>  3 http://www.ohr.int/?cat=366&paged=1      366     1
#>  4 http://www.ohr.int/?cat=366&paged=1      366     1
#>  5 http://www.ohr.int/?cat=366&paged=1      366     1
#>  6 http://www.ohr.int/?cat=366&paged=1      366     1
#>  7 http://www.ohr.int/?cat=366&paged=1      366     1
#>  8 http://www.ohr.int/?cat=366&paged=1      366     1
#>  9 http://www.ohr.int/?cat=366&paged=1      366     1
#> 10 http://www.ohr.int/?cat=366&paged=1      366     1
#> 11 http://www.ohr.int/?cat=366&paged=1      366     1
#> 12 http://www.ohr.int/?cat=366&paged=1      366     1
#> 13 http://www.ohr.int/?cat=366&paged=2      366     2
#> 14 http://www.ohr.int/?cat=366&paged=2      366     2
#> 15 http://www.ohr.int/?cat=366&paged=2      366     2
#> 16 http://www.ohr.int/?cat=366&paged=2      366     2
#> 17 http://www.ohr.int/?cat=366&paged=2      366     2
#> 18 http://www.ohr.int/?cat=366&paged=2      366     2
```

Created on 2019-05-08 by the reprex package (v0.2.1)

OK, here's a full reprex of the code you have above, but I am getting different results. Are your packages up-to-date?

``` r
library(rvest)
#> Loading required package: xml2
library(tidyverse)
library(glue)
#> 
#> Attaching package: 'glue'
#> The following object is masked from 'package:dplyr':
#> 
#>     collapse

#here I create the vector; I know that category 358 has less than 5 pages;
seq_categories <- c(350, 358, 366)
seq_pages <- seq(1, 5, 1)
df_seq <- expand.grid(seq_pages=seq_pages, seq_categories=seq_categories)
df_links <- df_seq %>%
  mutate(link=glue("http://www.ohr.int/?cat={seq_categories}&paged={seq_pages}"))

#define function
scr_bonn <- function(link_input) {
  
  pb$tick()$print()
  
  print(link_input)
  
  site <- read_html(link_input)
  if(!is.na(site)) { 
    
    date.publish <- site %>% 
      html_nodes(".date-publish") %>%
      html_text() %>% 
      enframe(name=NULL, value="date.publish")
    
    decision.name <- site %>%
      html_nodes(".name") %>%
      html_text %>% 
      enframe(name=NULL, value="decision.name")
    
    
    bind_cols(date.publish=date.publish, decision.name=decision.name)
    
  }
}                



# CORRECT RESULTS WITH MAP + MAP_DF ---------------------------------------
pb <- progress_estimated(nrow(df_links))

#map function
list_correct<- df_links$link %>% 
  set_names() %>% 
  map(., possibly(scr_bonn, otherwise=NULL, quiet=FALSE))
#> http://www.ohr.int/?cat=350&paged=1
#> http://www.ohr.int/?cat=350&paged=2
#> http://www.ohr.int/?cat=350&paged=3
#> http://www.ohr.int/?cat=350&paged=4
#> http://www.ohr.int/?cat=350&paged=5
#> http://www.ohr.int/?cat=358&paged=1
#> http://www.ohr.int/?cat=358&paged=2
#> http://www.ohr.int/?cat=358&paged=3
#> Error: HTTP error 404.
#> http://www.ohr.int/?cat=358&paged=4
#> Error: HTTP error 404.
#> http://www.ohr.int/?cat=358&paged=5
#> Error: HTTP error 404.
#> http://www.ohr.int/?cat=366&paged=1
#> http://www.ohr.int/?cat=366&paged=2
#> http://www.ohr.int/?cat=366&paged=3
#> http://www.ohr.int/?cat=366&paged=4
#> http://www.ohr.int/?cat=366&paged=5

df_correct <- list_correct %>% 
  map_df(., bind_rows, .id="link_to_page") %>% 
  mutate(page=str_extract(link_to_page, "(?<=paged\\=)[:digit:]+") %>% as.numeric) %>% 
  mutate(category=str_extract(link_to_page, "(?<=cat\\=)[:digit:]+") %>% as.numeric) 

df_correct %>% 
  filter(page>2 & category=="358")
#> # A tibble: 0 x 5
#> # … with 5 variables: link_to_page <chr>, date.publish <chr>,
#> #   decision.name <chr>, page <dbl>, category <dbl>



# WRONG RESULTS WHEN USING MAP_DF IMMEDIATELY -----------------------------
pb <- progress_estimated(nrow(df_links))

df_false <- df_links$link %>% 
  set_names() %>% 
  map_df(., possibly(scr_bonn, otherwise=NULL, quiet=FALSE), .id="link_to_page") %>% 
  mutate(page=str_extract(link_to_page, "(?<=paged\\=)[:digit:]+") %>% as.numeric) %>% 
  mutate(category=str_extract(link_to_page, "(?<=cat\\=)[:digit:]+") %>% as.numeric) 
#> http://www.ohr.int/?cat=350&paged=1
#> http://www.ohr.int/?cat=350&paged=2
#> http://www.ohr.int/?cat=350&paged=3
#> http://www.ohr.int/?cat=350&paged=4
#> http://www.ohr.int/?cat=350&paged=5
#> http://www.ohr.int/?cat=358&paged=1
#> http://www.ohr.int/?cat=358&paged=2
#> http://www.ohr.int/?cat=358&paged=3
#> Error: HTTP error 404.
#> http://www.ohr.int/?cat=358&paged=4
#> Error: HTTP error 404.
#> http://www.ohr.int/?cat=358&paged=5
#> Error: HTTP error 404.
#> http://www.ohr.int/?cat=366&paged=1
#> http://www.ohr.int/?cat=366&paged=2
#> http://www.ohr.int/?cat=366&paged=3
#> http://www.ohr.int/?cat=366&paged=4
#> http://www.ohr.int/?cat=366&paged=5

wrong <- df_false %>% 
  filter(page>2 & category=="358")
wrong
#> # A tibble: 0 x 5
#> # … with 5 variables: link_to_page <chr>, date.publish <chr>,
#> #   decision.name <chr>, page <dbl>, category <dbl>

wrong_in_correct_results <- df_correct %>% 
  filter(decision.name %in% wrong$decision.name) 

wrong_in_correct_results%>% 
  select(link_to_page, category, page)
#> # A tibble: 0 x 3
#> # … with 3 variables: link_to_page <chr>, category <dbl>, page <dbl>
```

Created on 2019-05-08 by the reprex package (v0.2.1)

Many thanks. This is all very strange. All packages are up to date, and I ran it on R 3.6 and R 3.5.1 (all on Windows).
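Since the two runs disagree, comparing the exact package versions on both machines may help narrow things down. A small sketch: the package selection is only my guess at what might be relevant, and `installed_version()` is a hypothetical helper, not a function from any package.

```r
# Packages that seem most likely to matter here (an assumption).
pkgs <- c("purrr", "dplyr", "tibble", "rvest", "xml2")

# Hypothetical helper: installed version as text, or NA if not installed.
installed_version <- function(pkg) {
  if (requireNamespace(pkg, quietly = TRUE)) {
    as.character(utils::packageVersion(pkg))
  } else {
    NA_character_
  }
}

# Named character vector: one version string per package.
versions <- vapply(pkgs, installed_version, character(1))
versions
```

Posting this output from both machines would show at a glance whether purrr or dplyr differ.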

I have now also checked whether it makes a difference if I set the otherwise option to something other than NULL. It does. With NULL the data are shifted: rows which should be missing are filled with data from the subsequent pages (category 366).

Here I specify a tibble to be returned when the otherwise option is triggered, and the missing pages are correctly identified.

``` r
pb <- progress_estimated(nrow(df_links))

#otherwise with a tibble instead of NULL
df_tibble <- df_links$link %>% 
  set_names() %>% 
  purrr::map_df(., possibly(scr_bonn, otherwise=tibble(decision.name="missing"), quiet=FALSE), .id="link_to_page") %>% 
  mutate(page=str_extract(link_to_page, "(?<=paged\\=)[:digit:]+") %>% as.numeric) %>% 
  mutate(category=str_extract(link_to_page, "(?<=cat\\=)[:digit:]+") %>% as.numeric) 
#> http://www.ohr.int/?cat=350&paged=1
#> http://www.ohr.int/?cat=350&paged=2
#> http://www.ohr.int/?cat=350&paged=3
#> http://www.ohr.int/?cat=350&paged=4
#> http://www.ohr.int/?cat=350&paged=5
#> http://www.ohr.int/?cat=358&paged=1
#> http://www.ohr.int/?cat=358&paged=2
#> http://www.ohr.int/?cat=358&paged=3
#> Error: HTTP error 404.
#> http://www.ohr.int/?cat=358&paged=4
#> Error: HTTP error 404.
#> http://www.ohr.int/?cat=358&paged=5
#> Error: HTTP error 404.
#> http://www.ohr.int/?cat=366&paged=1
#> http://www.ohr.int/?cat=366&paged=2
#> http://www.ohr.int/?cat=366&paged=3
#> http://www.ohr.int/?cat=366&paged=4
#> http://www.ohr.int/?cat=366&paged=5

#actually correct
df_tibble %>% 
  filter(category==358 & page > 2)
#> # A tibble: 3 x 5
#>   link_to_page                    date.publish decision.name  page category
#>   <chr>                           <chr>        <chr>         <dbl>    <dbl>
#> 1 http://www.ohr.int/?cat=358&pa~ <NA>         missing           3      358
#> 2 http://www.ohr.int/?cat=358&pa~ <NA>         missing           4      358
#> 3 http://www.ohr.int/?cat=358&pa~ <NA>         missing           5      358

df_tibble %>% 
  filter(category==366) %>% 
  nrow()
#> [1] 60
#60
```

If I specify NULL (as in the original example), rows are wrongly shifted and are missing from the subsequent category.

``` r
pb <- progress_estimated(nrow(df_links))

df_NULL <- df_links$link %>% 
  set_names() %>% 
  purrr::map_df(., possibly(scr_bonn, otherwise=NULL, quiet=FALSE), .id="link_to_page") %>% 
  mutate(page=str_extract(link_to_page, "(?<=paged\\=)[:digit:]+") %>% as.numeric) %>% 
  mutate(category=str_extract(link_to_page, "(?<=cat\\=)[:digit:]+") %>% as.numeric) 
#> http://www.ohr.int/?cat=350&paged=1
#> http://www.ohr.int/?cat=350&paged=2
#> http://www.ohr.int/?cat=350&paged=3
#> http://www.ohr.int/?cat=350&paged=4
#> http://www.ohr.int/?cat=350&paged=5
#> http://www.ohr.int/?cat=358&paged=1
#> http://www.ohr.int/?cat=358&paged=2
#> http://www.ohr.int/?cat=358&paged=3
#> Error: HTTP error 404.
#> http://www.ohr.int/?cat=358&paged=4
#> Error: HTTP error 404.
#> http://www.ohr.int/?cat=358&paged=5
#> Error: HTTP error 404.
#> http://www.ohr.int/?cat=366&paged=1
#> http://www.ohr.int/?cat=366&paged=2
#> http://www.ohr.int/?cat=366&paged=3
#> http://www.ohr.int/?cat=366&paged=4
#> http://www.ohr.int/?cat=366&paged=5

df_NULL %>% 
  filter(category==358 & page > 2)
#> # A tibble: 12 x 5
#>    link_to_page       date.publish decision.name              page category
#>    <chr>              <chr>        <chr>                     <dbl>    <dbl>
#>  1 http://www.ohr.in~ 01/05/2011   Order Suspending the App~     4      358
#>  2 http://www.ohr.in~ 09/12/2009   OHR Inventory Team Estab~     4      358
#>  3 http://www.ohr.in~ 06/25/2008   Decision Amending the La~     4      358
#>  4 http://www.ohr.in~ 06/25/2008   Decision Amending the La~     4      358
#>  5 http://www.ohr.in~ 06/25/2008   Decision Amending the La~     4      358
#>  6 http://www.ohr.in~ 12/19/2007   Decision Amending the La~     4      358
#>  7 http://www.ohr.in~ 12/19/2007   Decision Amending the La~     4      358
#>  8 http://www.ohr.in~ 12/19/2007   Decision Amending the La~     4      358
#>  9 http://www.ohr.in~ 09/28/2007   Decision Amending the La~     4      358
#> 10 http://www.ohr.in~ 09/28/2007   Decision Amending the La~     4      358
#> 11 http://www.ohr.in~ 09/28/2007   Decision Amending the La~     4      358
#> 12 http://www.ohr.in~ 09/14/2007   Decision Withdrawing the~     4      358

df_NULL %>% 
  filter(category==366) %>% 
  nrow()
#> [1] 48
#48 = 12 are missing => they are wrongly assigned to the preceding category where NULL was triggered
```
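The 60 vs. 48 comparison suggests a more general sanity check: count the rows each link produced in the plain map() list and compare that with the rows attributed to the same link after binding. Below is a sketch on toy data; the list `results` and its names are invented. Here `bound` is built with compact() and bind_rows(), but the same comparison could be run against the output of map_df().

```r
library(purrr)
library(dplyr)

# Toy stand-in for the list returned by map(links, possibly(...)).
results <- list(
  "cat=358&paged=2" = tibble(decision.name = c("a", "b")),
  "cat=358&paged=3" = NULL,
  "cat=366&paged=1" = tibble(decision.name = c("c", "d", "e"))
)

# Rows each link should contribute (0 for failed links).
expected <- map_int(results, ~ if (is.null(.x)) 0L else nrow(.x))

# The bound data frame to be checked.
bound <- bind_rows(compact(results), .id = "link_to_page")

# Rows actually attributed to each link, including links with zero rows.
observed <- table(factor(bound$link_to_page, levels = names(results)))

check <- tibble(
  link_to_page = names(results),
  expected = unname(expected),
  observed = as.integer(observed)
)

# Any row here means data ended up under the wrong link.
mismatches <- filter(check, expected != observed)
mismatches
```

An empty `mismatches` table means every link carries exactly the rows it produced.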

Here is my session info:

```
> R.version
               _                           
platform       i386-w64-mingw32            
arch           i386                        
os             mingw32                     
system         i386, mingw32               
status                                     
major          3                           
minor          5.1                         
year           2018                        
month          07                          
day            02                          
svn rev        74947                       
language       R                           
version.string R version 3.5.1 (2018-07-02)
nickname       Feather Spray               
> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: i386-w64-mingw32/i386 (32-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252    LC_MONETARY=German_Germany.1252 LC_NUMERIC=C                   
[5] LC_TIME=German_Germany.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] glue_1.3.1      forcats_0.4.0   stringr_1.3.1   dplyr_0.8.0.1   purrr_0.3.2     readr_1.3.1     tidyr_0.8.3     tibble_2.1.1   
 [9] ggplot2_3.1.1   tidyverse_1.2.1 rvest_0.3.3     xml2_1.2.0     

loaded via a namespace (and not attached):
 [1] withr_2.1.2      ps_1.3.0         tidyselect_0.2.5 lattice_0.20-35  pkgconfig_2.0.2  reprex_0.2.1     utf8_1.1.4      
 [8] compiler_3.5.1   fs_1.3.1         readxl_1.3.1     Rcpp_1.0.1       cli_1.1.0        plyr_1.8.4       cellranger_1.1.0
[15] httr_1.4.0       tools_3.5.1      nlme_3.1-137     broom_0.5.2      rmarkdown_1.12   R6_2.4.0         knitr_1.22      
[22] selectr_0.4-1    scales_1.0.0     digest_0.6.18    assertthat_0.2.0 curl_3.3         evaluate_0.13    gtable_0.2.0    
[29] fansi_0.4.0      stringi_1.4.3    rstudioapi_0.10  whisker_0.3-2    htmltools_0.3.6  backports_1.1.2  hms_0.4.2       
[36] munsell_0.5.0    grid_3.5.1       colorspace_1.3-2 lubridate_1.7.4  rlang_0.3.4      processx_3.3.0   clipr_0.6.0     
[43] callr_3.2.0      magrittr_1.5     generics_0.0.2   lazyeval_0.2.2   yaml_2.2.0       xfun_0.6         crayon_1.3.4    
[50] haven_2.1.0      modelr_0.1.4     pillar_1.3.1     jsonlite_1.6
```

I am fine with using the approach which works. But I am baffled by the behavior and can't figure out why this is happening.

Many thanks again for your help!
