Ignore Broken Website in Rvest


I have 20 different links and I want to scrape them with rvest.

library(rvest)
library(purrr)
map_df(links, function(link) {
  pg <- read_html(link)
  pg %>% html_nodes("ul > li") %>% html_text(trim = TRUE) %>% as.data.frame()
})

For example, the 15th link is broken and does not work. How can my code skip the 15th link and carry on with the rest? At the moment the loop stops at the broken link.

Thank you!!

This is the exact use case for tryCatch().

The basic idea is, you tell R to try to do something. If there is an error, instead of breaking your code, you get to tell it what to do instead.
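As a minimal sketch of that idea (safe_log() is a hypothetical helper made up for illustration, not something from the question), the first argument is the code to try and the error handler supplies the fallback value:

```r
# Hypothetical helper: take the natural log, or return NA if log() errors.
safe_log <- function(x) {
  tryCatch(
    log(x),
    error = function(e) {
      # Runs only when log(x) signals an error, e.g. for character input.
      NA_real_
    }
  )
}

safe_log(10)     # works as usual
safe_log("ten")  # log() would normally stop here; we get NA instead
```

The handler receives the condition object e, so it could also log conditionMessage(e) before returning the fallback.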

Say we had a function called self_extract() that takes a vector and extracts the elements at the indices given by that same vector, but throws an error under certain conditions.

df <- data.frame(id = sample(100:200, 10), x = sample(10, 10, TRUE), y = rnorm(10), z = sample(c(TRUE, FALSE), 10, TRUE))

self_extract <- function(x) {
  # Only integer or logical vectors with valid, in-range indices are allowed
  stopifnot(class(x) %in% c("integer", "logical"),
            all(x >= 0),
            max(x) <= length(x))
  x[x]
}

Now, say we wanted to do this on every column of a data.frame. It might not work for every column.

sapply(df, self_extract)
#> Error in FUN(X[[i]], ...): max(x) <= length(x) is not TRUE

But suppose it's critically important that we get results for the variables we can. That's where tryCatch() comes in: it lets us define an alternate behavior for when an error is encountered. Here, we will just capture the error text and move on.

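sapply(df,
       function(x) {
         tryCatch({
           self_extract(x)
         }, error = function(e) {
           as.character(e)  # keep the error text instead of stopping
         })
       })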
#> $id
#> [1] "Error in self_extract(x): max(x) <= length(x) is not TRUE\n"
#> $x
#>  [1]  7  9  3 10 10 10  3  8  7  3
#> $y
#> [1] "Error in self_extract(x): class(x) %in% c(\"integer\", \"logical\") is not TRUE\n"
#> $z

Created on 2020-09-02 by the reprex package (v0.3.0)
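Applied to the scraping loop from the question, the same pattern might look like the sketch below. It assumes links is the character vector of URLs from the question; scrape_items() and the link and text column names are hypothetical choices for illustration. Returning NULL from the error handler means map_df() simply adds no rows for a broken link.

```r
library(rvest)
library(purrr)

# Hypothetical wrapper: returns a data frame of list items for one link,
# or NULL when the page cannot be read or parsed.
scrape_items <- function(link) {
  tryCatch({
    pg <- read_html(link)
    items <- pg %>% html_nodes("ul > li") %>% html_text(trim = TRUE)
    data.frame(link = rep(link, length(items)),
               text = items,
               stringsAsFactors = FALSE)
  }, error = function(e) {
    # Report the failure and move on instead of stopping the loop.
    message("Skipping ", link, ": ", conditionMessage(e))
    NULL
  })
}

# Usage, assuming `links` holds the URLs from the question:
# result <- map_df(links, scrape_items)
```

Keeping the link in its own column makes it easy to see afterwards which pages contributed rows and which were skipped.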
