What's the most interesting use of rvest you've seen in the wild?

In this SO answer, I leverage other tidyverse packages to make scraping a little more robust (with RETRY in case of failure) and a little more gentle on the server (with basic Sys.sleep() calls): https://stackoverflow.com/questions/43218761

library(purrr)
library(dplyr)
library(httr)
library(xml2)
library(rvest)

safe_retry_read_html <- 
  possibly(~ read_html(RETRY("GET", url = .x)), 
           otherwise = read_html("<html></html>"))

links <- c("https://www.ratebeer.com/beer/8481/",
           "https://www.ratebeer.com/beer/3228/",
           "https://www.ratebeer.com/beer/10325/")

links %>%
  c("https://www.wrong-url.foobar") %>% 
  purrr::set_names() %>% 
  map(~ {
    Sys.sleep(1 + runif(1))
    safe_retry_read_html(.x)
  }) %>%
  map(html_node, "#_brand4 span") %>%
  map_chr(html_text)

#  https://www.ratebeer.com/beer/8481/  https://www.ratebeer.com/beer/3228/ 
#                        "Föroya Bjór"               "King Brewing Company" 
# https://www.ratebeer.com/beer/10325/         https://www.wrong-url.foobar 
#                "Bavik-De Brabandere"                                   NA

Probably not the most interesting, but I found it incredibly useful for the "Media and Politics" class I taught during the 2016 US presidential campaign. I used it to scrape debate transcripts and news stories, providing objective data to supplement/challenge students' personal perceptions of news coverage and bias.

And I found Julia Silge and David Robinson's book, Text Mining with R, very useful for this endeavor.
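
A minimal sketch of that kind of transcript scrape (not from the original post; the URL and the speaker pattern are placeholders for whichever transcript source you use):

library(rvest)
library(dplyr)
library(stringr)
library(tidyr)

# placeholder URL -- substitute the transcript page you are actually using
transcript_url <- "https://example.com/2016-debate-3-transcript"

paragraphs <- read_html(transcript_url) %>%
  html_nodes("p") %>%              # transcripts are often plain <p> blocks
  html_text(trim = TRUE)

# tag each paragraph with its speaker when lines start with e.g. "CLINTON:"
debate_df <- tibble(text = paragraphs) %>%
  mutate(speaker = str_match(text, "^([A-Z]+):")[, 2]) %>%
  fill(speaker)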

Here's a figure from the debate 3 analysis:

@hadley, I used rvest to get data for this small article that I've just posted.
http://taraskaduk.com/2017/09/mpaa/

rmarkdown document on github: https://github.com/taraskaduk/taraskaduk/blob/master/public/posts/movies/mpaa.Rmd


Inspired by the question, I wrote up this one today. Another sports one (cricket, this time).


I needed to scrape lots of tables of public health data, many of them identifiable only by their headers, from sub-sub-pages. The most interesting part was the solutions proposed to me on Stack Overflow.

In one case I needed RSelenium to help rvest find its way (a JavaScript dropdown that wasn't visible in the HTML).
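
Neither snippet below is from those Stack Overflow answers; they are just minimal sketches of the two situations, with placeholder URLs, header text, and selectors. First, picking out a table by its header text with an XPath predicate:

library(rvest)

page <- read_html("https://example.com/public-health-stats")  # placeholder URL

# keep only tables whose header row mentions a given label
tables <- page %>%
  html_nodes(xpath = "//table[.//th[contains(., 'Mortality rate')]]") %>%
  html_table(fill = TRUE)

And a rough outline of handing a JavaScript-rendered page from RSelenium back to rvest (assumes a local Selenium/driver setup):

library(RSelenium)
library(rvest)

rd    <- rsDriver(browser = "chrome")
remDr <- rd$client

remDr$navigate("https://example.com/report-portal")            # placeholder URL

# click the option in the JavaScript dropdown that rvest alone can't see
remDr$findElement("css selector", "#region option[value='north']")$clickElement()

# hand the fully rendered page back to rvest
page   <- read_html(remDr$getPageSource()[[1]])
tables <- page %>% html_nodes("table") %>% html_table(fill = TRUE)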


This just popped up in my Twitter feed, and I think it's a really interesting use case :slight_smile: http://giorasimchoni.com/2017/09/24/2017-09-24-where-my-girls-at/


This is great. Along the same lines, I'd written a quick scraper for MMA records via the fightmetric.com website:

### Setup
library(rvest)
library(stringr)
library(tidyr)
library(methods)
library(foreach)
library(magrittr)

### Source: loop over the alphabetical index pages and row-bind each fighter table
fighters_all <- foreach(n = letters, .combine = rbind) %do% {
  site <- paste0('http://www.fightmetric.com/statistics/fighters?char=', n, '&page=all') %>% read_html()
  fighter_table <- site %>% html_nodes('table')
  html_table(fighter_table, fill = TRUE)[[1]]
}

### Dedupe
fighters_all <- unique(fighters_all)

Well, here is a function that brings down all public shares traded on any specified market run by NASDAQ OMX Nordic and also sorts out the hyperlinks included in a couple of the HTML-table columns. But it is not from the wild, it is from my C-drive.

getsharelist <- function(market="stockholm") {
    # market values allowed: 
    # "stockholm" 
    # "first-north"
    # "baltic"
    # "copenhagen"
    # "helsinki"
    # "iceland"
    # "nordic-large-cap"
    # "nordic-mid-cap"
    # "nordic-small-cap"
    # "norwegian-listed-shares"
    baseurl <- "http://www.nasdaqomxnordic.com"
    relurl <- paste0("shares/listed-companies/", market)
    absurl <- paste(baseurl, relurl, sep="/")
    htmlpage <- xml2::read_html(absurl)
    htmlpage %>% rvest::html_node("table#listedCompanies") -> shareshtmltable
    shareshtmltable %>% rvest::html_table(header = TRUE) -> sharelist
    names(sharelist) <- tolower(names(sharelist))
    names(sharelist) <- gsub(" ", "_", names(sharelist))
    sharelist$fact_sheet <- NULL
    shareshtmltable %>% rvest::html_nodes("td:nth-child(1) a") %>% rvest::html_attr("href") -> url_share_info
    sharelist$url_share_info <- paste0(baseurl, url_share_info)
    shareshtmltable %>% rvest::html_nodes("td:nth-child(7) a") %>% rvest::html_attr("href") -> sharelist$url_factsheet
    marketname <- gsub("-", " ", market)
    marketname <- unlist(strsplit(marketname, " "))
    marketname <- paste(toupper(substring(marketname, 1,1)), substring(marketname, 2), sep="", collapse=" ")
    sharelist$market <- paste("NASDAQ", marketname)
    return(sharelist)
}
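
For what it's worth, a quick usage sketch (the function relies on %>%, so magrittr or the tidyverse needs to be attached):

library(magrittr)

stockholm <- getsharelist("stockholm")

# the function adds these columns on top of whatever the site's table contains
str(stockholm[, c("market", "url_share_info", "url_factsheet")])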

I've never used rvest but I'm hoping to learn how to use it so that I can scrape NHL scores from nhl.com daily to save me some time. I use the scores for an Elo-based prediction model on a game-by-game basis, similar to what FiveThirtyEight does with baseball. Not terribly unusual, but it is a very fun project to work on.
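
For anyone curious, the Elo update itself is only a couple of lines; a minimal sketch (the K-factor and ratings here are made up):

# expected score for team A given ratings r_a and r_b
elo_expected <- function(r_a, r_b) 1 / (1 + 10 ^ ((r_b - r_a) / 400))

# update A's rating after a game: outcome is 1 (A wins), 0 (A loses), 0.5 (split, say)
elo_update <- function(r_a, r_b, outcome, k = 20) {
  r_a + k * (outcome - elo_expected(r_a, r_b))
}

elo_update(1500, 1550, outcome = 1)   # A's new rating after an upset win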


I used rvest to catch Pokémon in the Wild World Web

It isn't written up anywhere, but someone recently told me they cried tears of joy when I explained that, for the dozens of organisational policies they were responsible for updating (which only existed on the web), they could use rvest to assemble local copies and build a tabular arrangement of the metadata, like the original author and the date each policy is due to be updated.
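
Roughly, that workflow looks like the sketch below (the URLs and CSS selectors are placeholders, not anyone's real policy pages):

library(rvest)
library(purrr)
library(dplyr)

policy_urls <- c("https://example.org/policies/privacy",     # placeholder URLs
                 "https://example.org/policies/travel")

dir.create("policies", showWarnings = FALSE)

policy_meta <- map_dfr(policy_urls, function(url) {
  page <- read_html(url)

  # keep a local copy of each policy
  xml2::write_html(page, file.path("policies", paste0(basename(url), ".html")))

  # and pull whatever metadata the pages expose (selectors are placeholders)
  tibble(
    url        = url,
    title      = page %>% html_node("h1") %>% html_text(trim = TRUE),
    author     = page %>% html_node(".author") %>% html_text(trim = TRUE),
    review_due = page %>% html_node(".review-date") %>% html_text(trim = TRUE)
  )
})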

I'm only part way through the project, but I'm using it to scrape an online booking site for court availability at my local sport club at select times. If the court I want is free, it can then book it for me. The blue-sky plan is to then integrate with Twilio to send me a text to confirm the booking!

I'm using rvest in a way similar to @eric_bickel, to scrape Berkeley Earth temperature data. I scrape the links for countries from their HTML table, download the data and then match it with countries in other datasets using @drob's fuzzyjoin (although admittedly matching country names is not a great use for that package). It needs a lot more time in the oven, but I'm documenting the experiments as I go :slight_smile:
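
In case it helps anyone reproduce it, the link-harvesting step is roughly this (the index URL and selector are from memory, so treat them as placeholders):

library(rvest)
library(dplyr)

index <- read_html("http://berkeleyearth.lbl.gov/country-list/")   # placeholder URL

country_links <- tibble(
  country = index %>% html_nodes("table a") %>% html_text(trim = TRUE),
  url     = index %>% html_nodes("table a") %>% html_attr("href")
)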


For standardizing country names, I recommend trying the countrycode package, which comes with a set of regexes for recognizing country names.

You can convert them into 2 character codes (then join on those) with:

library(countrycode)
cnames <- c("United States", "United States of America", "South Korea", "Korea, South")

countrycode(cnames, "country.name", "iso2c")

Though if you'd like to use fuzzyjoin, you can also use regex_inner_join() to join with countrycode::countrycode_data on the country.name.en.regex column. Good luck!
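
A rough sketch of that fuzzyjoin route (note the assumptions: in current countrycode releases the lookup table is called codelist rather than countrycode_data, and its regexes are written against lower-case names, hence the str_to_lower()):

library(fuzzyjoin)
library(countrycode)
library(dplyr)
library(stringr)

messy <- tibble(country = c("United States of America", "Korea, South"))

messy %>%
  mutate(country_clean = str_to_lower(country)) %>%
  regex_inner_join(
    codelist %>% select(country.name.en.regex, iso2c),
    by = c(country_clean = "country.name.en.regex")
  )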


Yeah, I found that trying to match country names with fuzzyjoin was difficult, since most of the string-distance algorithms will always favour words with substituted letters (e.g. Ireland against Iceland) over phrases that are supersets of others ("United States of America" against "United States"). I didn't know that countrycode has regexes, though, so I'll check that out! :smiley:
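
For example, with plain edit distance (which the stringdist-based joins use under the hood):

library(stringdist)

stringdist("Ireland", "Iceland")                          # 1: a single substitution
stringdist("United States", "United States of America")   # 11: a long but harmless suffix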

http://taraskaduk.com/2017/09/do-mpaa-movie-ratings-mean-anything/


This is pretty cool! To expand on it, how can you add random user-agent strings (Mozilla, Chrome, IE, etc.) to that safe_retry_read_html function? Thank you!
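
Not an answer from the thread, but one possible approach: httr lets you pass a user_agent() config into RETRY(), so you could sample from a vector of agent strings on each call. A sketch (the agent strings are just examples):

library(httr)
library(rvest)
library(purrr)

user_agents <- c(
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6)",
  "Mozilla/5.0 (X11; Linux x86_64)"
)

safe_retry_read_html <-
  possibly(~ read_html(RETRY("GET", url = .x, user_agent(sample(user_agents, 1)))),
           otherwise = read_html("<html></html>"))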