Is it possible to get info from address bar (but not actual URL)?

rvest

#1

TLDR: I have a data-set of links that give me 404 errors, but there's a useful URL in the address bar that comes when I get the 404 error. Can I access that "useful URL" in R?

I'm trying to scrape data from a webpage, but I'm (understandably) getting a 404 error for the URLs below. However, there's data from the 404 link that I'm trying to get from within the browser. Here's the example:

library(tidyverse)
library(rvest)

url <- "http://www.uscho.com/scoreboard/division-i-men/20172018/composite-schedule/"

link_list <- url %>%
  read_html() %>%
  html_nodes("td:nth-child(13) a") %>%
  html_attr("href") %>%
  {paste0("http://www.uscho.com", .)}

Now, for example, open the 200th link (http://www.uscho.com/recaplink.php?gid=1_970_20172018) in your web browser. You'll get a 404 error page.

I don't actually want to get a 404 Error, but in the address bar, there's a URL that -- after some manipulation -- I can use to get the actual webpage that I want ("https://www.uscho.com/recaps/?p=171810970")

This URL, however, doesn't show up in R anywhere from what I can tell. Running read_html(link_list[200]), I only get a 404 error.

Any idea how I can get the URL from the browser within R?

FYI, I asked this question on Stack Exchange earlier, but chances are it won't get answered there, and I thought this might be a better place to ask.


#2

Actually there may be a pattern to find the "good" link that's a lot easier than any other way


#3

And... there isn't really a pattern. Oh well


#4

Maybe with {RSelenium} you could get what you want.

The remote driver object has a getCurrentUrl() method.

{seleniumPipes} is a pipe-friendly alternative to RSelenium.

From the docs, to get the current page URL:

Piped: remDr %>% go("http://www.bbc.co.uk") %>% getCurrentUrl
Non-piped: getCurrentUrl(go(remDr, "http://www.bbc.co.uk"))
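For the RSelenium route, a minimal sketch might look like the following. It assumes a Selenium server is already running and that `remDr` is an opened remote driver; the helper name `url_after_navigation` is hypothetical, not part of RSelenium.

```r
# Hypothetical helper: navigate with an already opened RSelenium remote driver
# and return the URL the browser ends up on (after any redirects).
url_after_navigation <- function(remDr, url) {
  remDr$navigate(url)
  # getCurrentUrl() returns a list; the URL is its first element
  remDr$getCurrentUrl()[[1]]
}

# Usage (needs a running Selenium server, so left commented out;
# host, port and browser are assumptions about your setup):
# remDr <- RSelenium::remoteDriver(remoteServerAddr = "localhost",
#                                  port = 4445L, browserName = "firefox")
# remDr$open()
# final_urls <- vapply(link_list, url_after_navigation, character(1), remDr = remDr)
# remDr$close()
```

Applied over `link_list` from the question, this should return the address-bar URL for each link, which can then be manipulated as needed.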

#5

Thanks for the response! I'll try that. By the way, are you familiar with splashr at all? I finally figured out how to get that working (still haven't quite figured out RSelenium). Do you know of any method in splashr similar to getCurrentUrl()?


#6

If you read the doc, it seems that there is:
http://splash.readthedocs.io/en/stable/scripting-ref.html#splash-url

splash:url

Signature: url = splash:url()

Returns: the current URL.

Async: no.

However, I'm not sure whether it is implemented in splashr. If not, it could be an opportunity to add this feature with a PR.
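In the meantime, one possible workaround is to skip splashr and send a small Lua script straight to Splash's HTTP `execute` endpoint. This is only a sketch: it assumes Splash is listening on localhost:8050 and that the httr package is installed, and the helper name `splash_current_url` is hypothetical.

```r
# Lua script run by Splash: load the page, then report the browser's current URL.
# splash:go() is deliberately not wrapped in assert(), so a 404 does not abort the script.
splash_lua <- '
function main(splash)
  splash:go(splash.args.url)
  return {url = splash:url()}
end
'

# Hypothetical helper; host and port are assumptions about your Splash instance.
splash_current_url <- function(url, splash_host = "http://localhost:8050") {
  resp <- httr::GET(
    paste0(splash_host, "/execute"),
    query = list(lua_source = splash_lua, url = url)
  )
  # Splash returns the Lua table as JSON; pull out the url field
  httr::content(resp, as = "parsed", type = "application/json")$url
}

# Usage (needs a running Splash instance, so left commented out):
# splash_current_url(link_list[200])
```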


#7

Thanks @cderv! I'll look into that as soon as I can. I'm gonna try to see if the RSelenium solution works, but I'm having quite the time trying to get RSelenium set up...

Thanks again!


#8

I think using the docker setup is the easiest way!
http://rpubs.com/johndharrison/RSelenium-Docker

That way you "just" download the image, run the container as a server and connect R to this server on your local machine.
Last time I used it, it was with docker.
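Roughly, connecting from R could look like the sketch below. The image name, the port mapping and the helper name `connect_selenium` are assumptions for illustration; adjust them to whatever container you actually start.

```r
# Start the server first, from a shell (image and port mapping are assumptions):
#   docker run -d -p 4445:4444 selenium/standalone-firefox

# Hypothetical helper: connect R to the Selenium server running in the container.
connect_selenium <- function(host = "localhost", port = 4445L) {
  remDr <- RSelenium::remoteDriver(
    remoteServerAddr = host,
    port = port,
    browserName = "firefox"
  )
  remDr$open(silent = TRUE)  # start a browser session on the server
  remDr
}

# Usage (needs the container to be running, so left commented out):
# remDr <- connect_selenium()
# remDr$navigate("http://www.uscho.com/recaplink.php?gid=1_970_20172018")
# remDr$getCurrentUrl()[[1]]
# remDr$close()
```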