Is it possible to get info from address bar (but not actual URL)?

TLDR: I have a dataset of links that return 404 errors, but when the 404 page loads, the browser's address bar shows a useful URL. Can I access that "useful URL" in R?

I'm trying to scrape data from a webpage, but I'm (understandably) getting a 404 error for the URLs below. However, the browser's address bar on the 404 page holds information that I'd like to capture. Here's the example:

library(tidyverse)
library(rvest)

url <- "http://www.uscho.com/scoreboard/division-i-men/20172018/composite-schedule/"

link_list <- url %>%
  read_html() %>%
  html_nodes("td:nth-child(13) a") %>%
  html_attr("href") %>%
  {paste0("http://www.uscho.com", .)}

Now, for example, open the 200th link (http://www.uscho.com/recaplink.php?gid=1_970_20172018) in your web browser. You'll get a 404 error page.

I don't actually want to get a 404 Error, but in the address bar, there's a URL that -- after some manipulation -- I can use to get the actual webpage that I want ("https://www.uscho.com/recaps/?p=171810970")

This URL, however, doesn't show up in R anywhere from what I can tell. Running read_html(link_list[200]), I only get a 404 error.

Any idea how I can get the URL from the browser within R?

FYI, I asked this question on Stack Exchange earlier, but chances are it won't get answered there, and I thought this might be a better place to ask.

Actually there may be a pattern to find the "good" link that's a lot easier than any other way

And... there isn't really a pattern. Oh well

Maybe you could get what you want with {RSelenium}.

You have a method getCurrentUrl() in the remote driver object.
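
For reference, a minimal sketch of that idea, assuming `remDr` is an RSelenium remote driver that has already been opened. The helper name `get_final_url` is made up for illustration; `navigate()` and `getCurrentUrl()` are the driver's own methods.

```r
# Hypothetical helper: load a page in the Selenium-driven browser and
# return whatever URL the browser ends up on.
get_final_url <- function(remDr, url) {
  remDr$navigate(url)
  # getCurrentUrl() returns a list of length one; pull the string out
  remDr$getCurrentUrl()[[1]]
}

# Usage, assuming a Selenium server is running and remDr$open() was called:
# get_final_url(remDr, link_list[200])
```

Because the browser itself loads the page, this should report the same address you see in the address bar, even when the page is a 404.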

{seleniumPipes} is a pipe-friendly alternative to RSelenium:

From the docs, to get the current page URL:

Piped:     remDr %>% go("http://www.bbc.co.uk") %>% getCurrentUrl
Non-piped: getCurrentUrl(go(remDr, "http://www.bbc.co.uk"))

Thanks for the response! I'll try that. By the way, are you familiar with splashr at all? I finally figured how to get that working (still haven't quite figured out RSelenium). Do you know of any similar method for getCurrentUrl() in splashr?

If you read the doc, it seems that there is:
http://splash.readthedocs.io/en/stable/scripting-ref.html#splash-url

splash:url

Signature: url = splash:url()

Returns: the current URL.

Async: no.

However, I'm not sure how it is implemented in splashr. If it isn't, this could be an opportunity to add the feature with a PR.
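
In the meantime, one possible workaround is to send a small Lua script to Splash's HTTP `execute` endpoint yourself. This is only a rough, untested sketch: it assumes a Splash instance listening on localhost:8050 and the httr package, and the names `lua_current_url`, `splash_request`, and `splash_current_url` are invented for illustration.

```r
# Lua script for Splash: load the page, then report the browser's current URL.
# splash:go() is not wrapped in assert(), so a 404 does not abort the script.
lua_current_url <- '
function main(splash, args)
  splash:go(args.url)
  return {url = splash:url()}
end
'

# Hypothetical helper: build the request for Splash's /execute endpoint.
# The host is an assumption; change it to wherever Splash is running.
splash_request <- function(target, host = "http://localhost:8050") {
  list(
    endpoint = paste0(host, "/execute"),
    query    = list(lua_source = lua_current_url, url = target)
  )
}

# Hypothetical helper: send the request and pull the URL out of the JSON reply.
splash_current_url <- function(target, host = "http://localhost:8050") {
  req  <- splash_request(target, host)
  resp <- httr::GET(req$endpoint, query = req$query)
  httr::content(resp, as = "parsed", type = "application/json")$url
}

# Usage (needs a running Splash instance):
# splash_current_url("http://www.uscho.com/recaplink.php?gid=1_970_20172018")
```

Keeping the request builder separate from the network call makes it easy to inspect what would be sent before pointing it at a live server.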


Thanks @cderv! I'll look into that as soon as I can. I'm gonna try to see if the RSelenium solution works, but I'm having quite the time trying to get RSelenium set up...

Thanks again!

I think using the Docker setup is the easiest way!
http://rpubs.com/johndharrison/RSelenium-Docker

That way you "just" download the image, run the container as a server, and connect R to that server on your local machine.
The last time I used RSelenium, it was with Docker.
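
The R side of that setup might look roughly like this. It assumes the container was started with the Selenium port mapped to 4445 on localhost and a Firefox image, so adjust the host, port, and browser to match your own container. `connect_selenium` is just an illustrative name.

```r
# Hypothetical helper: connect R to a Selenium server running in Docker.
# Host, port, and browser are assumptions about how the container was started.
connect_selenium <- function(host = "localhost", port = 4445L, browser = "firefox") {
  remDr <- RSelenium::remoteDriver(
    remoteServerAddr = host,
    port             = port,
    browserName      = browser
  )
  remDr$open(silent = TRUE)  # start a browser session on the server
  remDr
}

# Usage (needs the container to be running):
# remDr <- connect_selenium()
# remDr$close()
```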
