Trying to scrape a vector of URLs

Hi everyone! First time posting here. I have a vector of URLs, and I'm hoping to scrape each of them and get a vector - not a list - of the results. Anyway, here's a small portion of the URL vector:

links_short <- c("https://fronterasdesk.org/content/1619790/response-criticism-mexican-president-says-sonora-getting-adequate-federal-support", "https://fronterasdesk.org/content/1619779/mexican-security-head-highlights-sonoran-violence-hotspots-drop-kidnapping", "https://fronterasdesk.org/content/1619264/hermosillo-looks-grow-recycling-pepenadores-hope-preserve-their-role")

I've tried a lot of things, but the best I'm able to do is get a list returned, not a vector. Here's the code I used to do that:

library(rvest)    # read_html(), html_nodes(), html_text()
library(purrr)    # map()
library(stringr)  # str_replace_all()

map(links_short, function(x) {
  page <- read_html(x)
  sizehtml <- html_nodes(page, ".file-size")
  size <- html_text(sizehtml)
  # strip the parentheses, spaces, and "MB" label, e.g. "(6.41 MB)" -> "6.41"
  str_replace_all(size, c("\\(" = "", 
                          "\\)" = "",
                          " " = "",
                          "MB" = ""))
})

My end goal is to be able to add a column of the results to a dataframe with those URLs, but I've just been playing around with a vector of the URLs to learn. Any insight is appreciated! Thanks!
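
(For reference, here's a minimal sketch of the shape of output I'm after, swapping purrr::map_chr() for map() so a character vector comes back directly. As far as I understand, this only works if every page has exactly one .file-size node, which may not hold for my data.)

library(rvest)
library(purrr)
library(stringr)

# map_chr() flattens the results into a character vector instead of a list,
# but it errors if any iteration returns anything other than a single string
sizes <- map_chr(links_short, function(x) {
  read_html(x) %>%
    html_nodes(".file-size") %>%
    html_text() %>%
    str_replace_all(c("\\(" = "", "\\)" = "", " " = "", "MB" = ""))
})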

Maybe you want something like this. There is a problem with the first two URLs, since those pages do not have a node of class file-size.

That's the reason for the [3] in tibble(url = links_short[3])

library(tidyverse)
library(rvest)
library(stringr)

links_short <- c("https://fronterasdesk.org/content/1619790/response-criticism-mexican-president-says-sonora-getting-adequate-federal-support", "https://fronterasdesk.org/content/1619779/mexican-security-head-highlights-sonoran-violence-hotspots-drop-kidnapping", "https://fronterasdesk.org/content/1619264/hermosillo-looks-grow-recycling-pepenadores-hope-preserve-their-role")


tibble(url = links_short[3]) %>% 
    rowwise() %>% 
    mutate(content = read_html(url) %>% 
                    html_nodes(".file-size") %>% 
                    html_text())
#> # A tibble: 1 x 2
#> # Rowwise: 
#>   url                                                                   content 
#>   <chr>                                                                 <chr>   
#> 1 https://fronterasdesk.org/content/1619264/hermosillo-looks-grow-recy~ (6.41 M~

Thanks so much for responding! That seems like a promising approach! I should have mentioned it in the previous post, but not all of the URLs are going to have the ".file-size" class, and I'll still want those URLs included in the new column. However, when I run this on the full links_short object...

tibble(url = links_short) %>%
  rowwise() %>% 
  mutate(content = read_html(url) %>% 
           html_nodes(".file-size") %>% 
           html_text())

... I get this error: "Error: Column content must be length 1 (the group size), not 0"

It works fine when I run it on just the element that has that class, but not on those without it. Is there a way to do this when not all URLs will have this feature, or is that going to be a hard roadblock? Or maybe there's a way to first filter out URLs without it? Anyway, again, really appreciate the help!
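
(To make the filtering idea concrete, here's a rough sketch of the kind of pre-check I was picturing; purely hypothetical, and it means downloading each page twice.)

library(rvest)
library(purrr)

# Hypothetical pre-check: TRUE for URLs whose page contains at least one
# .file-size node, FALSE otherwise
has_file_size <- map_lgl(links_short, function(x) {
  length(html_nodes(read_html(x), ".file-size")) > 0
})

# keep only the URLs that have the node before building the tibble
links_with_size <- links_short[has_file_size]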

A potentially interesting addendum to my last response: even when I change the links_short object to two URLs that each have that class...

links_short <- c("https://fronterasdesk.org/content/1619790/response-criticism-mexican-president-says-sonora-getting-adequate-federal-support", "https://fronterasdesk.org/content/1605804/pandemic-drags-sonoran-businesses-face-tough-choices?_ga=2.198604975.1838608566.1602645661-2029265652.1585419609")

... I still get the same error when I run this:

tibble(url = links_short) %>%
  rowwise() %>% 
  mutate(content = read_html(url) %>% 
           html_nodes(".file-size") %>% 
           html_text())

Try using html_node() instead of html_nodes().

As the help says:

html_node vs html_nodes
html_node is like [[: it always extracts exactly one element. When given a list of nodes, html_node will always return a list of the same length; the result of html_nodes might be longer or shorter.

I assume that in your case you can be sure each HTML page has no more than one file-size node.

Each URL without the class file-size will generate an NA; all others will return the size.
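
Here is a tiny illustration of the difference, using a throwaway HTML snippet rather than your real pages (just a sketch; the markup is made up):

library(rvest)

page_with    <- read_html('<div><span class="file-size">(6.41 MB)</span></div>')
page_without <- read_html('<div><span class="other">hello</span></div>')

# html_nodes() returns an empty nodeset when nothing matches, so html_text()
# gives a zero-length vector -> the "must be length 1 ... not 0" error in mutate()
html_nodes(page_without, ".file-size") %>% html_text()
#> character(0)

# html_node() always returns exactly one result (a missing node if need be),
# so html_text() gives NA instead of a zero-length vector
html_node(page_without, ".file-size") %>% html_text()
#> [1] NA

html_node(page_with, ".file-size") %>% html_text()
#> [1] "(6.41 MB)"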

library(tidyverse)
library(rvest)
library(stringr)

links_short <- c("https://fronterasdesk.org/content/1619790/response-criticism-mexican-president-says-sonora-getting-adequate-federal-support", "https://fronterasdesk.org/content/1619779/mexican-security-head-highlights-sonoran-violence-hotspots-drop-kidnapping", "https://fronterasdesk.org/content/1619264/hermosillo-looks-grow-recycling-pepenadores-hope-preserve-their-role")


tibble(url = links_short) %>% 
    rowwise() %>% 
    mutate(content = read_html(url) %>% 
                    html_node(".file-size") %>%   # a "missing" node when absent -> NA after html_text()
                    html_text() %>%
                    str_replace_all(c("\\(" = "", "\\)" = "", " " = "", "MB" = "")) %>% 
                    as.double() %>% 
                    replace_na(0)
                 ) %>%
    ungroup()
#> # A tibble: 3 x 2
#>   url                                                                    content
#>   <chr>                                                                    <dbl>
#> 1 https://fronterasdesk.org/content/1619790/response-criticism-mexican-~    0   
#> 2 https://fronterasdesk.org/content/1619779/mexican-security-head-highl~    0   
#> 3 https://fronterasdesk.org/content/1619264/hermosillo-looks-grow-recyc~    6.41

I am not sure whether you prefer NAs or 0s for missing values. I added the conversion to 0 (and to double) - remove the %>% replace_na(0) part to keep NAs.
I also added an ungroup() in case you want to work with the data and are not familiar with rowwise tibbles.

Regards
Seb


This looks even more promising. Can't wait to try it! Yeah, I was unfamiliar with rowwise, and clearly need to do some reading up on it. Many thanks again! This is great!

Worked like a charm! Went through over 3,000 rows just fine!
