Target span tags with multiple classes using rvest

#1

I’d like to scrape some data from LinkedIn Learning but have come up against a stumbling block in rvest: how can I filter by two HTML classes at once?

Let’s experiment with this page: https://www.linkedin.com/learning/r-statistics-essential-training?u=2125562. The number of viewers is displayed in the following span tag (if you’re not logged into LinkedIn Learning):

<span class="content__info__item__value viewers">82,552</span>

Extracting all html_nodes with the class content__info__item__value yields an xml_nodeset with 4 different spans:

library("tidyverse")
#> ── Attaching packages ───────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
#> ✔ ggplot2 2.2.1.9000     ✔ purrr   0.2.4     
#> ✔ tibble  1.3.4          ✔ dplyr   0.7.4     
#> ✔ tidyr   0.7.2          ✔ stringr 1.2.0     
#> ✔ readr   1.1.1          ✔ forcats 0.2.0
#> ── Conflicts ──────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag()    masks stats::lag()
library("rvest")
#> Loading required package: xml2
#> 
#> Attaching package: 'rvest'
#> The following object is masked from 'package:purrr':
#> 
#>     pluck
#> The following object is masked from 'package:readr':
#> 
#>     guess_encoding

in_learning_url <- "https://www.linkedin.com/learning/r-statistics-essential-training?u=2125562"

in_learning_page <- read_html(in_learning_url)

in_learning_page %>%
  html_nodes(".content__info__item__value")
#> {xml_nodeset (4)}
#> [1] <span class="content__info__item__value duration">5h 59m 42s</span>
#> [2] <span class="content__info__item__value skill">Beginner + Intermedia ...
#> [3] <span class="content__info__item__value released">September 26, 2013 ...
#> [4] <span class="content__info__item__value viewers">82,552</span>

How can I now filter this xml_nodeset? Or specify multiple classes in html_nodes?


#2

You can do this:

in_learning_page %>%
  html_nodes(".content__info__item__value") %>% 
  str_subset(., "viewers")

This assumes the span you want always has the viewers class. You could also do this:

in_learning_page %>%
  html_nodes(".content__info__item__value") %>% 
  .[[4]]

This assumes it will always be the 4th element of the returned nodeset.
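
If you would rather keep working with nodes than with strings or positions, one option is to filter the nodeset on its class attribute. The following is only a sketch: the markup is a hypothetical stand-in built from the spans shown in the question, not the live page, and the names page, spans and has_viewers are made up for illustration.

```r
library(rvest)

# Hypothetical stand-in for the live page, built from the spans in the question
page <- read_html(paste0(
  '<div>',
  '<span class="content__info__item__value duration">5h 59m 42s</span>',
  '<span class="content__info__item__value skill">Beginner + Intermediate</span>',
  '<span class="content__info__item__value released">September 26, 2013</span>',
  '<span class="content__info__item__value viewers">82,552</span>',
  '</div>'
))

spans <- html_nodes(page, ".content__info__item__value")

# Split each class attribute into individual class names,
# then flag the nodes whose class list contains "viewers"
has_viewers <- vapply(
  strsplit(html_attr(spans, "class"), "\\s+"),
  function(classes) "viewers" %in% classes,
  logical(1)
)

# Logical subsetting keeps an xml_nodeset, so html_text() still works
viewers_text <- html_text(spans[has_viewers])
```

Because the result of the subsetting is still a nodeset, it can be passed straight to html_text() or html_attr() without any regex.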


#3

Thanks @tbradley. Ideally, I’d like something that I can then pass to html_text so as to extract the contents of the span tag cleanly, without relying on regex, à la https://stackoverflow.com/a/1732454/1659890


#4

The argument is a standard CSS selector, so you can match either class or require both:

# has either class
in_learning_page %>%
  html_nodes(".content__info__item__value, .skill")
#> {xml_nodeset (4)}
#> [1] <span class="content__info__item__value duration">5h 59m 42s</span>
#> [2] <span class="content__info__item__value skill">Beginner + Intermediate</span>
#> [3] <span class="content__info__item__value released">September 26, 2013</span>
#> [4] <span class="content__info__item__value viewers">82,552</span>

# has both classes
in_learning_page %>%
  html_nodes(".content__info__item__value.skill")
#> {xml_nodeset (1)}
#> [1] <span class="content__info__item__value skill">Beginner + Intermediate</span>
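
To make the difference between the two selector forms easier to see without hitting the live site, here is a small sketch on inline HTML. The markup is hypothetical: it reuses the four spans from the question and adds one extra span that has only the skill class, purely for illustration.

```r
library(rvest)

# Hypothetical markup: the four spans from the question plus one extra span
# carrying only the "skill" class, added to make the difference visible
page <- read_html(paste0(
  '<div>',
  '<span class="content__info__item__value duration">5h 59m 42s</span>',
  '<span class="content__info__item__value skill">Beginner + Intermediate</span>',
  '<span class="content__info__item__value released">September 26, 2013</span>',
  '<span class="content__info__item__value viewers">82,552</span>',
  '<span class="skill">extra</span>',
  '</div>'
))

# Comma-separated selectors: nodes matching either selector
either <- html_nodes(page, ".content__info__item__value, .skill")

# Chained classes with no space: nodes carrying both classes
both <- html_nodes(page, ".content__info__item__value.skill")

n_either <- length(either)   # all five spans
n_both <- length(both)       # only the span with both classes
both_text <- html_text(both)
```

Note that a space between the two classes would instead be a descendant selector (a .skill element nested inside a .content__info__item__value element), which matches nothing in this markup because the spans are not nested.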

#5

Ah @danr thanks! I tried every combination except for

.class1.class2

Now that I think about it, that should’ve been obvious, and I have what I wanted 🙂

library("tidyverse")
#> ── Attaching packages ───────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
#> ✔ ggplot2 2.2.1.9000     ✔ purrr   0.2.4     
#> ✔ tibble  1.3.4          ✔ dplyr   0.7.4     
#> ✔ tidyr   0.7.2          ✔ stringr 1.2.0     
#> ✔ readr   1.1.1          ✔ forcats 0.2.0
#> ── Conflicts ──────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag()    masks stats::lag()
library("rvest")
#> Loading required package: xml2
#> 
#> Attaching package: 'rvest'
#> The following object is masked from 'package:purrr':
#> 
#>     pluck
#> The following object is masked from 'package:readr':
#> 
#>     guess_encoding

in_learning_url <- "https://www.linkedin.com/learning/r-statistics-essential-training?u=2125562"

in_learning_page <- read_html(in_learning_url)

in_learning_page %>%
  html_nodes(".content__info__item__value.viewers") %>%
  html_text() %>%
  parse_number()
#> [1] 82552
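
Since only one span is expected here, a possible variation is html_node() (singular), which returns a single node rather than a nodeset. The sketch below runs on hypothetical inline markup rather than the live page, and the second selector uses a made-up class name (nonexistent) to show what comes back when nothing matches.

```r
library(rvest)
library(readr)

# Hypothetical stand-in for the live page
page <- read_html(
  '<div><span class="content__info__item__value viewers">82,552</span></div>'
)

# html_node() returns the first match as a single node
viewers <- page %>%
  html_node(".content__info__item__value.viewers") %>%
  html_text() %>%
  parse_number()

# With no match, html_text() yields NA, so the result is NA
# rather than a zero-length vector
missing <- page %>%
  html_node(".content__info__item__value.nonexistent") %>%
  html_text() %>%
  parse_number()
```

Getting NA for a missing span can be easier to handle than an empty vector if the same code is later applied to several course pages.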