I'd like to scrape some data from LinkedIn Learning but have come up against a stumbling block in rvest: how can I filter by two HTML classes at once?
Let's experiment with this page: https://www.linkedin.com/learning/r-statistics-essential-training?u=2125562. The number of viewers is displayed in the following span tag (if you're not logged into LinkedIn Learning):
<span class="content__info__item__value viewers">82,552</span>
Extracting all html_nodes with the class content__info__item__value yields an xml_nodeset with 4 different spans:
library("tidyverse")
#> ── Attaching packages ────────────────────────────────── tidyverse 1.2.1 ──
#> ✔ ggplot2 2.2.1.9000     ✔ purrr   0.2.4
#> ✔ tibble  1.3.4          ✔ dplyr   0.7.4
#> ✔ tidyr   0.7.2          ✔ stringr 1.2.0
#> ✔ readr   1.1.1          ✔ forcats 0.2.0
#> ── Conflicts ───────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag()    masks stats::lag()
library("rvest")
#> Loading required package: xml2
#>
#> Attaching package: 'rvest'
#> The following object is masked from 'package:purrr':
#>
#> pluck
#> The following object is masked from 'package:readr':
#>
#> guess_encoding
in_learning_url <- "https://www.linkedin.com/learning/r-statistics-essential-training?u=2125562"
in_learning_page <- read_html(in_learning_url)
in_learning_page %>%
  html_nodes(".content__info__item__value")
#> {xml_nodeset (4)}
#> [1] <span class="content__info__item__value duration">5h 59m 42s</span>
#> [2] <span class="content__info__item__value skill">Beginner + Intermedia ...
#> [3] <span class="content__info__item__value released">September 26, 2013 ...
#> [4] <span class="content__info__item__value viewers">82,552</span>
How can I now filter this xml_nodeset? Or specify multiple classes in html_nodes?
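To make the goal concrete, here is a minimal sketch of one approach that may work, assuming html_nodes accepts standard CSS selectors: chaining the two class names with no space between them should match only elements that carry both classes. The inline HTML fragment is a hypothetical stand-in for the live page (so the example runs offline), and the variable names are made up for illustration.

```r
library("rvest")

# Hypothetical stand-in for the live page, reduced to two of the four spans
snippet <- read_html('<div>
  <span class="content__info__item__value duration">5h 59m 42s</span>
  <span class="content__info__item__value viewers">82,552</span>
</div>')

# Chained class selectors (no space in between) match elements
# that have BOTH classes, not either one
viewers_node <- snippet %>%
  html_nodes(".content__info__item__value.viewers")

# Extract the text and strip the thousands separator before converting
viewers_text <- html_text(viewers_node)
viewers <- as.numeric(gsub(",", "", viewers_text))
```

If this behaves the same way on the real page, the same selector could be passed to html_nodes on in_learning_page directly, without filtering the xml_nodeset afterwards.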