I'd like to scrape some data from LinkedIn Learning but have come up against a stumbling block in rvest: how can I filter by two HTML classes at once?
Let's experiment with this page: https://www.linkedin.com/learning/r-statistics-essential-training?u=2125562. The number of viewers is displayed in the following span tag (if you're not logged into LinkedIn Learning):
<span class="content__info__item__value viewers">82,552</span>
Extracting all html_nodes with the class content__info__item__value yields an xml_nodeset with 4 different spans:
library("tidyverse")
#> ── Attaching packages ───────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
#> ✔ ggplot2 2.2.1.9000 ✔ purrr 0.2.4
#> ✔ tibble 1.3.4 ✔ dplyr 0.7.4
#> ✔ tidyr 0.7.2 ✔ stringr 1.2.0
#> ✔ readr 1.1.1 ✔ forcats 0.2.0
#> ── Conflicts ──────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag() masks stats::lag()
library("rvest")
#> Loading required package: xml2
#>
#> Attaching package: 'rvest'
#> The following object is masked from 'package:purrr':
#>
#> pluck
#> The following object is masked from 'package:readr':
#>
#> guess_encoding
in_learning_url <- "https://www.linkedin.com/learning/r-statistics-essential-training?u=2125562"
in_learning_page <- read_html(in_learning_url)
in_learning_page %>%
  html_nodes(".content__info__item__value")
#> {xml_nodeset (4)}
#> [1] <span class="content__info__item__value duration">5h 59m 42s</span>
#> [2] <span class="content__info__item__value skill">Beginner + Intermedia ...
#> [3] <span class="content__info__item__value released">September 26, 2013 ...
#> [4] <span class="content__info__item__value viewers">82,552</span>
How can I now filter this xml_nodeset down to the viewers span, or specify multiple classes in html_nodes directly?
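For reference, here is a minimal sketch of the two approaches I have in mind. It assumes rvest passes compound CSS selectors (two class selectors written with no space between them) through unchanged. The `snippet` object is a hypothetical stand-in built from the spans shown above, so the example does not depend on the live page.

```r
library(rvest)

# Hypothetical stand-in for the live page, copied from the spans shown above
snippet <- read_html('<div>
  <span class="content__info__item__value duration">5h 59m 42s</span>
  <span class="content__info__item__value viewers">82,552</span>
</div>')

# Approach 1: compound CSS selector, matching nodes that carry BOTH classes
viewers <- snippet %>%
  html_nodes(".content__info__item__value.viewers") %>%
  html_text()

# Approach 2: filter an existing nodeset by inspecting its class attribute
all_values <- snippet %>%
  html_nodes(".content__info__item__value")
is_viewers <- grepl("\\bviewers\\b", html_attr(all_values, "class"))
viewers_filtered <- html_text(all_values[is_viewers])

viewers
#> [1] "82,552"
```

If this holds for the real page, the same selector should work on `in_learning_page` in place of `snippet`.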