Web Scraping Course list from DataCamp's Data Scientists with R Track

I just took the "Working with Web Data in R" course, https://www.datacamp.com/courses/working-with-web-data-in-r. To put some of the content into practice, I'm trying to scrape the course list from https://www.datacamp.com/tracks/data-scientist-with-r into a data.frame. I'm planning to export that list to .csv so I can upload it to my task management program (ClickUp).

I can't figure out how to get the names of the courses. I initially tried the following:

library(rvest)
#> Loading required package: xml2

test_url <- "https://www.datacamp.com/tracks/data-scientist-with-r"
test_xml <- read_html(test_url)

html_nodes(test_xml, css = ".dc-activity-block__stat-dropdown-link")
#> {xml_nodeset (0)}

Why am I unable to grab the text that comes between the href= attribute and the closing </a> tag? This is my first attempt at web scraping, so I might be missing something very obvious.

Where did you find this node class? I can't find it in the page using the dev tools pane in Firefox, and I think that is why you get nothing back.
You need to select the correct CSS class of the node you want to extract. Do you want to look for the solution yourself as an exercise, or do you want the answer? (I managed to do it and have it ready if you want.)
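A quick way to check whether a selector is right is to call `html_nodes()` and look at the size of the returned nodeset: an empty `{xml_nodeset (0)}` means that class is not in the page's HTML. A minimal offline sketch (the tiny HTML snippet here is a hypothetical stand-in for the live page, using the class names that appear later in this thread):

```r
library(rvest)

# Hypothetical snippet standing in for the live track page:
page <- read_html('<div><h4 class="course-block__title">Introduction to R</h4></div>')

# A selector that exists in the document returns a non-empty nodeset:
html_nodes(page, ".course-block__title")

# A class that is not in the HTML returns an empty nodeset,
# which is exactly what the question above ran into:
html_nodes(page, ".dc-activity-block__stat-dropdown-link")
```

On the real page, comparing the selector against what the browser dev tools show is the fastest way to catch this.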

EDIT: I put one solution below in a hidden section, if you want to see it that way.

Get a table with courses list from track
library(rvest)
#> Loading required package: xml2

test_url <- "https://www.datacamp.com/tracks/data-scientist-with-r"
test_xml <- read_html(test_url)

tibble::tibble(
  name = test_xml %>% html_nodes(".course-block__title") %>% html_text(),
  author = test_xml %>% html_nodes(".course-block__author-name") %>% html_text(),
  description = test_xml %>% html_nodes(".course-block__description") %>% html_text() %>% stringr::str_trim(),
  length = test_xml %>% html_nodes(".course-block__length") %>% html_text() %>% stringr::str_trim(),
  link = paste0("https://www.datacamp.com", test_xml %>% html_nodes(".course-block__link") %>% html_attr("href"))
)
#> # A tibble: 23 x 5
#>    name          author    description             length link             
#>    <chr>         <chr>     <chr>                   <chr>  <chr>            
#>  1 Introduction~ Jonathan~ Master the basics of d~ 4 hou~ https://www.data~
#>  2 Intermediate~ Filip Sc~ Continue your journey ~ 6 hou~ https://www.data~
#>  3 Introduction~ David Ro~ Get started on the pat~ 4 hou~ https://www.data~
#>  4 Importing Da~ Filip Sc~ In this course, you wi~ 3 hou~ https://www.data~
#>  5 Importing Da~ Filip Sc~ Parse data in any form~ 3 hou~ https://www.data~
#>  6 Cleaning Dat~ Nick Car~ Learn to explore your ~ 4 hou~ https://www.data~
#>  7 Importing & ~ Nick Car~ In this series of four~ 4 hou~ https://www.data~
#>  8 Writing Func~ Hadley W~ Learn the fundamentals~ 4 hou~ https://www.data~
#>  9 Data Manipul~ Garrett ~ Master techniques for ~ 4 hou~ https://www.data~
#> 10 Joining Data~ Garrett ~ This course will show ~ 4 hou~ https://www.data~
#> # ... with 13 more rows

Created on 2018-12-28 by the reprex package (v0.2.1)
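Since the original goal was to upload the list to ClickUp, the tibble above can be written straight to a .csv. A minimal sketch, assuming the result of the `tibble::tibble()` call was assigned to a variable (here a small hypothetical stand-in for the real scraped table):

```r
# Hypothetical two-row stand-in for the scraped course table:
courses <- tibble::tibble(
  name = c("Introduction to R", "Intermediate R"),
  link = c("https://www.datacamp.com/courses/free-introduction-to-r",
           "https://www.datacamp.com/courses/intermediate-r")
)

# readr::write_csv() writes without row names, ready to import elsewhere:
readr::write_csv(courses, "datacamp_r_track_courses.csv")
```

Base R's `write.csv(courses, "file.csv", row.names = FALSE)` works just as well if you prefer not to add the readr dependency.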

I know this is a bit off topic, but have you tried the webscraper.io plugin for Chrome? It's a point-and-click interface that lets you easily extract data and iterate through a website. It's very powerful, and you can set up highly complex and interdependent extracts to get what you need. Its downsides are that it depends on Chrome and must be user-operated, but I have found it covers most use cases for web scraping.


Thanks so much. Looks like I still need more practice. Any suggestions for other resources to help me get better at web scraping in R?

You should look out for blog posts from the rstats community involving web scraping. It's always good to look at examples. You should find some about sports analytics, I think; that is often data scraped from the web.

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.