Scraping multiple pages with rvest

Hi, I'm trying to scrape all comments on each subtopic under http://www.essentialbaby.com.au/forums/index.php?/forum/232-sleeping/

I'm not sure how to modify my code to extract all of the subtopics rather than only the first 20, and I'm also not sure how to map the replies and other comments to their subtopic. Below is my code:

Hello @tamara!

Welcome to the wonderful RStudio Community.

The code below should help you achieve your scraping goals on the www.essentialbaby.com.au website. Be warned, however, that I use slightly different column names in my code, so if anything is unclear, do not hesitate to let me know.

One important point: you would like to scrape not only the first 20 threads but all of them. This can be problematic because it might take a really long time to run (there are 166 pages with 20 threads on each page). It is, however, not an impossible task. For this reason, I created a function, scrape_ebaby_bypage(), whose only argument, page_numbers, specifies the pages to scrape (e.g. scrape_ebaby_bypage(page_numbers = 1:5) scrapes the first 5 pages). Calling the function with page_numbers = 1:166 therefore scrapes all pages.

Your code only scrapes the first 20 threads simply because it only scrapes the first page. So what I did was build the links to the other pages so that they can be scraped too.
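As a rough illustration of how those page links are put together: judging from the URLs used in the code in this answer, the forum seems to paginate through an `st` offset at the end of the URL that grows by 20 per page. The helper name `build_page_urls()` and its `threads_per_page` argument are hypothetical and only serve this sketch.

```r
# Hypothetical helper: build the URL of every forum page.
# Assumes page 1 is the bare forum URL and page n (n > 1) carries an
# offset of (n - 1) * threads_per_page in the trailing `st` parameter.
build_page_urls <- function(n_pages, threads_per_page = 20) {
  base_url <- "http://www.essentialbaby.com.au/forums/index.php?/forum/232-sleeping/"
  suffix <- "page__prune_day__100__sort_by__Z-A__sort_key__last_post__topicfilter__all__st__"
  if (n_pages < 2) return(base_url)
  offsets <- seq_len(n_pages - 1) * threads_per_page
  c(base_url, paste0(base_url, suffix, offsets))
}

page_urls <- build_page_urls(166)

# Pick out the pages you want by position, e.g. the first two
page_urls[1:2]
```

Indexing the resulting vector by page number is the same idea the main function uses to select pages.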

The first function below, scrape_thread_data(), is a helper that is called inside the main scrape_ebaby_bypage() function, so define both functions, in that order, before using the latter for your scraping needs.

Finally, the first two pages are scraped at the end of the code. The output of the main function is nested (one row per thread, with that thread's posts in a list column), but you can easily unnest it, as the code also shows.
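If nested output is new to you, here is a small self-contained sketch of the idea. The `toy` tibble is made-up data standing in for the scraper output, not anything taken from the forum.

```r
library(tibble)
library(tidyr)

# Toy stand-in for the scraper output: one row per thread,
# with that thread's posts stored in a list column.
toy <- tibble(
  thread_title = c("Thread A", "Thread B"),
  thread_data = list(
    tibble(participant = c("ann", "bob"), post = c("first post", "a reply")),
    tibble(participant = "cat", post = "only post")
  )
)

# One row per post; the thread-level columns are repeated for each post
flat <- unnest(toy, cols = thread_data)
flat
```

After unnesting, every post sits on its own row next to the title of the thread it belongs to, which is what maps replies back to their thread.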

  1. Functions to scrape the data
# Load required packages
pacman::p_load(rvest, dplyr, stringr, purrr, lubridate, tibble, tidyr)

# Secondary custom function which scrapes data in each thread
# Input:
# - thread_link <chr>: link of the thread to scrape
# Output: tibble with the following columns: 
# - participant <chr>: name of the poster
# - post_date <dttm>: date of the post
# - post <chr>: content of the post

scrape_thread_data <- function(thread_link){
  
  thread_html <- read_html(thread_link)
  
  participant <- thread_html %>%
    html_nodes(css = ".guest , .vcard") %>%
    html_text() %>%
    str_squish()
  
  post_date <- thread_html %>%
    html_nodes(css = ".published") %>%
    html_text() %>%
    enframe(name = "id") %>%
    # Drop the " -" separator, then split the timestamp into its parts
    mutate(value = str_replace_all(string = value, pattern = " -", replacement = "")) %>%
    separate(col = value, into = c("day", "month", "year", "time", "time_of_day"), sep = " ") %>%
    separate(col = time, into = c("hour", "min"), sep = ":") %>%
    mutate_at(vars(day, year, hour, min), as.integer) %>%
    # Convert the full month name to its number
    mutate(month = match(month, month.name)) %>%
    # Note: time_of_day is split out but not used, so hours are kept as scraped
    transmute(time = ISOdatetime(year = year, month = month, day = day, hour = hour, min = min, sec = 0)) %>%
    pull()
  
  post <- thread_html %>%
    html_nodes(css = ".entry-content") %>%
    html_text() %>%
    str_trim()
  
  tibble(participant, post_date, post)
}

# Main function which creates a master data set
# Input:
# - page_numbers <numeric>: numeric vector specifying the pages to scrape (default is 1)
# Output is a tibble with the following columns:
# - thread_creator <chr>: name of the creator of the thread
# - date <date>: date of creation of thread
# - thread_title <chr>: title of the thread
# - thread_url <chr>: Link of the thread (serves as input to the scrape_thread_data() function above)
# - thread_data <list>: a list column containing the output of the scrape_thread_data() function for each thread

scrape_ebaby_bypage <- function(page_numbers = 1){
  
  page_urls <- c("http://www.essentialbaby.com.au/forums/index.php?/forum/232-sleeping/",
                 paste0("http://www.essentialbaby.com.au/forums/index.php?/forum/232-sleeping/page__prune_day__100__sort_by__Z-A__sort_key__last_post__topicfilter__all__st__", 1:165 * 20))
  
  urls <- page_urls[page_numbers]
  
  htmls <- map(urls, read_html)
  
  thread_url <- map(htmls, function(html){
    html %>%
      html_nodes(css = ".topic_title") %>%
      html_attr(name = "href")
  }) %>%
    flatten_chr()
  
  # Scrape thread titles
  
  thread_title <- map(htmls, function(html){
    html %>%
      html_nodes(css = ".topic_title") %>%
      html_text() %>%
      str_trim()
  }) %>%
    flatten_chr()
  
  thread_creator_and_date <- map_dfr(htmls, function(html){
    html %>%
      html_nodes(css = ".lighter") %>%
      html_text() %>%
      str_trim() %>%
      enframe(name = "id") %>%
      mutate(value = str_replace_all(string = value, pattern = "Started by |\n|\t", replacement = "")) %>%
      separate(col = value, into = c("thread_creator", "date"), sep = ", ") %>%
      mutate(date = dmy(date)) %>%
      select(-id)
  })
  
  master_data <- bind_cols(thread_creator_and_date, thread_title = thread_title, thread_url = thread_url) %>%
    mutate(thread_data = map(thread_url, scrape_thread_data))
  
  master_data
}
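One thing to keep in mind: scrape_thread_data() splits off a time_of_day part but does not use it when building post_date. Assuming that part is an AM/PM marker (inferred from how the timestamp is split, not checked against the site), a minimal sketch of a conversion to the 24-hour clock could look like this. The helper `to_24_hour()` is hypothetical.

```r
# Hypothetical helper: convert a 12-hour clock hour plus an AM/PM marker
# to a 24-hour clock hour.
to_24_hour <- function(hour, time_of_day) {
  hour <- as.integer(hour)
  # 12 AM becomes 0 and 12 PM stays 12; every other PM hour gains 12
  (hour %% 12L) + ifelse(toupper(time_of_day) == "PM", 12L, 0L)
}

to_24_hour(c(11, 12, 12, 9), c("AM", "AM", "PM", "PM"))
# [1] 11  0 12 21
```

If the assumption holds, a step such as mutate(hour = to_24_hour(hour, time_of_day)) placed just before the transmute() call would apply it.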
  2. Using the functions to scrape
dat <- scrape_ebaby_bypage(page_numbers = 1:2)

dat

# A tibble: 40 x 5
   thread_creator  date       thread_title                                      thread_url                                                                 thread_data  
   <chr>           <date>     <chr>                                             <chr>                                                                      <list>       
 1 lucky 2         2014-06-03 Sleep Schools (Early Parenting Centres)- members… http://www.essentialbaby.com.au/forums/index.php?/topic/1130252-sleep-sch… <tibble [10 …
 2 Shellby         2010-02-25 Control Crying Alternatives                       http://www.essentialbaby.com.au/forums/index.php?/topic/770816-control-cr… <tibble [2 ×…
 3 Shellby         2009-11-08 New Moderator                                     http://www.essentialbaby.com.au/forums/index.php?/topic/736132-new-modera… <tibble [1 ×…
 4 .Ally.          2008-06-04 Read this before posting!                         http://www.essentialbaby.com.au/forums/index.php?/topic/546955-read-this-… <tibble [1 ×…
 5 Caribou         2019-06-04 Farewell, Au revoir, Auf Wiedersehen, To Day Sle… http://www.essentialbaby.com.au/forums/index.php?/topic/1204169-farewell-… <tibble [25 …
 6 Zeppelina       2019-05-13 8yo and sleep anxiety                             http://www.essentialbaby.com.au/forums/index.php?/topic/1203750-8yo-and-s… <tibble [8 ×…
 7 PandoBox        2019-05-06 I completely ruined her sleep , how do I fix it?  http://www.essentialbaby.com.au/forums/index.php?/topic/1203593-i-complet… <tibble [16 …
 8 Davidoff-sensei 2019-04-24 4 month old absolutely hates nap/bed time. Screa… http://www.essentialbaby.com.au/forums/index.php?/topic/1203358-4-month-o… <tibble [25 …
 9 joeyinthesky    2017-09-02 13mo crazy sleep issues                           http://www.essentialbaby.com.au/forums/index.php?/topic/1189771-13mo-craz… <tibble [22 …
10 Kattikat        2019-03-21 18 Mo old thinks she's a newborn                  http://www.essentialbaby.com.au/forums/index.php?/topic/1202716-18-mo-old… <tibble [3 ×…
# … with 30 more rows

unnest(dat, cols = thread_data)

# A tibble: 552 x 7
   thread_creator date       thread_title                  thread_url                              participant post_date           post                                 
   <chr>          <date>     <chr>                         <chr>                                   <chr>       <dttm>              <chr>                                
 1 lucky 2        2014-06-03 Sleep Schools (Early Parenti… http://www.essentialbaby.com.au/forums… lucky 2     2014-06-03 11:06:00 "Hi,\nA thread has been suggested wh…
 2 lucky 2        2014-06-03 Sleep Schools (Early Parenti… http://www.essentialbaby.com.au/forums… Ellen101    2014-06-05 09:51:00 "One for the neutral camp \nWe recen…
 3 lucky 2        2014-06-03 Sleep Schools (Early Parenti… http://www.essentialbaby.com.au/forums… Muffintop   2014-06-05 10:22:00 "Neutral again I think.\nWe attended…
 4 lucky 2        2014-06-03 Sleep Schools (Early Parenti… http://www.essentialbaby.com.au/forums… silverbubb… 2014-07-27 10:12:00 "Amazing, positive results. Have att…
 5 lucky 2        2014-06-03 Sleep Schools (Early Parenti… http://www.essentialbaby.com.au/forums… libbylu     2014-07-27 10:28:00 Positive - I attended a day stay pro…
 6 lucky 2        2014-06-03 Sleep Schools (Early Parenti… http://www.essentialbaby.com.au/forums… BeakyHoney… 2014-08-17 08:00:00 "These replies are great. \nI have a…
 7 lucky 2        2014-06-03 Sleep Schools (Early Parenti… http://www.essentialbaby.com.au/forums… RockLobster 2014-08-18 09:59:00 "FERALfoxgirls, on 17 August 2014 - …
 8 lucky 2        2014-06-03 Sleep Schools (Early Parenti… http://www.essentialbaby.com.au/forums… Charli73    2014-08-18 10:13:00 "I was in a public melbourne sleep s…
 9 lucky 2        2014-06-03 Sleep Schools (Early Parenti… http://www.essentialbaby.com.au/forums… Natttmumm   2014-08-18 10:29:00 "We went to Tresillian in Sydney qui…
10 lucky 2        2014-06-03 Sleep Schools (Early Parenti… http://www.essentialbaby.com.au/forums… nup         2016-04-21 06:40:00 "A very strong negative from me on a…
# … with 542 more rows
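A full run over all 166 pages means several thousand requests, so it may be worth pausing between them and not letting a single failed request stop everything. The sketch below shows one way to do that with purrr::possibly(). The wrapper `scrape_politely()`, its `delay` argument, and the fake scraper used for the demonstration are all hypothetical.

```r
library(purrr)

# Hypothetical wrapper: call `scraper` on each URL, pausing between
# requests and returning NULL instead of stopping when a call fails.
scrape_politely <- function(urls, scraper, delay = 1) {
  safe_scraper <- possibly(scraper, otherwise = NULL)
  map(urls, function(url) {
    Sys.sleep(delay)
    safe_scraper(url)
  })
}

# Demonstration with a fake scraper that fails on one input
fake_scraper <- function(url) {
  if (url == "bad") stop("request failed")
  toupper(url)
}

results <- scrape_politely(c("a", "bad", "b"), fake_scraper, delay = 0)
str(results)
```

With the real functions, something like scrape_politely(thread_url, scrape_thread_data) could stand in for the plain map() call; threads whose request failed would then hold NULL instead of a tibble.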

Hope this helps, and, once again, welcome to the community!
