scraping messages in forum using rvest

:partying_face: Welcome to the RStudio Community forum, @anacho! :partying_face:

The following code scrapes all the data you need. You clearly have some experience with web scraping, so I won't spend much time explaining what the code does for now, but feel free to ask questions and I will be more than glad to provide more details. :slight_smile: As for the messages in the threads, I scraped the link to each thread and then used those links to visit the thread pages and scrape the messages. The final result of this script is a tidy data frame with one list-column (since there are several messages in each thread).

library(rvest)
library(dplyr)
library(stringr)
library(purrr)

# Scrape thread titles, thread links, authors and number of views

url <- "https://www.healthboards.com/boards/aspergers-syndrome/"
h <- read_html(url)

threads <- h %>%
  html_nodes("#threadslist .alt1 a") %>%
  html_text()

thread_links <- h %>%
  html_nodes("#threadslist .alt1 a") %>%
  html_attr(name = "href")

authors <- h %>%
  html_nodes("#threadslist .alt1 .smallfont") %>%
  html_text() %>%
  str_replace_all(pattern = "\t|\r|\n", replacement = "")

views <- h %>%
  html_nodes(".alt2:nth-child(6)") %>%
  html_text() %>%
  str_replace_all(pattern = ",", replacement = "") %>%
  as.numeric()


# Custom function to scrape messages in each thread

scrape_messages <- function(link){
  read_html(link) %>%
    html_nodes(css = ".smallfont~ hr+ div") %>%
    html_text() %>%
    str_replace_all(pattern = "\t|\r|\n", replacement = "") %>%
    str_trim()
}

# Create master dataset (and scrape messages in each thread in process)

master_data <- 
  tibble(threads, authors, views, thread_links) %>%
  mutate(messages = map(thread_links, scrape_messages)) %>%
  select(threads:views, messages, thread_links)

head(master_data)

  threads                            authors      views messages thread_links                                                            
  <chr>                              <chr>        <dbl> <list>   <chr>                                                                   
1 ADHD And Aspergers                 MyNameIsCra~  4973 <chr [3~ https://www.healthboards.com/boards/aspergers-syndrome/1035173-adhd-asp~
2 Adult Pants Pooping and Asperger'~ poopypants21  1680 <chr [4~ https://www.healthboards.com/boards/aspergers-syndrome/1037809-adult-pa~
3 I did NOT spoil him!               mery          5939 <chr [7~ https://www.healthboards.com/boards/aspergers-syndrome/921652-i-did-not~
4 ASD Assessment as an adult, how?   Dragonfly W~  1243 <chr [2~ https://www.healthboards.com/boards/aspergers-syndrome/1032212-asd-asse~
5 Sex and the single woman with AS   Madeofglass   7040 <chr [4~ https://www.healthboards.com/boards/aspergers-syndrome/973625-sex-singl~
6 I have aspergers and very severe ~ joe398        1445 <chr [2~ https://www.healthboards.com/boards/aspergers-syndrome/1029904-i-have-a~
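
If you would rather have one row per message than one row per thread, the list-column can be expanded with tidyr::unnest(). The snippet below is only a minimal sketch on a made-up toy tibble (toy_data and its contents are hypothetical stand-ins for master_data, and tidyr is an extra package not loaded in the script above), so it runs without hitting the website:

```r
library(tibble)
library(tidyr)

# Hypothetical stand-in for master_data: one row per thread,
# with the messages stored as a list-column of character vectors.
toy_data <- tibble(
  threads  = c("Thread A", "Thread B"),
  messages = list(c("first post", "a reply"), "only post")
)

# unnest() repeats the thread-level columns once for every message,
# turning the list-column into a plain character column.
long_data <- unnest(toy_data, cols = messages)

long_data
```

The same call on the real data would presumably be unnest(master_data, cols = messages), assuming every element of messages is a character vector, as in the output above.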

Hope this helps.
