Forum Scraping with rvest

Hello, nice to be here. I am a fairly new R user, and I want to scrape discussions from a forum for text mining. I found an earlier thread here whose code works perfectly for the forum discussed in that thread.

However, I cannot get it to work for this forum:

https://www.gearslutz.com/board/mastering-forum/

Can anyone help me out? Many thanks.

Hi @Compresearcher,

Web scraping is a skill that I recommend you learn, because a script that scrapes one website will (with almost 100% certainty) not work on another. You mentioned that you are new to R, and I wish you the best in your learning experience.

Also, you should know that scraping forums can take a very long time, especially forums with many threads and many pages, so I wrote my functions such that you can choose the exact pages that you want to scrape (explained a bit more below).

  1. Install any of the packages below that you don't have, then run this code to define the functions:
library(rvest)
library(dplyr)
library(purrr)
library(stringr)


# ===============================================================
# HELPER FUNCTIONS
# ===============================================================

# Get thread info (thread title and url)

scrape_thread_info <- function(page_url){
  
  html <- page_url %>%
    read_html() %>%
    html_nodes(css = ".icons+ a")
  
  tibble(
    title = html %>% html_text(),
    url = html %>% html_attr(name = "href")
  )
  
}
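# Example: calling this on the forum's landing page returns a tibble with
# one row per thread listed on that page (thread title + thread URL):
# scrape_thread_info("https://www.gearslutz.com/board/mastering-forum/")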

# Check if thread has a single page

has_single_page <- function(thread_url){
  
  thread_url %>%
    read_html() %>%
    html_node(css = ".brace") %>%
    html_text() %>%
    is.na()
}
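# Example: returns TRUE when the thread page has no .brace (page navigation)
# node. The "New Sub-Forum" thread in the output below has a single page of
# posts, so this should return TRUE:
# has_single_page("https://www.gearslutz.com/board/mastering-forum/885955-new-sub-forum.html")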

# Scrape posts

scrape_posts <- function(thread_link){
  
  thread_link %>%
    read_html() %>%
    html_nodes(css = ".message_data") %>%
    html_text() %>%
    str_squish()
  
}
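# Example: returns a character vector with one element per post on the
# thread page (same URL as above, taken from the scraped output below):
# scrape_posts("https://www.gearslutz.com/board/mastering-forum/885955-new-sub-forum.html")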

# ================================================================
# MAIN FUNCTION FOR SCRAPING
# ================================================================

scrape_gs_pages <- function(pages){
  
  # Find number of pages in the forum and get the link for each page
  
  page_1_url <- "https://www.gearslutz.com/board/mastering-forum/"
  page_1_html <- read_html(page_1_url)
  
  n_pages <- page_1_html %>%
    html_nodes(css = ".inner_nav .smallfont") %>%
    html_text() %>%
    str_extract(pattern = "\\d+") %>%
    as.numeric() %>%
    max(na.rm = TRUE)
  
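  # Page 1 lives at the base URL; pages 2 through n_pages follow the
  # "indexN.html" pattern. Indexing with [pages] keeps only the requested pages.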
  page_urls <- c(
    page_1_url,
    paste0("https://www.gearslutz.com/board/mastering-forum/index", 2:n_pages, ".html")
  )[pages]
  
  
  # Get threads info for the pages of interest
  
  master <- map_dfr(page_urls, scrape_thread_info)
  
  master %>%
    mutate(
      posts = url %>% map(scrape_posts)
    )
  
}
  2. Use the main function by specifying the pages that you want to scrape as a numeric vector. The example below scrapes pages 1 and 2:
forum_data <- scrape_gs_pages(pages = c(1, 2))
forum_data

# A tibble: 104 x 3
   title                                                                     url                                                                                                            posts    
   <chr>                                                                     <chr>                                                                                                          <list>   
 1 Targeting Mastering Loudness for Streaming (LUFS, Spotify, YouTube)-Why ~ https://www.gearslutz.com/board/mastering-forum/1252522-targeting-mastering-loudness-streaming-lufs-spotify-y~ <chr [30~
 2 Mastering Studio robbed - Stolen Gear List                                https://www.gearslutz.com/board/mastering-forum/1175879-mastering-studio-robbed-stolen-gear-list.html          <chr [30~
 3 New Sub-Forum                                                             https://www.gearslutz.com/board/mastering-forum/885955-new-sub-forum.html                                      <chr [1]>
 4 Mastering Forum Guidelines                                                https://www.gearslutz.com/board/mastering-forum/150125-mastering-forum-guidelines.html                         <chr [1]>
 5 Lynx Hilo still a good purchase in 2019?                                  https://www.gearslutz.com/board/mastering-forum/1269153-lynx-hilo-still-good-purchase-2019-a.html              <chr [30~
 6 Leveller                                                                  https://www.gearslutz.com/board/mastering-forum/1271943-leveller.html                                          <chr [1]>
 7 Opinions on using EQ Match or other similar tools in the mastering proce~ https://www.gearslutz.com/board/mastering-forum/1271922-opinions-using-eq-match-other-similar-tools-mastering~ <chr [1]>
 8 About measurement methods, scales and their comparability                 https://www.gearslutz.com/board/mastering-forum/1271730-about-measurement-methods-scales-their-comparability.~ <chr [26~
 9 Hendy Amps Michelangelo                                                   https://www.gearslutz.com/board/mastering-forum/987380-hendy-amps-michelangelo.html                            <chr [30~
10 Extreme Mastering or Crazy Daisy                                          https://www.gearslutz.com/board/mastering-forum/199340-extreme-mastering-crazy-daisy.html                      <chr [6]>
# ... with 94 more rows

As you can see, the posts column is a list-column, so you will have to use the unnest() function from the tidyr package (library(tidyr)) to see the actual posts:

unnested_forum_data <- unnest(data = forum_data, cols = "posts")
unnested_forum_data

# A tibble: 1,992 x 3
   title                                     url                                                            posts                                                                                    
   <chr>                                     <chr>                                                          <chr>                                                                                    
 1 Targeting Mastering Loudness for Streami~ https://www.gearslutz.com/board/mastering-forum/1252522-targe~ "Below I am sharing something that I send to my mastering clients when they inquire abou~
 2 Targeting Mastering Loudness for Streami~ https://www.gearslutz.com/board/mastering-forum/1252522-targe~ "Hey Justin, Great post. For awhile I got on the bandwagon of multiple masters, one for ~
 3 Targeting Mastering Loudness for Streami~ https://www.gearslutz.com/board/mastering-forum/1252522-targe~ Hey Christopher, Thanks! Yes, density has become a desired sound quality for sure. In an~
 4 Targeting Mastering Loudness for Streami~ https://www.gearslutz.com/board/mastering-forum/1252522-targe~ You suggested one master which I love that idea or Would you recommend a 16bit/44.1 mast~
 5 Targeting Mastering Loudness for Streami~ https://www.gearslutz.com/board/mastering-forum/1252522-targe~ Quote: Originally Posted by sdd17 You suggested one master which I love that idea or Wou~
 6 Targeting Mastering Loudness for Streami~ https://www.gearslutz.com/board/mastering-forum/1252522-targe~ Delivering -14LUFS masters, unless specifically asked for, in any commercial genre, woul~
 7 Targeting Mastering Loudness for Streami~ https://www.gearslutz.com/board/mastering-forum/1252522-targe~ Thanks for the post. Absolute truth. The one time I mastered to -14LUFS, the client comp~
 8 Targeting Mastering Loudness for Streami~ https://www.gearslutz.com/board/mastering-forum/1252522-targe~ Quote: Originally Posted by Dave Polich I think we’re done with this stupidity regarding~
 9 Targeting Mastering Loudness for Streami~ https://www.gearslutz.com/board/mastering-forum/1252522-targe~ "Quote: Originally Posted by Trakworx Below I am sharing something that I send to my mas~
10 Targeting Mastering Loudness for Streami~ https://www.gearslutz.com/board/mastering-forum/1252522-targe~ Quote: Originally Posted by Trakworx make one digital master that sounds good, is not ov~
# ... with 1,982 more rows

I hope this helps.


Hi Gueyeono,

Many thanks for the solution. Now we have two examples in this thread; it will be an excellent resource for those learning how to scrape forums.

One issue, however: I cannot get unnest to work. It does not extract the posts from the list-column.

Also, may I ask how you would adapt the code to exclude quotes within posts?

Thank you once again for your help. You're a great poster!

Your screenshot shows that you have a posts column as well as a cols column, and I am not sure how you got that. My answer above shows the output that you should get: 3 columns, title, url, and posts. If you run the code I provided, you should not obtain output that is different from what I posted.

Also, it's always better to post your code in code chunks rather than screenshots. That makes it easy for potential helpers to copy and paste your code.

Finally, you can remove quotes with the help of the str_replace_all() function from the stringr package. Assuming that you are able to reproduce the results of my code, you can run:

mutate(unnested_forum_data, posts = str_replace_all(string = posts, pattern = '\\"', replacement = ""))
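
To be clear, this removes literal quotation-mark characters from the text; a quick illustration:

str_replace_all(string = 'He said "hello" twice', pattern = '\\"', replacement = "")
#> [1] "He said hello twice"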

Since you are a newcomer to the R community, I would greatly encourage you to learn from this book: R for Data Science. Once you get the basics down, you can learn more about web scraping and other things. In the meantime, do not hesitate to ask more questions if you need to.

Hello Gueyeono,

I got it to work. I had to change the code to remove cols = "posts":

unnested_forum_data <- unnest(data = forum_data)

I noticed that unnest() in {tidyr} does not have a cols argument. Perhaps you are using a different version of tidyr? I have the most up-to-date Mac version.

Hi,

You are right, I am using the development version of tidyr (which will become the next release). I am going to repost the full code here, and this time it will work with your version of tidyr. I am also adding the code for removing quotes from the posts, as you requested. I hope it helps.
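
As an aside, you can check which version of tidyr you have installed by running:

packageVersion("tidyr")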

library(rvest)
library(dplyr)
library(purrr)
library(stringr)
library(tidyr)


# ===============================================================
# HELPER FUNCTIONS
# ===============================================================

# Get thread info (thread title and url)

scrape_thread_info <- function(page_url){
  
  html <- page_url %>%
    read_html() %>%
    html_nodes(css = ".icons+ a")
  
  tibble(
    title = html %>% html_text(),
    url = html %>% html_attr(name = "href")
  )
  
}

# Check if thread has a single page

has_single_page <- function(thread_url){
  
  thread_url %>%
    read_html() %>%
    html_node(css = ".brace") %>%
    html_text() %>%
    is.na()
}

# Scrape posts

scrape_posts <- function(thread_link){
  
  thread_link %>%
    read_html() %>%
    html_nodes(css = ".message_data") %>%
    html_text() %>%
    str_squish()
  
}

# ================================================================
# MAIN FUNCTION FOR SCRAPING
# ================================================================

scrape_gs_pages <- function(pages){
  
  # Find number of pages in the forum and get the link for each page
  
  page_1_url <- "https://www.gearslutz.com/board/mastering-forum/"
  page_1_html <- read_html(page_1_url)
  
  n_pages <- page_1_html %>%
    html_nodes(css = ".inner_nav .smallfont") %>%
    html_text() %>%
    str_extract(pattern = "\\d+") %>%
    as.numeric() %>%
    max(na.rm = TRUE)
  
  page_urls <- c(
    page_1_url,
    paste0("https://www.gearslutz.com/board/mastering-forum/index", 2:n_pages, ".html")
  )[pages]
  
  
  # Get threads info for the pages of interest
  
  master <- map_dfr(page_urls, scrape_thread_info)
  
  master %>%
    mutate(
      posts = url %>% map(scrape_posts)
    )
  
}

forum_data <- scrape_gs_pages(pages = c(1, 2)) %>%
    unnest() %>%
    mutate(posts = str_replace_all(string = posts, pattern = '\\"', replacement = ""))

forum_data

# A tibble: 2,019 x 3
   title                                url                                                   posts                                                        
   <chr>                                <chr>                                                 <chr>                                                        
 1 Targeting Mastering Loudness for St~ https://www.gearslutz.com/board/mastering-forum/1252~ "Below I am sharing something that I send to my mastering cl~
 2 Targeting Mastering Loudness for St~ https://www.gearslutz.com/board/mastering-forum/1252~ "Hey Justin, Great post. For awhile I got on the bandwagon o~
 3 Targeting Mastering Loudness for St~ https://www.gearslutz.com/board/mastering-forum/1252~ Hey Christopher, Thanks! Yes, density has become a desired s~
 4 Targeting Mastering Loudness for St~ https://www.gearslutz.com/board/mastering-forum/1252~ You suggested one master which I love that idea or Would you~
 5 Targeting Mastering Loudness for St~ https://www.gearslutz.com/board/mastering-forum/1252~ Quote: Originally Posted by sdd17 You suggested one master w~
 6 Targeting Mastering Loudness for St~ https://www.gearslutz.com/board/mastering-forum/1252~ Delivering -14LUFS masters, unless specifically asked for, i~
 7 Targeting Mastering Loudness for St~ https://www.gearslutz.com/board/mastering-forum/1252~ Thanks for the post. Absolute truth. The one time I mastered~
 8 Targeting Mastering Loudness for St~ https://www.gearslutz.com/board/mastering-forum/1252~ Quote: Originally Posted by Dave Polich I think we’re done w~
 9 Targeting Mastering Loudness for St~ https://www.gearslutz.com/board/mastering-forum/1252~ "Quote: Originally Posted by Trakworx Below I am sharing som~
10 Targeting Mastering Loudness for St~ https://www.gearslutz.com/board/mastering-forum/1252~ Quote: Originally Posted by Trakworx make one digital master~
# ... with 2,009 more rows

Hi,

Fantastic work. The only issue is that the quote-removal code doesn't seem to work; see rows 4 and 5. The post in row 5 still contains the quoted text originally posted in row 4, which in this example is the text "You suggested one master which I love". It would be excellent to take out the quoted text.

Once again, thank you for taking the time to help. This thread is a great resource, and the book you linked looks perfect.
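
In case it helps, here is a rough sketch of the kind of thing I mean (strip_quote_prefix() is just a hypothetical helper I made up). It assumes every quote begins with the literal prefix "Quote: Originally Posted by <username>", as in the output above; since str_squish() flattens each post into a single line, nothing marks where the quoted text ends, so this can only strip the prefix, not the quoted body itself:

library(stringr)

# Hypothetical helper: remove the "Quote: Originally Posted by <username>"
# prefix from each post. Caveats: the quoted body itself remains (the
# scraped text has no end-of-quote delimiter), and usernames containing
# spaces (e.g. "Dave Polich") are only partially matched by \\S+.
strip_quote_prefix <- function(posts){
  str_remove_all(posts, pattern = "Quote: Originally Posted by \\S+ ")
}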
