scraping Q&A forum + user info

Thanks to @gueyenono's replies to my previous post (scraping messages in forum using rvest), I was able to do something similar for another Q&A forum using the following code:

library(rvest)
library(dplyr)
library(stringr)
library(purrr)

# Scrape thread titles, thread links and authors

url <- "https://www.medhelp.org/forums/Aspergers-Syndrome/show/191?page=1"
# last page of the forum: https://www.medhelp.org/forums/Aspergers-Syndrome/show/191?page=46
h <- read_html(url)

threads <- h %>%
  html_nodes(".subj_title a") %>%
  html_text()

threads

thread_links <- h %>%
  html_nodes(".subj_title a") %>%
  html_attr(name = "href")

thread_links <- paste0("https://www.medhelp.org", thread_links)

thread_links

author <- h %>%
  html_nodes(".username a") %>%
  html_text() %>%
  str_replace_all(pattern = "\t|\r|\n", replacement = "")

author

scrape_messages <- function(link){
  read_html(link) %>%
    html_nodes(css = "#subject_msg , .resp_body") %>%
    html_text() %>%
    str_replace_all(pattern = "\t|\r|\n", replacement = "") %>%
    str_trim()
}

# Create master dataset (and scrape messages in each thread in process)
library(tidyr)
library(tibble)

master_data <- 
  tibble(threads, author, thread_links) %>%
  mutate(messages = map(thread_links, scrape_messages)) %>%
  select(threads:author, messages, thread_links) %>%
  unnest()
write.csv(master_data, "C:/Asperger/page1.csv", na = "")

It works fine, but I'm trying to add two things:

1. As the code stands, for each thread I'm only storing the user ID that started it, not the user IDs that reply to it, so it would be great to have the "author" column of the CSV file contain the author of each post.
2. This forum also has information about each user. For example, for this user
https://www.medhelp.org/personal_pages/user/20824631
there is an "About me" section, and I'm trying to create an "About me" column in the CSV file.
But not all users have filled in this information; for those who haven't, I just want to leave the value as NULL or NA, but I haven't managed to do it...

Hey @anacho,

I was able to complete your first request, which was to scrape the author IDs in each thread. I had to change a few variable and function names. I also used the RCurl::getURL() function to save the HTML of every link into a variable and then scrape the data of interest from that variable. This is good practice because otherwise the code hits the website repeatedly, and some websites will lock you out for doing so.
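
If you do end up re-downloading many pages in one session, a short pause between requests also helps avoid getting blocked. A rough sketch, not part of the code below, using purrr's slowly() (available in purrr 0.3.0 and later), as an alternative to the plain map(thread_links, getURL) call further down:

library(purrr)
library(RCurl)

# Download each thread's HTML with a ~2 second pause between requests,
# so the site isn't hammered with back-to-back hits.
polite_getURL <- slowly(getURL, rate = rate_delay(2))
htmls <- map(thread_links, polite_getURL)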

library(rvest)
library(dplyr)
library(stringr)
library(purrr)
library(tidyr)
library(RCurl)

# Scrape thread titles, thread links, authors and number of views

url <- "https://www.healthboards.com/boards/aspergers-syndrome/"

h <- read_html(url)

threads <- h %>%
  html_nodes("#threadslist .alt1 a") %>%
  html_text()

thread_links <- h %>%
  html_nodes("#threadslist .alt1 a") %>%
  html_attr(name = "href")

thread_starters <- h %>%
  html_nodes("#threadslist .alt1 .smallfont") %>%
  html_text() %>%
  str_replace_all(pattern = "\t|\r|\n", replacement = "")

views <- h %>%
  html_nodes(".alt2:nth-child(6)") %>%
  html_text() %>%
  str_replace_all(pattern = ",", replacement = "") %>%
  as.numeric()

# Custom functions to scrape author IDs and posts

scrape_posts <- function(link){
  read_html(link) %>%
    html_nodes(css = ".smallfont~ hr+ div") %>%
    html_text() %>%
    str_replace_all(pattern = "\t|\r|\n", replacement = "") %>%
    str_trim()
}
 
scrape_author_ids <- function(link){
  h <- read_html(link) %>%
    html_nodes("div") 
  
  id_index <- h %>%
    html_attr("id") %>%
    str_which(pattern = "postmenu")
  
  h %>%
    `[`(id_index) %>%
    html_text() %>%
    str_replace_all(pattern = "\t|\r|\n", replacement = "") %>%
    str_trim()
}


# Create master dataset

# Download each thread's HTML once, so the scraping below works from saved copies
htmls <- map(thread_links, getURL)

master_data <- 
  tibble(threads, thread_starters, views, thread_links) %>%
  mutate(
    post_author_id = map(htmls, scrape_author_ids),
    post = map(htmls, scrape_posts)
  ) %>%
  select(threads:views, post_author_id, post, thread_links) %>%
  unnest()

head(master_data)

 threads              thread_starters views thread_links                                 post_author_id post                                                     
  <chr>                <chr>           <dbl> <chr>                                        <chr>          <chr>                                                    
1 ADHD And Aspergers   MyNameIsCrazy    5021 https://www.healthboards.com/boards/asperge~ MyNameIsCrazy  I have adhd and asperger syndrome and was wondering abou~
2 ADHD And Aspergers   MyNameIsCrazy    5021 https://www.healthboards.com/boards/asperge~ Dragonfly Win~ Hi there,My son has both, I have Inattentive ADHD and un~
3 ADHD And Aspergers   MyNameIsCrazy    5021 https://www.healthboards.com/boards/asperge~ DuckyBaby03    Hello, I understand what your going through. I also have~
4 Adult Pants Pooping~ poopypants21     1705 https://www.healthboards.com/boards/asperge~ poopypants21   I am a 42 year old male with Asperger's Syndrome and occ~
5 Adult Pants Pooping~ poopypants21     1705 https://www.healthboards.com/boards/asperge~ 7ash7          Hi, to help answer your question, do you conciously and/~
6 Adult Pants Pooping~ poopypants21     1705 https://www.healthboards.com/boards/asperge~ poopypants21   Accidentally. My GF does wear cloth diapers because she ~

As for your second request, I am not sure how you accessed the "About me" page on the website.

Hope this helps.

Thanks so much, @gueyenono! This helps because I will also use that forum, but I don't think I explained myself clearly:

I found another Q&A forum, and it's the one I'm working with now:

https://www.medhelp.org

because there are many more questions and answers about Asperger's there:

https://www.medhelp.org/forums/Aspergers-Syndrome/show/191?page=1

There are 46 pages, and it also shows information about the users: in every question or answer, when you click on the user, you can see their information...

Alright @anacho,

Here is the code that will scrape all the data you need from the forum. However, it is important to note that:

  • the code itself will run for a long time because there is A LOT to scrape! For this reason, I only scrape the first page, but the code should be able to scrape everything if you make the right changes

  • there are often comments under the posts in each thread, and those are not scraped here (one possible way to also capture them is sketched right after this list)
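
For reference, here is a rough sketch of how one might extend the post scraper to also capture those comments. The ".comment_body" selector is only a placeholder guess, not verified against the site; check the real selector with SelectorGadget or your browser's inspector first:

library(rvest)
library(stringr)

# Hypothetical: scrape the comments under each post in a thread.
# NOTE: ".comment_body" is a placeholder selector, not verified against medhelp.org.
scrape_comments <- function(html){
  read_html(html) %>%
    html_nodes(".comment_body") %>%   # replace with the real comment selector
    html_text() %>%
    str_replace_all("\r|\n", "") %>%
    str_trim()
}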

library(dplyr)
library(rvest)
library(purrr)
library(RCurl)
library(stringr)
library(tidyr)


# Estimate the number of pages on the forum: the forum title shows the total number
# of threads, and each page lists 20 of them, so total threads / 20 (rounded up)
# gives the page count.

page1_html <- getURL("https://www.medhelp.org/forums/Aspergers-Syndrome/show/191?page=1")

n_pages <- page1_html %>%
  read_html() %>%
  html_node("div.forum_title") %>%
  html_text() %>%
  str_extract_all("\\d+") %>%
  flatten_chr() %>%
  as.numeric() %>%
  `[`(3) %>%            # the third number in the title is the total thread count
  {ceiling(. / 20)}     # 20 threads per page; round up so a final partial page is kept
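
As a quick sanity check on that arithmetic (the thread count here is hypothetical, but the thread above mentions the forum has 46 pages):

# Hypothetical total of 915 threads at 20 threads per page:
915 / 20            # 45.75
ceiling(915 / 20)   # 46 -- matches the 46 pages mentioned above
# Without ceiling(), seq_len() below would truncate 45.75 to 45 and the last page
# would silently be skipped.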

# Get all thread titles and thread links

page_urls <- paste0("https://www.medhelp.org/forums/Aspergers-Syndrome/show/191?page=", seq_len(n_pages))

page_htmls <- map_chr(page_urls[1], getURL) # use page_urls instead of page_urls[1] if you want to scrape everything!

scrape_thread_titles <- function(html){
  read_html(html) %>%
    html_nodes(".subj_title a") %>%
    html_text()
}

scrape_thread_links <- function(html){
  read_html(html) %>%
  html_nodes(".subj_title a") %>%
    html_attr("href") %>%
    paste0("https://www.medhelp.org", .)
}

thread_titles <- map(page_htmls, scrape_thread_titles) %>%
  discard(~ length(.x) == 0)   # drop any pages that returned no thread titles

correct_n_pages <- length(thread_titles)

thread_titles <- thread_titles %>%
  flatten_chr()

thread_links <- map(page_htmls, scrape_thread_links) %>%
  `[`(seq_len(correct_n_pages)) %>%
  flatten_chr()

master_data <- tibble(thread_titles, thread_links)

# Scrape all thread posts and poster's IDs

thread_htmls <- map_chr(master_data$thread_links, getURL)

# (only needed if you want to test the functions below interactively on a single thread)
html <- thread_htmls[1]
link <- master_data$thread_links[1]

scrape_poster_ids <- function(html){
  read_html(html) %>%
    html_nodes(css = "span span") %>%
    html_text()
}

scrape_posts <- function(html){
  read_html(html) %>%
    html_nodes(".resp_body , #subject_msg") %>%
    html_text() %>%
    str_replace_all("\r|\n", "") %>%
    str_trim()
}

master_data <- master_data %>%
  mutate(
    poster_ids = map(thread_htmls, scrape_poster_ids),
    posts = map(thread_htmls, scrape_posts)
  ) %>%
  unnest()

head(master_data, 15)

   thread_titles thread_links                                  poster_ids   posts                                                                                 
   <chr>         <chr>                                         <chr>        <chr>                                                                                 
 1 Shutdown Mode https://www.medhelp.org/posts/Aspergers-Synd~ LearningGF   My boyfriend has Asperger's Sydrome. If he gets too confused, uncomfortable or hurt. ~
 2 Shutdown Mode https://www.medhelp.org/posts/Aspergers-Synd~ MJIthewriter When I shut down it's feeling overwhelmed.  imagine if you were thrown out in a hughw~
 3 Shutdown Mode https://www.medhelp.org/posts/Aspergers-Synd~ Sally44      I have a son who will be 8 in February.  When he gets overstimulated, or his expectat~
 4 Shutdown Mode https://www.medhelp.org/posts/Aspergers-Synd~ MaryannesMom "My Aspie husband would go through cycles, every couple of months he would need to be~
 5 Shutdown Mode https://www.medhelp.org/posts/Aspergers-Synd~ MJIthewriter Also headaches seem to trigger shutdowns. I had a bad one yesterday. Though the heada~
 6 Shutdown Mode https://www.medhelp.org/posts/Aspergers-Synd~ SueNYC       "Though I would say that my husband definitely does not have Asperger's, he definitel~
 7 Shutdown Mode https://www.medhelp.org/posts/Aspergers-Synd~ teburgan     hi Sue, I wanted to let you know I u derstand.  I should never have married my husban~
 8 Shutdown Mode https://www.medhelp.org/posts/Aspergers-Synd~ ryans93      "I have had various shut downs. Our minds simply cannot comprehend or deal with the s~
 9 Shutdown Mode https://www.medhelp.org/posts/Aspergers-Synd~ nbarslou     "My boyfriend of 9 months told me an old girlfriend said he had aspergers. My comment~
10 Shutdown Mode https://www.medhelp.org/posts/Aspergers-Synd~ Debraydebor~ "So happy to read your post. I have been desperate for more information to help me in~
11 Shutdown Mode https://www.medhelp.org/posts/Aspergers-Synd~ kristlep     I saw that its been awhile since you made this post um are you still with him because~
12 Shutdown Mode https://www.medhelp.org/posts/Aspergers-Synd~ MadMaddox999 "kristlep,\" ... it hurts a hole lot because when we first got together he was out go~
13 Shutdown Mode https://www.medhelp.org/posts/Aspergers-Synd~ RaeMinKai    "hello there, i have been married to my husband for close to 7 years and we have 3 ki~
14 Shutdown Mode https://www.medhelp.org/posts/Aspergers-Synd~ aerosmich    My boyfriend has asperger's and we have been living together for just about 9 months.~
15 Shutdown Mode https://www.medhelp.org/posts/Aspergers-Synd~ RUNNINGCATS  how long does a shut down last for, if the  person works with NT'smy friend started t~

As for the "About me" page, I am not sure exactly what you want to pull for that.

Hope this helps.


@gueyenono this is great!!! Thanks!
I think that with this code as a base I'll be able to work out how to include the "About me" info, and I'll create a new post if I can't. I'll mark this as solved. Thanks a lot!
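
As a starting point, here is roughly what I have in mind (just a rough sketch: the ".about_me" selector is a guess I still need to verify with SelectorGadget, and missing or empty profiles should come back as NA):

library(rvest)
library(stringr)
library(purrr)
library(dplyr)

# Hypothetical sketch: pull the "About me" text from a user profile page,
# returning NA when the page can't be read or the section is empty/missing.
# NOTE: ".about_me" is a placeholder selector; check the real one before use.
scrape_about_me <- function(profile_url){
  page <- tryCatch(read_html(profile_url), error = function(e) NULL)
  if (is.null(page)) return(NA_character_)      # profile page failed to load
  about <- page %>%
    html_node(".about_me") %>%                  # missing node -> html_text() gives NA
    html_text()
  if (is.na(about) || str_trim(about) == "") return(NA_character_)
  str_trim(about)
}

# Usage, assuming a column of profile URLs (e.g. poster_links) has been scraped:
# master_data <- master_data %>%
#   mutate(about_me = map_chr(poster_links, scrape_about_me))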

Okay, sounds like a plan. Glad I could help.
