Error when scraping date in Q&A forum with rvest

anacho · April 28, 2019, 10:48am

Hello,

Thanks to @geyenono's code in previous posts, I was able to scrape this Q&A forum:
https://www.healthboards.com/boards/aspergers-syndrome/index1.html

Now I'm trying to add the date of each post using the scrape_dates function.
It worked fine for the first page, but when I run the same code (below) on the 2nd page

https://www.healthboards.com/boards/aspergers-syndrome/index2.html

I get the error

Error: Column thread_starters must be length 1 or 22, not 21

library(rvest)
library(dplyr)
library(stringr)
library(purrr)
library(tidyr)
library(RCurl)
library(openxlsx)
#library(xlsx)
#install.packages("xlsx")
# Scrape thread titles, thread links, authors and number of views

url <- "https://www.healthboards.com/boards/aspergers-syndrome/index2.html"


h <- read_html(url)

threads <- h %>%
  html_nodes("#threadslist .alt1 a") %>%
  html_text()

thread_links <- h %>%
  html_nodes("#threadslist .alt1 a") %>%
  html_attr(name = "href")

thread_starters <- h %>%
  html_nodes("#threadslist .alt1 .smallfont") %>%
  html_text() %>%
  str_replace_all(pattern = "\t|\r|\n", replacement = "")

views <- h %>%
  html_nodes(".alt2:nth-child(6)") %>%
  html_text() %>%
  str_replace_all(pattern = ",", replacement = "") %>%
  as.numeric()

# Custom functions to scrape posts, dates and author IDs

scrape_posts <- function(link){
  read_html(link) %>%
    html_nodes(css = ".smallfont~ hr+ div") %>%
    html_text() %>%
    str_replace_all(pattern = "\t|\r|\n", replacement = "") %>%
    str_trim()
}



scrape_dates <- function(link){
  read_html(link) %>%
    html_nodes(css = "table[id^='post'] td.thead:first-child") %>%
    html_text() %>%
    str_replace_all(pattern = "\t|\r|\n", replacement = "") %>%
    str_trim()
}


scrape_author_ids <- function(link){
  h <- read_html(link) %>%
    html_nodes("div")
  
  id_index <- h %>%
    html_attr("id") %>%
    str_which(pattern = "postmenu")
  
  h %>%
    `[`(id_index) %>%
    html_text() %>%
    str_replace_all(pattern = "\t|\r|\n", replacement = "") %>%
    str_trim()
}


# Download the raw HTML of every thread once; the scrape_* functions parse these strings
htmls <- map(thread_links, getURL)

# Create master dataset

master_data <-
  tibble(threads, thread_starters, thread_links) %>%
  mutate(
    post_author_id = map(htmls, scrape_author_ids),
    post = map(htmls, scrape_posts),
    dat = map(htmls, scrape_dates)
  ) %>%
  select(threads:post_author_id, post, thread_links, dat) %>%
  unnest()




titles <- master_data$threads
thread_starters <- master_data$thread_starters
#views <- master_data$views

post_author <- master_data$post_author_id
post <- master_data$post
da <- master_data$dat
employ.data <- data.frame(titles, thread_starters, post_author, post, da)


write.xlsx(employ.data, "C:/Asperger/2.xlsx",colNames = TRUE)
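
The error above says that the columns passed to tibble() have different lengths (22 and 21). That can happen when each vector comes from its own page-wide selector and one thread has no matching element. Below is a minimal sketch of a row-wise alternative, run on a made-up HTML fragment rather than the real page. The markup, the name threads_df, and the idea that a missing .smallfont element is the cause here are all assumptions for illustration, not something confirmed for healthboards.com.

```r
library(rvest)

# Toy thread list standing in for the real page (hypothetical markup):
# the second row has no .smallfont element.
page <- read_html(paste0(
  '<table id="threadslist">',
  '<tr><td class="alt1"><a href="t1.html">Thread one</a>',
  '<div class="smallfont">alice</div></td></tr>',
  '<tr><td class="alt1"><a href="t2.html">Thread two</a></td></tr>',
  '<tr><td class="alt1"><a href="t3.html">Thread three</a>',
  '<div class="smallfont">carol</div></td></tr>',
  '</table>'
))

# Page-wide selectors: each vector only contains the elements that exist,
# so the lengths can differ and tibble() would refuse to combine them.
titles_flat   <- page %>% html_nodes("#threadslist .alt1 a") %>% html_text()
starters_flat <- page %>% html_nodes("#threadslist .alt1 .smallfont") %>% html_text()
lengths_flat  <- c(titles = length(titles_flat), starters = length(starters_flat))

# Row-wise extraction: select one cell per thread first, then use
# html_node() (singular), which returns one result per cell and NA
# where nothing matches, so all columns keep the same length.
cells <- page %>% html_nodes("#threadslist td.alt1")
threads_df <- data.frame(
  threads         = cells %>% html_node("a") %>% html_text(),
  thread_links    = cells %>% html_node("a") %>% html_attr("href"),
  thread_starters = cells %>% html_node(".smallfont") %>% html_text(),
  stringsAsFactors = FALSE
)
```

Comparing lengths_flat with nrow(threads_df) shows which selector comes up short. On the real page, the cell selector would have to be checked against the actual markup.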

Getting the dates has been difficult. As shown in the code above, I had to use this selector:
css = "table[id^='post'] td.thead:first-child"

which sometimes works and sometimes doesn't.
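
One way to see where a selector "sometimes doesn't" work is to count, per thread, how many items each scraper returned before calling unnest(). The sketch below uses only base R. The helper name check_lengths and the toy lists are hypothetical; in the real script the inputs would be the list-columns post_author_id, post and dat.

```r
# For each thread, count the items returned by every scraper and flag
# threads where the counts disagree.
check_lengths <- function(...) {
  cols <- list(...)
  # lengths() gives the number of items per thread for each list-column
  counts <- as.data.frame(lapply(cols, lengths))
  counts$consistent <- apply(counts, 1, function(r) length(unique(r)) == 1)
  counts
}

# Toy data: two threads; the date scraper found nothing for the second one
authors <- list(c("user_a", "user_b"), c("user_c"))
posts   <- list(c("post 1", "post 2"), c("post 3"))
dates   <- list(c("date 1", "date 2"), character(0))

report <- check_lengths(post_author_id = authors, post = posts, dat = dates)
```

Rows with consistent equal to FALSE point at the threads whose HTML is worth inspecting by hand.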

anacho · April 29, 2019, 11:20am

Hi! I found that the problem is not the dates I'm trying to add. The problem occurs when the answers to a question span more than one page, as in this thread:

https://www.healthboards.com/boards/aspergers-syndrome/859564-aspergers-talking-yourself.html

There are two pages of answers there, and that is the problem: I'm missing them. It's strange, because it seems I should be able to get both pages with this selector:

.alt1 .smallfont~ div

But I can't.
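
A CSS selector only sees the HTML of the page that was downloaded, so answers on page 2 have to be fetched from their own URL. A minimal sketch of collecting the page URLs of one thread follows. The function name thread_page_urls, the selector ".pagenav a", the toy markup and the example.com URLs are all assumptions for illustration. The real pagination markup on healthboards.com has to be inspected first.

```r
library(rvest)
library(xml2)

# Return the URLs of all pages of a thread: the page itself plus every
# distinct pagination link found on it.
# `pager_css` is a hypothetical selector, not taken from the real site.
thread_page_urls <- function(doc, thread_url, pager_css = ".pagenav a") {
  hrefs <- doc %>% html_nodes(pager_css) %>% html_attr("href")
  hrefs <- hrefs[!is.na(hrefs)]
  if (length(hrefs) == 0) {
    return(thread_url)  # single-page thread
  }
  # Resolve relative links against the thread URL and drop duplicates
  unique(c(thread_url, url_absolute(hrefs, thread_url)))
}

# Toy first page of a two-page thread (made-up markup)
first_page <- read_html(paste0(
  '<div class="pagenav">',
  '<a href="thread-2.html">2</a><a href="thread-2.html">Next</a>',
  '</div>',
  '<div class="post">first reply</div>'
))

urls <- thread_page_urls(first_page, "https://example.com/boards/thread.html")
```

With the real selector in place, each element of urls could be downloaded and passed to scrape_posts, scrape_dates and scrape_author_ids, and the results concatenated per thread.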
