I am having an issue with web scraping using this code

Hello all, please take a look at the code below. I have written two different versions expecting the same output, but both return an HTTP 404 error.

review_pages <- vector()
for (i in 1:209){
  link=paste0("https://www.airlinequality.com/airline-reviews/air-canada/", i)
  airline <- read_html(link)%>% html_nodes(".text_content") %>% 
    html_text2()%>% tibble(id=i,text=.)
  review_pages <- c(review_pages, airline)
}
reviews<- review_pages

review_pages <- 1:209
reviews <- map_dfr(review_pages, function(i){
  link <- paste0("https://www.airlinequality.com/airline-reviews/air-canada/page=",i)
  airline <- read_html(link)%>%
    html_nodes(".text_content") %>% html_text2() %>% 
    tibble(id=i, text =.)
  return(airline)
})

Try updating the link to the following:

link=paste0("https://www.airlinequality.com/airline-reviews/air-canada/page/", i, "/")

It is still showing the same error.

I was able to successfully run the code below, which includes the updated link. Package versions shown.

Are you experiencing the error on a particular page?

library(tidyverse) # version 1.3.2
library(rvest) # version 1.0.3

review_pages <- 1:209

reviews <- map_dfr(review_pages, function(i){
  link <- paste0("https://www.airlinequality.com/airline-reviews/air-canada/page/", i, "/")
  airline <- read_html(link)%>%
    html_nodes(".text_content") %>% html_text2() %>% 
    tibble(id=i, text =.)
  return(airline)
})

glimpse(reviews)
#> Rows: 2,087
#> Columns: 2
#> $ id   <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3,…
#> $ text <chr> "Not Verified | Stunning incompetence and disregard for comfort, …
max(reviews$id)
#> [1] 209

Created on 2023-01-28 with reprex v2.0.2
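
If the error only shows up on some pages, one way to find out which ones is to wrap the request so a failure is reported instead of stopping the whole loop. The following is only a minimal sketch under that assumption: `page_url()` and `scrape_page()` are hypothetical helper names introduced here for illustration, and the URL pattern is the updated one from the code above.

```r
library(purrr)
library(rvest)
library(tibble)

# Build the paginated URL in the ".../page/<n>/" form used above.
page_url <- function(i) {
  paste0("https://www.airlinequality.com/airline-reviews/air-canada/page/", i, "/")
}

# Hypothetical helper: scrape one page. If the request fails (for example
# with an HTTP 404), print which page failed and return a zero-row tibble
# so the remaining pages are still processed.
scrape_page <- function(i) {
  tryCatch(
    {
      text <- read_html(page_url(i)) %>%
        html_nodes(".text_content") %>%
        html_text2()
      tibble(id = rep(i, length(text)), text = text)
    },
    error = function(e) {
      message("Page ", i, " failed: ", conditionMessage(e))
      tibble(id = integer(), text = character())
    }
  )
}

# Usage (makes one web request per page, so it is left commented out):
# reviews <- map_dfr(1:209, scrape_page)
```

Any page numbers printed by `message()` are the ones to look at more closely.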


I ran this code and it worked well. I obtained the same results:

> reviews
# A tibble: 2,087 × 2
      id text                                           
   <int> <chr>                                          
 1     1 Not Verified | Stunning incompetence and disre…
 2     1 Not Verified | We sat in seats 2D and 2F. From…
 3     1 Not Verified | I was booked on flight AC 1656 …
 4     1 ✅ Trip Verified | Casablanca to Montreal. My …
 5     1 ✅ Trip Verified | Disastrous delayed baggage …
 6     1 ✅ Trip Verified | Flight delays for maintenan…
 7     1 Not Verified | Coming back from Yellowknife ma…
 8     1 ✅ Trip Verified | Due to a ridiculous carry-o…
 9     1 ✅ Trip Verified | My suitcase didn't make it.…
10     1 Not Verified | The flight was delayed by 8 hou…
# … with 2,077 more rows
# ℹ Use `print(n = ...)` to see more rows
> glimpse(reviews)
Rows: 2,087
Columns: 2
$ id   <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3,…
$ text <chr> "Not Verified | Stunning incompetence and disregard for comfort, …
> 


@scottyd22 and @M_AcostaCH, thanks so much. It has run successfully.

@scottyd22 However, I tried separating the text column using the code below, but the output was not what I expected:

text_cleaning <- reviews %>% separate_rows(text, sep="|")
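
One likely cause: the `sep` argument is interpreted as a regular expression, and an unescaped `|` is the regex "or" operator rather than a literal pipe character, so it needs escaping as `"\\|"`. Also, `separate_rows()` creates new rows rather than new columns. Assuming the goal is to split the verification status from the review text into two columns, `separate()` may be closer to what is wanted. The sketch below uses made-up sample rows that only imitate the scraped format, and the column names `verification` and `review` are assumptions.

```r
library(tidyr)
library(tibble)

# Made-up sample rows imitating the scraped format (not real scraped output).
sample_reviews <- tibble(
  id = c(1L, 1L),
  text = c(
    "Not Verified | The flight was delayed.",
    "Trip Verified | Good crew | would fly again."
  )
)

# Escape the pipe so it is matched literally, and absorb surrounding spaces.
# extra = "merge" splits only at the first pipe, so any later pipes stay
# inside the review text instead of being dropped.
text_cleaning <- sample_reviews %>%
  separate(
    text,
    into = c("verification", "review"),
    sep = "\\s*\\|\\s*",
    extra = "merge"
  )
```

With the real data, `reviews` would take the place of `sample_reviews`.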

Please mark @scottyd22's post as the solution. I only copied and reproduced what he did.
All credit goes to him.
