I have a problem with the rvest package: I cannot scrape multiple URLs at the same time

rvest
#1

Hi, I need your help.
I am trying to scrape data from multiple URLs, as the code below shows. The problem is that the output contains the data from only one URL, even though the script generates three URLs.

#Loading the rvest package
library('rvest')

#Specifying the url for desired website to be scraped
page <- list()
inshows = c("100","200","400")
for(u in inshows) {
url <- paste0('https://www.imdb.com/search/title?title_type=feature&release_date=2016-01-01,2016-12-31&count=100&start=', u,'&ref_=adv_nxt')
page[[u]] <- read_html(url)
}
#Reading the HTML code from the website
webpage <- read_html(url)
#Using CSS selectors to scrap the rankings section
rank_data_html <- html_nodes(webpage,'.text-primary')

#Converting the ranking data to text
rank_data <- html_text(rank_data_html)
0 Likes

#2
for(u in inshows) {
url <- paste0('https://www.imdb.com/search/title?title_type=feature&release_date=2016-01-01,2016-12-31&count=100&start=', u,'&ref_=adv_nxt')
page[[u]] <- read_html(url)
}
#Reading the HTML code from the website
webpage <- read_html(url)

Here you read each URL and store the result in the page list, but you never use page. Instead, you re-read url, which still holds the value built from the last u.

So you get the ranking data only for that last URL.

Can you clarify your issue and provide a reprex of your problem?

Otherwise, you should apply html_nodes and html_text to each element of the page list. I think that will work.
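
To illustrate the suggestion above, here is a minimal sketch that applies html_nodes and html_text to every element of a list of parsed pages. The two HTML strings are made-up stand-ins (not real IMDb markup) so the example runs without a network connection, and the names html_strings and rank_data are only illustrative.

```r
library(rvest)

# Two tiny stand-in pages (hypothetical HTML, not the real IMDb markup)
html_strings <- c(
  "100" = "<html><body><span class='text-primary'>100.</span><span class='text-primary'>101.</span></body></html>",
  "200" = "<html><body><span class='text-primary'>200.</span><span class='text-primary'>201.</span></body></html>"
)

# Parse each page once and keep the parsed documents in a named list
page <- lapply(html_strings, read_html)

# Apply the selector and the text extraction to every element of the list
rank_data <- lapply(page, function(doc) {
  html_text(html_nodes(doc, ".text-primary"))
})

rank_data
```

With real pages, page would be the list filled by the for loop, and the second lapply() call would stay the same.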

0 Likes

#3

Hi

My problem is that I want to visit different pages, and for each page I want to scrape the data that I specified in the CSS selector.
The links of the pages that I want to scrape are
https://www.imdb.com/search/title?title_type=feature&release_date=2016-01-01,2016-12-31&count=100&start=200&ref_=adv_nxt (link to the second page)
https://www.imdb.com/search/title?title_type=feature&release_date=2016-01-01,2016-12-31&count=100&start=300&ref_=adv_nxt (link to the next page)
As you can see, the link stays the same from page to page except for the last part, which is start=...&ref_=adv_nxt.
So I wanted to vary that part by replacing its value with u and giving u different values. Unfortunately, it seems that I can only reach the page for the last value of u.
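
Since only the start value changes between pages, all the URLs can be built in one step. This is just a sketch: the base URL is copied from the links above, while the names base_url, starts and urls, and the choice of offsets, are assumptions for illustration.

```r
# Base of the search URL, copied from the links above
base_url <- "https://www.imdb.com/search/title?title_type=feature&release_date=2016-01-01,2016-12-31&count=100&start="

# Offsets to request (assumed here; seq() is one convenient way to generate them)
starts <- seq(100, 400, by = 100)

# paste0() is vectorised, so this produces one URL per offset
urls <- paste0(base_url, starts, "&ref_=adv_nxt")

urls
```

Each element of urls can then be passed to read_html() in a loop or with lapply().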

0 Likes

#4

Did you try to do that inside your for loop?
What you are doing works: for each value of u, you build the URL and read the HTML. You should then extract the values you want at that point, inside the loop.

library('rvest')
#> Loading required package: xml2

#Specifying the url for desired website to be scraped
page <- list()
inshows = c("100","200","400")
for(u in inshows) {
  url <- paste0('https://www.imdb.com/search/title?title_type=feature&release_date=2016-01-01,2016-12-31&count=100&start=', u,'&ref_=adv_nxt')
  page[[u]] <- read_html(url) %>% html_nodes('.text-primary') %>% html_text()
}
page
#> $`100`
#>   [1] "100." "101." "102." "103." "104." "105." "106." "107." "108." "109."
#>  [11] "110." "111." "112." "113." "114." "115." "116." "117." "118." "119."
#>  [21] "120." "121." "122." "123." "124." "125." "126." "127." "128." "129."
#>  [31] "130." "131." "132." "133." "134." "135." "136." "137." "138." "139."
#>  [41] "140." "141." "142." "143." "144." "145." "146." "147." "148." "149."
#>  [51] "150." "151." "152." "153." "154." "155." "156." "157." "158." "159."
#>  [61] "160." "161." "162." "163." "164." "165." "166." "167." "168." "169."
#>  [71] "170." "171." "172." "173." "174." "175." "176." "177." "178." "179."
#>  [81] "180." "181." "182." "183." "184." "185." "186." "187." "188." "189."
#>  [91] "190." "191." "192." "193." "194." "195." "196." "197." "198." "199."
#> 
#> $`200`
#>   [1] "200." "201." "202." "203." "204." "205." "206." "207." "208." "209."
#>  [11] "210." "211." "212." "213." "214." "215." "216." "217." "218." "219."
#>  [21] "220." "221." "222." "223." "224." "225." "226." "227." "228." "229."
#>  [31] "230." "231." "232." "233." "234." "235." "236." "237." "238." "239."
#>  [41] "240." "241." "242." "243." "244." "245." "246." "247." "248." "249."
#>  [51] "250." "251." "252." "253." "254." "255." "256." "257." "258." "259."
#>  [61] "260." "261." "262." "263." "264." "265." "266." "267." "268." "269."
#>  [71] "270." "271." "272." "273." "274." "275." "276." "277." "278." "279."
#>  [81] "280." "281." "282." "283." "284." "285." "286." "287." "288." "289."
#>  [91] "290." "291." "292." "293." "294." "295." "296." "297." "298." "299."
#> 
#> $`400`
#>   [1] "400." "401." "402." "403." "404." "405." "406." "407." "408." "409."
#>  [11] "410." "411." "412." "413." "414." "415." "416." "417." "418." "419."
#>  [21] "420." "421." "422." "423." "424." "425." "426." "427." "428." "429."
#>  [31] "430." "431." "432." "433." "434." "435." "436." "437." "438." "439."
#>  [41] "440." "441." "442." "443." "444." "445." "446." "447." "448." "449."
#>  [51] "450." "451." "452." "453." "454." "455." "456." "457." "458." "459."
#>  [61] "460." "461." "462." "463." "464." "465." "466." "467." "468." "469."
#>  [71] "470." "471." "472." "473." "474." "475." "476." "477." "478." "479."
#>  [81] "480." "481." "482." "483." "484." "485." "486." "487." "488." "489."
#>  [91] "490." "491." "492." "493." "494." "495." "496." "497." "498." "499."

Created on 2019-01-16 by the reprex package (v0.2.1)


One way to do it using the tidyverse:

library(rvest)
#> Loading required package: xml2
library(purrr)
#> Warning: package 'purrr' was built under R version 3.5.2
#> 
#> Attaching package: 'purrr'
#> The following object is masked from 'package:rvest':
#> 
#>     pluck

inshows = c("100","200","400")

urls <- glue::glue("https://www.imdb.com/search/title?title_type=feature&release_date=2016-01-01,2016-12-31&count=100&start={inshows}&ref_=adv_nxt")
urls
#> https://www.imdb.com/search/title?title_type=feature&release_date=2016-01-01,2016-12-31&count=100&start=100&ref_=adv_nxt
#> https://www.imdb.com/search/title?title_type=feature&release_date=2016-01-01,2016-12-31&count=100&start=200&ref_=adv_nxt
#> https://www.imdb.com/search/title?title_type=feature&release_date=2016-01-01,2016-12-31&count=100&start=400&ref_=adv_nxt

urls %>%
  set_names(inshows) %>%
  map(~ {
    read_html(.x) %>%
      html_nodes('.text-primary') %>% 
      html_text()
  })
#> $`100`
#>   [1] "100." "101." "102." "103." "104." "105." "106." "107." "108." "109."
#>  [11] "110." "111." "112." "113." "114." "115." "116." "117." "118." "119."
#>  [21] "120." "121." "122." "123." "124." "125." "126." "127." "128." "129."
#>  [31] "130." "131." "132." "133." "134." "135." "136." "137." "138." "139."
#>  [41] "140." "141." "142." "143." "144." "145." "146." "147." "148." "149."
#>  [51] "150." "151." "152." "153." "154." "155." "156." "157." "158." "159."
#>  [61] "160." "161." "162." "163." "164." "165." "166." "167." "168." "169."
#>  [71] "170." "171." "172." "173." "174." "175." "176." "177." "178." "179."
#>  [81] "180." "181." "182." "183." "184." "185." "186." "187." "188." "189."
#>  [91] "190." "191." "192." "193." "194." "195." "196." "197." "198." "199."
#> 
#> $`200`
#>   [1] "200." "201." "202." "203." "204." "205." "206." "207." "208." "209."
#>  [11] "210." "211." "212." "213." "214." "215." "216." "217." "218." "219."
#>  [21] "220." "221." "222." "223." "224." "225." "226." "227." "228." "229."
#>  [31] "230." "231." "232." "233." "234." "235." "236." "237." "238." "239."
#>  [41] "240." "241." "242." "243." "244." "245." "246." "247." "248." "249."
#>  [51] "250." "251." "252." "253." "254." "255." "256." "257." "258." "259."
#>  [61] "260." "261." "262." "263." "264." "265." "266." "267." "268." "269."
#>  [71] "270." "271." "272." "273." "274." "275." "276." "277." "278." "279."
#>  [81] "280." "281." "282." "283." "284." "285." "286." "287." "288." "289."
#>  [91] "290." "291." "292." "293." "294." "295." "296." "297." "298." "299."
#> 
#> $`400`
#>   [1] "400." "401." "402." "403." "404." "405." "406." "407." "408." "409."
#>  [11] "410." "411." "412." "413." "414." "415." "416." "417." "418." "419."
#>  [21] "420." "421." "422." "423." "424." "425." "426." "427." "428." "429."
#>  [31] "430." "431." "432." "433." "434." "435." "436." "437." "438." "439."
#>  [41] "440." "441." "442." "443." "444." "445." "446." "447." "448." "449."
#>  [51] "450." "451." "452." "453." "454." "455." "456." "457." "458." "459."
#>  [61] "460." "461." "462." "463." "464." "465." "466." "467." "468." "469."
#>  [71] "470." "471." "472." "473." "474." "475." "476." "477." "478." "479."
#>  [81] "480." "481." "482." "483." "484." "485." "486." "487." "488." "489."
#>  [91] "490." "491." "492." "493." "494." "495." "496." "497." "498." "499."

Created on 2019-01-16 by the reprex package (v0.2.1)
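
If a single table is more convenient than a named list, the result can be flattened afterwards. The sketch below uses a small hand-written list in place of the scraped output, and the names ranks, rank_table, rank_text, start and rank are only illustrative.

```r
# Stand-in for the named list returned by the loop or by map()
ranks <- list(
  "100" = c("100.", "101.", "102."),
  "200" = c("200.", "201.", "202.")
)

# stack() turns a named list of vectors into a two-column data frame
rank_table <- stack(ranks)
names(rank_table) <- c("rank_text", "start")

# Drop the trailing dot and convert the rank to an integer
rank_table$rank <- as.integer(sub("\\.$", "", rank_table$rank_text))

rank_table
```

The start column records which page each rank came from, so nothing is lost by flattening.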

0 Likes

#5

Thank you very much, cderv. I understand my mistake now, and it works well.

Thank you very much.

1 Like

#6

If your question's been answered (even by you!), would you mind choosing a solution? It helps other people see which questions still need help, or find solutions if they have similar problems. Here’s how to do it:

0 Likes

#7

OK, no problem, it is my duty.
I will write up a proper explanation of the solution.
Thank you very much.

0 Likes

#8

Hi professor,
So my problem was related to the selectors. I should have applied them right after read_html(url), inside the loop, as in the code below, because I want that data for every link.

page[[u]] <- read_html(url) %>% html_nodes('.text-primary') %>% html_text()
}
page

So my mistake was that I applied them after the loop, as in the code below, which is why I got the data only for the last link.

page[[u]] <- read_html(url)
}

webpage <- read_html(url)

rank_data_html <- html_nodes(webpage,'.text-primary')
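
The reason only the last link was used can be seen in a small sketch without any scraping: a variable assigned inside a for loop keeps its last value once the loop ends. The shortened URL here is a stand-in for the full one.

```r
inshows <- c("100", "200", "400")

for (u in inshows) {
  # Shortened stand-in for the full IMDb URL
  url <- paste0("start=", u)
}

# After the loop, url holds only the value from the last iteration
url
```

So any read_html(url) placed after the loop can only ever see the page for the last value of u.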
0 Likes
