Scrape this HTML and search links

Hi community

I want to scrape this page and the other pages in the search results.
I'm having some problems because some nodes have the same name but are different items.

library(rvest)
library(xml2)
library(httr)   # needed for GET() and add_headers() below
library(dplyr)
library(tibble)
library(lubridate)
library(tm)

url<-"https://cgspace.cgiar.org/discover?rpp=10&etal=0&query=cassava&scope=10568/35697&group_by=none&page=1"

url <- GET(url, add_headers('user-agent' = 'Gov employment data scraper ([[your email]])'))

text_html <- url %>% read_html()
text_html


Title<-text_html %>% 
  html_nodes(".description-info") %>% 
  html_text(trim = T)  
Title

# Author has the same node name as Title, which is why Title has 20 entries: the other 10 are the authors on the page
Autor<-text_html %>% 
  html_nodes(".description-info") %>% 
  html_text(trim = T)  
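
One way to pull those apart is to scope each selection to its result container first and take only the first match inside it. A minimal sketch, assuming each result sits in DSpace's usual .ds-artifact-item wrapper (an assumed class name; inspect the page source to confirm):

# Sketch: select per result item, then pull fields within each item.
# '.ds-artifact-item' is an assumption about the page's markup.
items <- text_html %>% html_nodes(".ds-artifact-item")

# html_node() (singular) keeps only the first match per item, so the title
# (which comes before the author) is separated out of the shared class
Title <- items %>% html_node(".description-info") %>% html_text(trim = TRUE)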


date <-text_html %>% 
  html_nodes(".date") %>% 
  html_text(trim = T)

# Many values, because there are several nodes with this class in different zones of the page.
Type <-text_html %>% 
  html_nodes(".artifact-type") %>% 
  html_text(trim = T)


To select the final page (324):
p_ultima <- '//*[@id="aspect_discovery_SimpleSearch_div_search"]/div[4]/div/ul/li[7]/a'
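
Rather than hard-coding 324, the last page number can be read from that pagination link. A rough sketch, assuming the link's href carries a page= parameter (check the real attribute on the page):

# Sketch: pull the href of the last-page link and parse out the page number.
# Assumes the href contains 'page=<n>'; verify against the actual markup.
last_href <- text_html %>%
  html_node(xpath = p_ultima) %>%
  html_attr("href")
last_page <- as.integer(sub(".*page=(\\d+).*", "\\1", last_href))
last_page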


Any help or suggestions on how to do this?
The idea is to end up with a data frame with these variables.

Below is a function for scraping the elements on the page, as well as an example of collecting the first two pages.

library(tidyverse)
library(rvest)

scrape_page = function(i) {
  
  url <- paste0("https://cgspace.cgiar.org/discover?rpp=10&etal=0&query=cassava&scope=10568/35697&group_by=none&page=",
                i)
  
  # grab every label and value node in document order
  df = read_html(url) %>%
    html_nodes('.descriptionlabel , .date , .description-info , .artifact-type') %>%
    html_text(trim = TRUE) %>%
    tibble() %>%
    rename_at(1, ~paste('content')) %>%
    # drop the bare 'Type:' / 'Status:' labels, then mark the remaining
    # 'Type:' / 'Status:' prefixes so they can be split from their values
    filter(!content %in% c('Type:', 'Status:')) %>%
    mutate(content = str_replace(content, 'Status:', 'Status:|'),
           content = str_replace(content, 'Type:', 'Type:|')) %>%
    separate_rows(content, sep = '\\|') %>%
    # alternate rows are labels (row == 1) and values (row == 0)
    mutate(row = row_number() %% 2)
  
  out = tibble(label = df$content[df$row == 1],
               value = df$content[df$row == 0]) %>%
    mutate(label = str_replace(label, ':', '')) %>%
    # every entry contains 5 rows of data
    mutate(entry = ceiling(row_number()/5)) %>%
    pivot_wider(names_from = label, values_from = value) %>%
    mutate(search_page = i) %>%
    select(search_page, everything())
  
  out
  
}

final_output = lapply(1:2, scrape_page) %>%
  bind_rows()

final_output
#> # A tibble: 20 × 7
#>    search_page entry Title                            Authors Date  Type  Status
#>          <int> <dbl> <chr>                            <chr>   <chr> <chr> <chr> 
#>  1           1     1 Industrializacion de la yuca     Díaz D… 1972  Repo… Open …
#>  2           1     2 Cassava: a resilient crop with … Intern… 2014… Image Open …
#>  3           1     3 CIAT's Tony Bellotti talks abou… Bellot… 2011… Video Open …
#>  4           1     4 Cassava in Asia: a potential ne… Howele… 2010  Conf… Open …
#>  5           1     5 Mealybug threat to cassava       Intern… 2014… Image Open …
#>  6           1     6 A socio-economic study of cassa… Strobo… 1976  Book  Open …
#>  7           1     7 Manual for the construction and… Ospina… 1981  Manu… Open …
#>  8           1     8 Advances on Genome Edition of C… Chavar… 2018  Conf… Limit…
#>  9           1     9 Diseases affecting cassava       Legg, … 2017  Book… Limit…
#> 10           1    10 Cassava cultivation and starch … Strobo… 1979  Repo… Open …
#> 11           2     1 Drivers of change for cassava’s… Hershe… 2017… Book… Limit…
#> 12           2     2 The development of a through ci… Best, … 1981  Repo… Open …
#> 13           2     3 Cassava Genetics’ data manageme… Becerr… 2015… Pres… Open …
#> 14           2     4 Guia para la construccion de un… Herrer… 1983  Manu… Open …
#> 15           2     5 The viruses and virus diseases … Calver… 2002  Book… Open …
#> 16           2     6 Development and use of biotechn… Intern… 2004  Book… Open …
#> 17           2     7 Molecular approaches in cassava… Becerr… 2017  Book… Limit…
#> 18           2     8 GCP21: a global cassava partner… Fauque… 2017… Book… Limit…
#> 19           2     9 Developing new cassava varietie… Ceball… 2017  Book… Limit…
#> 20           2    10 Positional cloning of CMD2 the … Moreno… 2004  Post… Open …

Created on 2022-10-18 with reprex v2.0.2


This is an amazing response. I want to reach this level of web scraping. :handshake:

When I change the page number:

final_output = lapply(1:324, scrape_page) %>% # to get all pages
  bind_rows()

it shows this error:

Error in `stop_vctrs()`:
! Can't combine `..1$Title` <character> and `..24$Title` <list>.
Run `rlang::last_error()` to see where the error occurred.
Warning message:
Values are not uniquely identified; output will contain list-cols.
* Use `values_fn = list` to suppress this warning.
* Use `values_fn = length` to identify where the duplicates arise
* Use `values_fn = {summary_fun}` to summarise duplicates 

I tried with 20 pages and got no errors, but when I go up to 50 pages the same error appears.

I checked these pages manually but couldn't find any differences in the items.

# some problem pages
# 24 - 99 - 185 - 214 - 280 - 297 # very strange
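
One way to locate pages like these without checking by hand is to scrape each page separately and flag the results that come back with list-columns, which is the symptom behind the pivot_wider() warning and what makes bind_rows() fail. A rough sketch:

# Sketch: scrape pages one at a time, then flag any result containing
# list-columns before bind_rows() hits the type-mismatch error
pages <- lapply(1:50, scrape_page)
bad_pages <- which(vapply(pages,
                          function(p) any(vapply(p, is.list, logical(1))),
                          logical(1)))
bad_pages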

The comment "# every entry contains 5 rows of data" was true for the first two pages, but not for all of the others. I updated the out section of the function with the code below and encountered no errors when running through the first 50 pages. I also ran through each of the problem pages you provided (thank you!) and encountered no errors. The reason those pages errored is that one entry on each page was missing either a Date, Type, or Status.

out = tibble(label = df$content[df$row == 1],
             value = df$content[df$row == 0]) %>%
  mutate(label = str_replace(label, ':', '')) %>%
  # group labels together for an "entry": each 'Title' starts a new entry,
  # so entries no longer need a fixed number of rows
  mutate(entry = ifelse(label == 'Title', 1, 0),
         entry = cumsum(entry)) %>%
  pivot_wider(names_from = label, values_from = value) %>%
  mutate(search_page = i) %>%
  select(search_page, everything())
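
With that change swapped into scrape_page(), the full run is the same call as before. Pausing briefly between requests is a polite addition (my assumption, not something the fix requires):

# Sketch: full run with the patched function, with a short courtesy delay
final_output = lapply(1:324, function(i) {
  Sys.sleep(0.5) # assumed delay; adjust or drop as needed
  scrape_page(i)
}) %>%
  bind_rows()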

The new out block ran through all the links without problems, in 30.46885 minutes.

Thanks again!
