Hi community
I obtained this data through web scraping. I want to split each record into columns such as Author, Year, Title, DOI, and Other info; in particular, I want to extract the author name on its own.
This is an example from a larger data set with the same structure.
I don't know how to do this. Maybe with gsub?
DATAVERSE <-c("Hyman, Glenn Graham, 2020, \"Global Climate Regions for Cassava\", https://doi.org/10.7910/DVN/WFAMUM, Harvard Dataverse, V2",
"Dyer, George; González, Carolina; Lopera, Diana C, 2012, \"Informal “Seed” systems and the management of gene flow in traditional agroecosystems: the case of cassava in Cauca, Colombia\", https://doi.org/10.7910/DVN/DEWGIF, Harvard Dataverse, V1"
)
Author | Year | Title | DOI | Other info |
---|---|---|---|---|
Hyman, Glenn Graham | 2020 | Global Climate Regions for Cassava | https://doi.org/10.7910/DVN/WFAMUM | Harvard Dataverse, V2 |
Dyer, George; González, Carolina; Lopera, Diana C | 2012 | Informal “Seed” systems and the management of gene flow in traditional agroecosystems: the case of cassava in Cauca, Colombia | https://doi.org/10.7910/DVN/DEWGIF | Harvard Dataverse, V1 |
Thanks!
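One possible way to get the columns in the table above is a single regular expression with capture groups instead of several gsub calls. The following is only a sketch: it assumes every record follows the pattern `authors, 4-digit year, "title", URL, other info` with straight double quotes around the title, as in the two example strings. The function name `split_citation` and the column name `Other_info` are made up for illustration, and records that do not fit the pattern come back as a row of NA.

```r
# Example records, copied from the question
DATAVERSE <- c(
  "Hyman, Glenn Graham, 2020, \"Global Climate Regions for Cassava\", https://doi.org/10.7910/DVN/WFAMUM, Harvard Dataverse, V2",
  "Dyer, George; González, Carolina; Lopera, Diana C, 2012, \"Informal “Seed” systems and the management of gene flow in traditional agroecosystems: the case of cassava in Cauca, Colombia\", https://doi.org/10.7910/DVN/DEWGIF, Harvard Dataverse, V1"
)

# Hypothetical helper: split citation strings into five columns.
# Assumed layout: <authors>, <year>, "<title>", <url>, <other info>
split_citation <- function(x) {
  # Group 1 is lazy so it stops at the first ", <4 digits>, \"" it finds;
  # group 3 is greedy so commas inside the title stay in the title.
  pattern <- '^(.*?), (\\d{4}), "(.*)", (https?://\\S+), (.*)$'
  matches <- regmatches(x, regexec(pattern, x, perl = TRUE))
  # Each match is the full string plus 5 groups; non-matching input gives NA
  rows <- lapply(matches, function(g) {
    if (length(g) == 6) g[-1] else rep(NA_character_, 5)
  })
  out <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
  names(out) <- c("Author", "Year", "Title", "DOI", "Other_info")
  out$Year <- as.integer(out$Year)
  out
}

result <- split_citation(DATAVERSE)
print(result[, c("Author", "Year", "DOI", "Other_info")])
```

If the scraped text has the same layout, the same function could be applied to the scraped citation column. Author names contain commas themselves, so splitting on every comma would not work; anchoring on the 4-digit year and the quoted title is what keeps the author list in one cell.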
This is what I have tried so far. The idea is to get all 23 items.
I get separate columns, but the author cell is mixed with other information.
library(rvest)
library(xml2)
library(dplyr)
library(tm)
library(httr)
# Harvard Dataverse search results for "cassava", filtered to the CIAT author affiliation
website <- "https://dataverse.harvard.edu/dataverse/harvard?q=cassava&fq1=authorAffiliation_ss%3A%22International+Center+for+Tropical+Agriculture+-+CIAT%22&fq0=dvObjectType%3A%28dataverses+OR+datasets+OR+files%29&types=dataverses%3Adatasets%3Afiles&sort=score&order="
# Fetch the page once and parse it once, rather than re-parsing for every field
response <- GET(website, add_headers('user-agent' = 'Gov employment data scraper ([[your email]])'))
page <- read_html(response)
Title <- character()
Fecha <- character()
link <- character()
Autores <- character()
# Loop through the first 10 result rows
for (i in 1:10) {
  # XPath prefix shared by all fields of row i
  row <- paste0('//*[@id="resultsTable"]/tbody/tr[', i, ']/td/div')
  Title[i] <- page %>%
    html_nodes(xpath = paste0(row, '/div[1]')) %>%
    html_text(trim = TRUE)
  Fecha[i] <- page %>%
    html_nodes(xpath = paste0(row, '/span[1]')) %>%
    html_text(trim = TRUE)
  link[i] <- page %>%
    html_nodes(xpath = paste0(row, '/div[3]/a')) %>%
    html_text(trim = TRUE)
  # div[3] appears to return the whole citation text, so the author comes mixed with the other fields
  Autores[i] <- page %>%
    html_nodes(xpath = paste0(row, '/div[3]')) %>%
    html_text(trim = TRUE)
}
pag1 <- data.frame(Title, Fecha, link, Autores)