Clean author name in a string

M_AcostaCH · October 25, 2022, 6:54pm

Hi community

I obtained this data by web scraping. I want to select only the author name, so that I can make columns like Author, Year, Title, DOI, Other info.

This is an example from a larger data set with the same structure.

I don't know how to do it; maybe with gsub?

DATAVERSE <-c("Hyman, Glenn Graham, 2020, \"Global Climate Regions for Cassava\", https://doi.org/10.7910/DVN/WFAMUM, Harvard Dataverse, V2", 
"Dyer, George; González, Carolina; Lopera, Diana C, 2012, \"Informal “Seed” systems and the management of gene flow in traditional agroecosystems: the case of cassava in Cauca, Colombia\", https://doi.org/10.7910/DVN/DEWGIF, Harvard Dataverse, V1"
)

Author	Year	Title	DOI	Other info
Hyman, Glenn Graham	2020	Global Climate Regions for Cassava	https://doi.org/10.7910/DVN/WFAMUM	Harvard Dataverse, V2
Dyer, George; González, Carolina; Lopera, Diana C	2012	Informal “Seed” systems and the management of gene flow in traditional agroecosystems: the case of cassava in Cauca, Colombia	https://doi.org/10.7910/DVN/DEWGIF	Harvard Dataverse, V1

Thanks!

This is what I have so far. The idea is to obtain all 23 items.
I get separate columns, but the author cell is mixed with other information.

library(rvest)
library(xml2)
library(dplyr)
library(tm)
library(httr)

website <-"https://dataverse.harvard.edu/dataverse/harvard?q=cassava&fq1=authorAffiliation_ss%3A%22International+Center+for+Tropical+Agriculture+-+CIAT%22&fq0=dvObjectType%3A%28dataverses+OR+datasets+OR+files%29&types=dataverses%3Adatasets%3Afiles&sort=score&order="

# Fetch the page once, identifying the scraper in the user-agent header
response <- GET(website, add_headers('user-agent' = 'Gov employment data scraper ([[your email]])'))

# Parse the HTML once instead of re-parsing it for every field
page <- read_html(response)

n_rows  <- 10  # number of result rows to read, as in the original loop
Title   <- character(n_rows)
Fecha   <- character(n_rows)
link    <- character(n_rows)
Autores <- character(n_rows)

# Text of the first node matching `path` inside result row i.
# html_element() (rvest >= 1.0.0) yields NA when the node is missing,
# so a short or incomplete row does not stop the loop with an error.
row_text <- function(i, path) {
  page %>%
    html_element(xpath = paste0('//*[@id="resultsTable"]/tbody/tr[', i, ']', path)) %>%
    html_text(trim = TRUE)
}

# Loop through the result rows
for (i in seq_len(n_rows)) {
  Title[i]   <- row_text(i, '/td/div/div[1]')
  Fecha[i]   <- row_text(i, '/td/div/span[1]')
  link[i]    <- row_text(i, '/td/div/div[3]/a')
  Autores[i] <- row_text(i, '/td/div/div[3]')
}

pag1 <- data.frame(Title, Fecha, link, Autores)

HanOostdijk · October 25, 2022, 7:11pm

Hello @M_AcostaCH ,

I wonder if this is the best one can do with scraping. Can you tell us the source of DATAVERSE (on the web)?

andresrcs · October 25, 2022, 7:20pm

The stringr package is very handy for dealing with strings. You can do something like this:

library(stringr)

DATAVERSE <-c("Hyman, Glenn Graham, 2020, \"Global Climate Regions for Cassava\", https://doi.org/10.7910/DVN/WFAMUM, Harvard Dataverse, V2", 
              "Dyer, George; González, Carolina; Lopera, Diana C, 2012, \"Informal “Seed” systems and the management of gene flow in traditional agroecosystems: the case of cassava in Cauca, Colombia\", https://doi.org/10.7910/DVN/DEWGIF, Harvard Dataverse, V1"
)

str_extract(DATAVERSE, "^.+?(?=,\\s\\d{4})")
#> [1] "Hyman, Glenn Graham"                              
#> [2] "Dyer, George; González, Carolina; Lopera, Diana C"

Created on 2022-10-25 with reprex v2.0.2

You might need to refine the regular expression to match your needs, though.
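
For context, the pattern in str_extract() lazily matches from the start of the string up to, but not including, the first comma that is followed by whitespace and a four-digit year. Since the question mentions gsub, a base R sketch of the same idea could look like the block below. It assumes every entry has a four-digit year directly after the author list and that authors are separated by semicolons, as in the two sample strings; the names `authors` and `author_list` are only illustrative.

```r
DATAVERSE <- c(
  "Hyman, Glenn Graham, 2020, \"Global Climate Regions for Cassava\", https://doi.org/10.7910/DVN/WFAMUM, Harvard Dataverse, V2",
  "Dyer, George; González, Carolina; Lopera, Diana C, 2012, \"Informal “Seed” systems and the management of gene flow in traditional agroecosystems: the case of cassava in Cauca, Colombia\", https://doi.org/10.7910/DVN/DEWGIF, Harvard Dataverse, V1"
)

# Delete everything from the first ", <four-digit year>" to the end of the
# string, which leaves only the author part (assumes the year is always there)
authors <- sub(",\\s\\d{4}.*$", "", DATAVERSE, perl = TRUE)

# Optionally split each author string into individual authors on ";"
author_list <- strsplit(authors, ";\\s*", perl = TRUE)

print(authors)
print(lengths(author_list))
```

sub() is enough here because only the first match per string has to be replaced; gsub() would give the same result, since the match already runs to the end of the string.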

M_AcostaCH · October 25, 2022, 7:52pm

That answers my question, because I get the other information through web scraping.
I need to learn more about cleaning these strings.
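
For readers with the same goal, the five columns from the original question (Author, Year, Title, DOI, Other info) can probably be split in one pass with stringr::str_match() and capture groups. This is a minimal sketch, not a general citation parser: it assumes each entry follows the layout of the two sample strings (authors, a four-digit year, a title in straight double quotes, a URL without commas, then the remaining text). The names `pattern`, `parts`, and `citations` are illustrative.

```r
library(stringr)

DATAVERSE <- c(
  "Hyman, Glenn Graham, 2020, \"Global Climate Regions for Cassava\", https://doi.org/10.7910/DVN/WFAMUM, Harvard Dataverse, V2",
  "Dyer, George; González, Carolina; Lopera, Diana C, 2012, \"Informal “Seed” systems and the management of gene flow in traditional agroecosystems: the case of cassava in Cauca, Colombia\", https://doi.org/10.7910/DVN/DEWGIF, Harvard Dataverse, V1"
)

# One capture group per target column:
#   1 authors (lazy, stops at the first ", <year>")
#   2 four-digit year
#   3 title between straight double quotes
#   4 URL up to the next comma or whitespace
#   5 everything that is left
pattern <- '^(.+?),\\s(\\d{4}),\\s"(.+)",\\s(https?://[^,\\s]+),\\s(.+)$'

# str_match() returns a character matrix: column 1 is the full match,
# columns 2 to 6 are the capture groups (NA where an entry does not match)
parts <- str_match(DATAVERSE, pattern)

citations <- data.frame(
  Author     = parts[, 2],
  Year       = as.integer(parts[, 3]),
  Title      = parts[, 4],
  DOI        = parts[, 5],
  Other_info = parts[, 6],
  stringsAsFactors = FALSE
)

print(citations[, c("Author", "Year", "DOI")])
```

Entries that deviate from the assumed layout come back as NA in every column, which makes them easy to spot and handle separately.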
