regex - function to extract titles from journal references

Hi Folks,

I have a data frame with bibliographical references from academic journals. I would like to extract the year, the surname of the first author, and the title from each reference, and I am trying to write three functions to do that. I managed to extract the year and the author, but I am struggling with the titles because there are many different patterns. The title always comes after the authors, of which there can be one, two, or three. So far I can only extract titles that are enclosed in quotes, and only a few references follow that pattern.

Any suggestions? Thanks in advance (and sorry about my English :/)

remotes::install_github("meirelesff/rscielo")
library(rscielo)
library(tidyverse)

refs <- get_article_references("http://www.scielo.br/scielo.php?script=sci_arttext&pid=S0104-62762020000100001&lng=en&nrm=iso&tlng=en")

Functions

extract_year <- function(string){
  str_remove(str_trim(str_extract(string, "\\d{4} ?a?\\.? ?$")), "\\.")
}

extract_first_author <- function(string){
  str_remove(str_extract(string, "^(.+?),"), ",")
}

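extract_art_title <- function(string){
  # Accept straight quotes (as in the dput() output) as well as curly quotes;
  # the returned match still includes the surrounding quotation marks
  str_extract(string, "[\"“][^\"”]+[\"”]")
}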
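As a quick sanity check, the two helpers that already work can be run on a single reference. The snippet below is only an illustration: it repeats the two function definitions so it runs on its own, and `ref` is a hypothetical variable name holding one of the sample references from the question.

```r
library(stringr)

# Same definitions as above, repeated so the snippet is self-contained
extract_year <- function(string) {
  str_remove(str_trim(str_extract(string, "\\d{4} ?a?\\.? ?$")), "\\.")
}

extract_first_author <- function(string) {
  str_remove(str_extract(string, "^(.+?),"), ",")
}

# One of the sample references from the question
ref <- "Bartolini, S.; Mair, P. Identity, competition and electoral availability: the stabilization of European electorates, 1885-1985. Cambridge: Cambridge University Press, 1990. "

extract_year(ref)          # "1990" (the year anchored at the end, not "1985")
extract_first_author(ref)  # "Bartolini" (everything before the first comma)
```

Because the year pattern is anchored at the end of the string with `$`, the range "1885-1985" inside the title is not picked up.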

Code

refs %>% 
  mutate(year = extract_year(references),
         first_author = extract_first_author(references),
         title = extract_art_title(references)) %>%
  # filter(!is.na(title)) %>% # enable the filter to see the result of the function that I was able to create
  glimpse()

Edit

For those who don't have the rscielo package, the references look like this:

dput(refs$references %>% head())
c("Archer, J. C.; Taylor, P. J. Section and party: a political geography of American presidential elections, from Andrew Jackson to Ronald Reagan. Chichester: Wiley, 1981. ", 
"Bartolini, S.; Mair, P. Identity, competition and electoral availability: the stabilization of European electorates, 1885-1985. Cambridge: Cambridge University Press, 1990. ", 
"Blondel, J. The discipline of politics. London & Boston: Butterworths, 1981. ", 
"Bohn, S. R. \"Social policy and vote in Brazil: Bolsa Família and the shifts in Lula's electoral base\". Latin American Research Review, vol. 46, nº 1, p. 54-69, 2011. ", 
"Campbell, A. A classification of elections. In: Campbell, A., et al. (eds.). Elections and the political order. New York: Wiley, 1966. ", 
"Campbell, A., et al. The American voter. New York: Wiley, 1960. "
)

I was only able to extract the titles that are between quotes " ". I have tried other patterns, but without success.

Can you post a sample of the references or titles that you are working with? I do not have the rscielo package.

Yes, I have edited my question to add a vector with the first 6 observations of the references column, for those who don't have the package. They look like this. The title always comes after the authors (there can be one, two, or more) and may or may not start with a quote...

[1] "Archer, J. C.; Taylor, P. J. Section and party: a political geography of American presidential elections, from Andrew Jackson to Ronald Reagan. Chichester: Wiley, 1981. "
[2] "Bartolini, S.; Mair, P. Identity, competition and electoral availability: the stabilization of European electorates, 1885-1985. Cambridge: Cambridge University Press, 1990. "
[3] "Blondel, J. The discipline of politics. London & Boston: Butterworths, 1981. "
[4] "Bohn, S. R. "Social policy and vote in Brazil: Bolsa Família and the shifts in Lula's electoral base". Latin American Research Review, vol. 46, nº 1, p. 54-69, 2011. "
[5] "Campbell, A. A classification of elections. In: Campbell, A., et al. (eds.). Elections and the political order. New York: Wiley, 1966. "
[6] "Campbell, A., et al. The American voter. New York: Wiley, 1960. "
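One possible approach, offered as a rough sketch rather than a definitive answer: if a quoted title is present, take it; otherwise strip the leading author block and keep everything up to the first period followed by a space. The names `refs_sample`, `author`, `author_block`, and `extract_title_sketch` are hypothetical, and the author pattern assumes every author is written as "Surname, I. I.", with authors separated by ";" and optionally followed by "et al.". It has only been checked against the six sample references above.

```r
library(stringr)

# The six sample references from the question (copied from the dput() output)
refs_sample <- c(
  "Archer, J. C.; Taylor, P. J. Section and party: a political geography of American presidential elections, from Andrew Jackson to Ronald Reagan. Chichester: Wiley, 1981. ",
  "Bartolini, S.; Mair, P. Identity, competition and electoral availability: the stabilization of European electorates, 1885-1985. Cambridge: Cambridge University Press, 1990. ",
  "Blondel, J. The discipline of politics. London & Boston: Butterworths, 1981. ",
  "Bohn, S. R. \"Social policy and vote in Brazil: Bolsa Família and the shifts in Lula's electoral base\". Latin American Research Review, vol. 46, nº 1, p. 54-69, 2011. ",
  "Campbell, A. A classification of elections. In: Campbell, A., et al. (eds.). Elections and the political order. New York: Wiley, 1966. ",
  "Campbell, A., et al. The American voter. New York: Wiley, 1960. "
)

# ASSUMPTION: one author = surname word(s), a comma, then one or more
# initials, each an uppercase letter followed by a period
author <- "[\\p{L}'\\-]+(?:\\s[\\p{L}'\\-]+)*,\\s(?:\\p{Lu}\\.\\s?)+"

# ASSUMPTION: authors are separated by ";" and may end with ", et al."
author_block <- paste0("^", author, "(?:;\\s*", author, ")*(?:,?\\s*et al\\.)?\\s*")

extract_title_sketch <- function(string) {
  # Case 1: title wrapped in straight or curly quotes (\u201c and \u201d)
  quoted <- str_match(string, "[\"\u201c](.+?)[\"\u201d]")[, 2]

  # Case 2: drop the author block, then keep the text before the first ". "
  rest <- str_remove(string, author_block)
  unquoted <- str_trim(str_extract(rest, "^.+?(?=\\.\\s)"))

  # Prefer the quoted title when there is one
  ifelse(is.na(quoted), unquoted, quoted)
}

titles <- extract_title_sketch(refs_sample)
titles
```

Known limitations of this sketch: a title that itself contains a period followed by a space (for example an abbreviation) will be cut short, and references whose author is an institution or is formatted differently will not match the author pattern, so those rows would need to be checked by hand.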
