Creating new variables from long texts using vectors or dictionaries

AgusArgentina · September 27, 2022, 8:44pm

Hi everybody!
I am working with long texts from newspapers and I want
to create new variables that code the topics of the news.
For example, whether the title refers to a labor issue or an educational issue.
I want to code every news item with an 'issue' variable containing 'labor' or 'education' as categories.

The reprex:

library(tidyverse)
news_DF <- tibble(newspaper=c('New York Times', 'Washington Post', 'The Times', 'The Times'),
                  title=c('Workers are striking all over the world',
                          'Workers are not striking in March 2009',
                          'The scholarship students in America are not well paid',
                          'The US employees are not part of the working class'))

The words referring to the 'labor' type of issue can be:
labor_vector <- c('workers', 'teachers', 'employees', 'unions', 'AFL-CIO')

How can I do that without writing out every single element of a long list of words
(as in the code below), using vectors like 'labor_vector' instead?

news_DF2 <- news_DF %>%
  mutate(issue = case_when(
    str_detect(title, 'Workers') ~ 'labor',
    str_detect(title, 'employees') ~ 'labor',
    str_detect(title, 'students') ~ 'education'))

nirgrahamuk · September 28, 2022, 10:06am

Here is an example of a sort of function factory, in this case for labor. The same approach could be repeated for a few other categories; if there are too many categories, it may even be possible to write a function factory factory.

library(tidyverse)
news_DF <- tibble(newspaper=c('New York Times', 'Washington Post', 'The Times', 'The Times'),
                  title=c('Workers are striking all over the world',
                          'Workers are not striking in March 2009',
                          'The scholarship students in America are not well paid',
                          'The US employees are not part of the working class'))

labor_vector <- c('workers', 'teachers', 'employees', 'unions', 'AFL-CIO')
labor_funcs <- map(labor_vector,
                     ~function(x)str_detect(tolower(x),
                                     pattern = tolower(.x)))
labor_eval <- function(x) {
               any(map_lgl(labor_funcs, ~ .x(x)))}
# test
labor_eval("workers")
#> [1] TRUE
labor_eval("workdrs")
#> [1] FALSE
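
The "function factory factory" idea mentioned above might look roughly like the sketch below. The helper name `make_issue_detector` and the contents of `education_vector` are assumptions made for illustration; they are not from the original posts. Like `labor_eval`, each generated function expects a single string, so it would still be used with `rowwise()`.

```r
library(tidyverse)

# Hypothetical helper (not from the thread): given a vector of keywords,
# return a function that reports whether one text contains any of them.
make_issue_detector <- function(keywords) {
  force(keywords)  # evaluate now so the closure keeps this exact vector
  function(x) {
    # fixed() treats each keyword literally; tolower() on both sides
    # makes the comparison case-insensitive
    any(str_detect(tolower(x), fixed(tolower(keywords))))
  }
}

labor_vector <- c('workers', 'teachers', 'employees', 'unions', 'AFL-CIO')
# Assumed keywords, for illustration only
education_vector <- c('students', 'scholarship', 'schools')

labor_eval2 <- make_issue_detector(labor_vector)
education_eval <- make_issue_detector(education_vector)

labor_eval2("Workers are striking all over the world")
education_eval("Workers are striking all over the world")
```

With this, adding a category only requires a new keyword vector and one call to the helper.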


news_DF |> rowwise() |> 
mutate(issue=
        case_when(labor_eval(title) ~ 'labor')) |> 
ungroup()
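
As a possible alternative, here is a minimal sketch that avoids `rowwise()` by collapsing each keyword vector into one case-insensitive regular expression and keeping the categories in a named list used as a dictionary. The names `issue_dict`, `issue_patterns` and `classify_issue` are hypothetical, and the education keywords are assumed for illustration. It also assumes the keywords contain no regex metacharacters that would need escaping.

```r
library(tidyverse)

news_DF <- tibble(newspaper=c('New York Times', 'Washington Post', 'The Times', 'The Times'),
                  title=c('Workers are striking all over the world',
                          'Workers are not striking in March 2009',
                          'The scholarship students in America are not well paid',
                          'The US employees are not part of the working class'))

# Dictionary: one keyword vector per issue.
# Only the labor keywords come from the thread; the education ones are assumed.
issue_dict <- list(
  labor     = c('workers', 'teachers', 'employees', 'unions', 'AFL-CIO'),
  education = c('students', 'scholarship')
)

# One case-insensitive pattern per issue, e.g. "workers|teachers|..."
issue_patterns <- map(issue_dict,
                      ~ regex(str_c(.x, collapse = "|"), ignore_case = TRUE))

# For every title, return the first issue whose pattern matches (NA if none).
classify_issue <- function(titles) {
  issue <- rep(NA_character_, length(titles))
  for (name in names(issue_patterns)) {
    hit <- is.na(issue) & str_detect(titles, issue_patterns[[name]])
    issue[hit] <- name
  }
  issue
}

news_DF2 <- news_DF |>
  mutate(issue = classify_issue(title))
news_DF2
```

The order of the entries in `issue_dict` decides which category wins when a title matches more than one, much like the order of the conditions in `case_when()`.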

AgusArgentina · September 29, 2022, 1:01pm

Thank you so much nirgrahamuk,
I am sorry for the delayed response. Everything seems to be OK and working!

Best, Agustin
