R to assign IDs to texts

Hi R community,

As part of a research project, we have quite a large number of texts (news articles).
These do not have IDs and we need to assign one to each. Further, comparing them for duplicates would be a useful option. We would then need to export them back into text or .xls format.

Is there a clever way to do this in R? I have looked at a number of packages but have not seen one that would easily handle this task.

Many thanks for your help!
Mel

Hi @Shapiro

How are the data stored currently? Providing a snippet of the data or some fake data that have the same format would be very helpful.

Are these academic/journal articles? I have an approach in mind that I think would work well for your situation, but I want to make sure I understand it well enough before making the suggestion.

Hi Matt,

Many thanks for your quick response!

These are stored as .docx documents, downloaded from Nexis Uni. These are actually news articles but will be used for a text analysis task in the academic field.

We are not sure which method to use for the text analysis, so as a first step we would like to assign an ID to each article and exclude duplicates, then generate output of the articles in .txt or .xls format again.

Thank you for your help!

I have also uploaded an example.


I am not familiar with Nexis Uni; do they export any other formats? Can you export citation lists rather than the Word documents themselves? I know of several good R tools for filtering citation lists for duplicates.

EDIT: I did just stumble upon this vignette for the readtext package. Looks like you can read in Word files. You would still need to parse the text column to identify something unique (e.g. title).
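
For instance, here is a minimal sketch with readtext, assuming all the .docx files sit in one folder (the "articles/" path and the exact-duplicate check on the raw text are just illustrations):

library(readtext)
library(dplyr)

# read every Word file in the folder; readtext returns a data frame
# with a doc_id column (the file name) and a text column
docs <- readtext("articles/*.docx")

docs <- docs %>%
  mutate(ID = row_number())  # assign a simple sequential ID

# flag articles whose full text exactly duplicates an earlier one
duplicates <- docs %>%
  filter(duplicated(text))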

If not, docx files can definitely be a pain to work with programmatically. I might suggest using pandoc to first convert all the docx files to a plain-text format of your choosing (e.g. HTML, XML, md, etc.). This can be done en masse so that you can convert all of the documents to a new format with one command from the command-line.
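
As a rough sketch of driving that batch conversion from R (this assumes pandoc is installed and that the files live in an "articles/" folder, which is just a placeholder):

library(rmarkdown)

docx_files <- list.files("articles", pattern = "\\.docx$", full.names = TRUE)

for (f in docx_files) {
  # pandoc_convert() calls pandoc for us; the .md file is written
  # next to the input file
  pandoc_convert(input  = normalizePath(f),
                 to     = "markdown",
                 output = sub("\\.docx$", ".md", basename(f)))
}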

Once you have plain-text files you can parse the text a lot easier to check for duplicates. If you can share one or two of the .docx files I can try a few things out and see what works/doesn't work for me.

Many thanks for your help so far!
I have been quite swamped, so I have had only limited time to look into the text analysis problem.

I have found a very useful package that reads .docx files into R, assigns IDs, can check for duplicates, etc. (GitHub - JBGruber/LexisNexisTools: 📰 Working with newspaper data from 'LexisNexis')

So it actually performs most tasks that we need. Somehow I struggle, though, to generate straightforward text output after running the package (it generates LNT output in three formats: meta, articles and paragraphs). The package itself allows transforming the output to:

rDNA_docs
quanteda_corpus
tCorpus
tidy
Corpus
dbloc

Is one of these easy to convert into plain text?

Thanks again for your help!

Are you planning to do your text analysis in R? It appears that the lnt_convert() function prepares the data for use by other packages, such as tidytext, to do text analysis. So if you are planning to use one of these packages you may just want to keep the data in R.

For writing the data to plain-text, how do you want the data structured? For example, if you have 10 articles, do you want to save 10 files, one for each article?

Here is one approach to save each article to a plain-text file, where each file is named author_date.txt. Let me know what you think...

library(LexisNexisTools)
library(dplyr)

# copy the package's sample file into the working directory
lnt_sample()

# read the LexisNexis file; lnt_read() returns an object with
# @meta, @articles and @paragraphs slots
x <- lnt_read('sample.TXT')

meta_df <- x@meta
articles_df <- x@articles

# join the metadata to the article text by the ID that lnt_read() assigned
meta_articles_df <- 
  meta_df %>%
  right_join(articles_df, by = "ID")

# write one .txt file per article, named author_date.txt
save_list <- list(author = meta_articles_df$Author,
                  date = meta_articles_df$Date,
                  article = meta_articles_df$Article)

purrr::pwalk(save_list, ~writeLines(text = ..3,
                                    con = glue::glue('{..1}_{..2}.txt')))

Nice, your code works pretty well!

Well, we have downloaded around 100,000 articles from Nexis; it only allows saving 100 per file, so we have quite a lot of files. We want to reduce these and will probably have around 20 categories they will fit into. So in the end we want to have 20 text files.

It is not entirely clear yet whether the analysis will be done in R or in Python, so as a first step we want to format the articles uniformly and assign them IDs. (Historically, one of the researchers in our group has even used SAS for a similar task and would like the .txt files as well.) This allows us to keep more flexibility.

The text structure should be as follows (this is an example; all of this data, except the ID in the second row and the two * lines after the end of the article, is included in the original article):

Chrysler will launch ads for the Dodge Caravan that target Asians
3189301
Title: Chrysler steps up Asian-language ads in Western markets
Source: Automotive News, 76 : 8, December 31, 2001. ISSN: 0005-1551
Publisher: Crain Communications Inc.
Document Type: Journal
Record Type: Fulltext Word Count: 317
Publication Country: United States, Language: English
Text:
By: Julie Cantwell
The Chrysler group is getting ready to ring in the new year in California.
In celebration of the Chinese New Year in February, Chrysler will introduce the
second part of its advertising campaign to Asians, adding the San Francisco
area.
The campaign began this fall in Los Angeles with a 30-second TV spot and
newspaper ads promoting the Jeep Liberty in Cantonese and Mandarin. On Feb. 4,
the Chrysler group will add the Dodge Caravan to the campaign, with similar
tactics.
For 2002, the Chrysler group will increase marketing spending to reach Asians
and Hispanics in the West, said Steve Shugg, director of the Chrysler group's
West business center. He would not provide details.
It's more of a brand-building thing,'' Shugg said. The intent is not to sell
cars; it's to provide Hispanic and Asian communities with an experience in a
nonconfrontational way.''
The Chrysler group has struggled in the West, where imports dominate in car
sales and are gaining in trucks. Its 2001 sales in California through September
were up 1 percent over the same period of last year. September sales were 12.2
percent below the figure for September 2000.
To gain strength in the West, Chrysler group marketers realize they must connect
with Asians and Hispanics on the West Coast.
``This will give us a better understanding of Asian-American consumers and
position the company to gain share in this important market,'' said Jeff Bell,
Chrysler group vice president of marketing communications.
Chrysler is trying to reach California's Asians through events as well. In
November, Chrysler for the first time offered a couple of its longstanding
events, Jeep 101 and Chrysler Proving Grounds, in Spanish and Asian languages.
The next Asian-language Jeep 101 is scheduled for Feb. 3 and 4 at the Asian
American Expo in Pomona, Calif.
Imada Wong Communications Group Inc. in Los Angeles is handling the Asian
campaign. Chrysler considers the campaign a pilot program for future attempts to
reach Asians.
Copyright 2001 Crain Communications Inc.
*
Brand Names: Dodge Caravan; Jeep Liberty
Company Names: CHRYSLER CORP
Concept Terms: All market information; Asian American market; Hispanic market;
Marketing campaign
Geographic Area: North America(NOAX); United States(USA)
Industry Names: Automotive
Marketing Terms: All campaign; All media; All product marketing; New campaign;
Newspaper advertising; Positioning-repositioning; TV advertising
Product Names: Passenger cars(371100); Minivans(371188); Sports utility
vehicles, 4-wheel drive(371177)
*
*

I think I have contributed as much as I can. I don't really work with these kinds of data much, so I can keep sharing code, but ultimately you are going to have to play around with it to get it to achieve exactly what you want.

Here is some code that does most of what it sounds like you want. If you have ~100,000 articles spread across 1,000 files, each containing 100 articles, this code will loop over every file (provided they are saved in a common directory), read them into R, and stack them into a single data frame.

library(LexisNexisTools)
library(dplyr)

# copy the package's sample file into a temporary directory (for this example)
lnt_sample(path = tempdir())

# Replace the path with the path pointing to your Nexis files
files_to_read <- fs::dir_ls(path = tempdir(), 
                            type = 'file',
                            glob = '*.TXT')

# read one file and join its metadata to the article text
join_fun <- function(file) {
  file_set <- lnt_read(file)
  
  file_set@meta %>% 
    inner_join(file_set@articles, by = 'ID')
}

# apply join_fun to every file and stack the results into one data frame
data <- purrr::map_dfr(files_to_read, join_fun)

Next, now that we have all of our metadata and article content in a single data frame, we can specify how we would like the content to be structured in the output file using glue::glue(). I did this quickly based on your post above, but you will need to add more to make the output look the way you want.

I only have 10 articles to play with, but I show how to export the data into files in batches of 5. These batches could be replaced with your ~20 categories. Anyway, we end up with two files, each with 5 articles, with the general structure you described above.

library(tidyr)

new_data <- 
  data %>% 
  # build the text block for each article, roughly matching your template
  mutate(to_save = glue::glue('Title: {Headline}', 
                              'Source: {Section} {Date}',
                              'Publisher: {Newspaper}',
                              'By: {Author}',
                              '{Article}',
                              '*',
                              '*',
                              .sep = '\n')) %>% 
  # split the 10 sample articles into two batches of 5
  mutate(batch = rep(c(1, 2), each = 5)) %>% 
  group_by(batch) %>% 
  nest()

# write one .txt file per batch
purrr::walk2(new_data$batch, new_data$data,
             ~writeLines(..2 %>% pull(to_save),
                         glue::glue('~/Desktop/file{..1}.txt')))
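
If each article already has (or can be given) a category, you could group on that instead of the artificial batch numbers. A rough sketch, assuming a hypothetical Category column you have added to the data frame:

# hypothetical: "Category" is a column you have added to `data`
new_data <- 
  data %>% 
  mutate(to_save = glue::glue('Title: {Headline}',
                              '{Article}',
                              '*',
                              '*',
                              .sep = '\n')) %>% 
  group_by(Category) %>% 
  nest()

# one .txt file per category
purrr::walk2(new_data$Category, new_data$data,
             ~writeLines(..2 %>% pull(to_save),
                         glue::glue('~/Desktop/{..1}.txt')))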
