I think I have contributed as much as I can. I don't really work with these kinds of data much, so I can keep sharing code, but ultimately you are going to have to play around with it to get it to do exactly what you want.
Here is some code that does most of what it sounds like you want. If you have ~100,000 articles spread across 1,000 files, each containing 100 articles, this code will loop over every file (assuming they are saved in a common directory), read them into R, and stack them into a single data frame.
library(LexisNexisTools)
library(dplyr)
library(tidyr)

# Writes a sample Nexis file to tempdir() so this example runs end to end
lnt_sample(path = tempdir())

# Replace tempdir() with the path pointing to your Nexis files
files_to_read <- fs::dir_ls(path = tempdir(),
                            type = 'file',
                            glob = '*.TXT')

# Read one file and join its metadata to the article text by article ID
join_fun <- function(file) {
  file_set <- lnt_read(file)
  file_set@meta %>%
    inner_join(file_set@articles, by = 'ID')
}

# Read every file and stack the results into a single data frame
data <- purrr::map_dfr(files_to_read, join_fun)
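One note for ~1,000 real files: purrr::map_dfr() stops at the first file that fails to parse. If that turns out to be a problem, you could wrap the reader in purrr::possibly() so unreadable files are skipped; this is an optional guard I'm adding, not something you strictly need:
# Optional guard: return NULL (which map_dfr drops) instead of erroring
safe_join <- purrr::possibly(join_fun, otherwise = NULL)
data <- purrr::map_dfr(files_to_read, safe_join)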
Next, now that we have all of our metadata and article content in a single data frame, we can specify how the content should be structured in the output file using glue::glue(). I did this quickly based on your post above, but you will need to adjust it to make the output look the way you want.
I only have 10 articles to play with, but I show how to export the data into files in batches of 5; these batches could be replaced with your ~20 categories (see the sketch after the code below). Either way, we end up with two files, each containing 5 articles, with the general structure you described above.
new_data <-
  data %>%
  # Build the text block for each article in the layout you described
  mutate(to_save = glue::glue('Title: {Headline}',
                              'Source: {Section} {Date}',
                              'Publisher: {Newspaper}',
                              'By: {Author}',
                              '{Article}',
                              '*',
                              '*',
                              .sep = '\n')) %>%
  # Assign the 10 sample articles to two batches of 5
  mutate(batch = rep(c(1, 2), each = 5)) %>%
  group_by(batch) %>%
  nest()

# Write one .txt file per batch
purrr::walk2(new_data$batch, new_data$data,
             ~writeLines(..2 %>% pull(to_save),
                         glue::glue('~/Desktop/file{..1}.txt')))
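To swap the batches for your categories, here is a minimal sketch. It assumes a hypothetical category column (holding one of your ~20 labels) that you would need to create yourself first; the glue::glue() call is the same one as above:
by_category <-
  data %>%
  mutate(to_save = glue::glue('Title: {Headline}',
                              'Source: {Section} {Date}',
                              'Publisher: {Newspaper}',
                              'By: {Author}',
                              '{Article}',
                              '*',
                              '*',
                              .sep = '\n')) %>%
  group_by(category) %>%   # 'category' is a hypothetical column you create
  nest()

# One file per category, named after the category value
purrr::walk2(by_category$category, by_category$data,
             ~writeLines(..2 %>% pull(to_save),
                         glue::glue('~/Desktop/{..1}.txt')))
With that, each of your ~20 categories gets its own output file instead of the arbitrary batches of 5.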