I have a dataset containing multiple .txt documents. How would I restructure it into the one-token-per-row format using unnest_tokens()?

Hey guys, I'm doing a project where I analyse Trump's speeches (text analysis).

My code looks like this:

# Read the files in
# lapply() returns a list the same length as txt_files_ls
# Create a data frame from each file with read.table()
# Set header = FALSE because the column name is added later
# sep = "\t" means the data is tab-delimited
# quote = "" stops apostrophes in the speeches being treated as quote characters
# read.table("file.txt", header = TRUE/FALSE, sep = "\t") is an alternative to read.delim()
txt_files_df_list <- lapply(txt_files_ls, function(x) {
  read.table(file = x, header = FALSE, sep = "\t", quote = "", stringsAsFactors = FALSE)
})

# Combine them and set the column name to "Speech" using the setNames() function
# do.call() constructs and executes a function call from a name or function, in this case "rbind"
combined_df <- setNames(do.call("rbind", txt_files_df_list), "Speech")

# Create an R object for the locations of speeches, listing them in the same order as they were inputted into the list 
location <- c("Bemidji", "Fayetteville", "Freeland", "Henderson", "Latrobe", "Minden", "Mosinee", "Ohio", "Pittsburgh", "Winston-Salem" )

# Use mutate() from the dplyr package to add the locations as a new column, creating a new data frame
combined_df_2 <- mutate(combined_df, Location = location)

# Create an R object for the dates of the speeches extracted from the file titles, place them in the same order as they were inputted into the list 
date <- c("2020-09-18", "2020-09-19", "2020-09-10", "2020-09-13", "2020-09-03", "2020-09-12", "2020-09-17", "2020-09-21", "2020-09-22", "2020-09-08")

# Convert the strings to Date values with as_date()
# The format must be passed by name, because the second positional argument of as_date() is tz
date_2 <- lubridate::as_date(date, format = "%Y-%m-%d")

# Use mutate() again to add the converted dates as a new column
combined_df_3 <- mutate(combined_df_2, Date = date_2)

# Check the structure of the combined dataset: Speech and Location should be character, Date should be Date
str(combined_df_3)

My question is: how would I break the text into individual tokens and transform it into a tidy data structure?
How would I tokenize the speeches, splitting each sentence into separate words?

When I try to do it myself with the code:

test_df <- combined_df_3 %>% 
  unnest_tokens(word, combined_df_3$Speech) 

I get the error :

Any guidance would be appreciated!
Also, is there a way to shorten my original code, so that the location and date are extracted from each file name and placed in their own columns alongside the file content (Speech)? That would also be helpful!


I think you only need to supply the bare column name as the input argument to unnest_tokens(), not combined_df_3$Speech. So this should work:
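
Here is a minimal sketch of that fix, using a small stand-in for combined_df_3 (the two rows below are invented for illustration and are not taken from the speeches). unnest_tokens() takes the name of the output column first and the name of the input column second, both unquoted:

```r
library(dplyr)
library(tidytext)

# Stand-in for combined_df_3; these rows are made up for illustration
combined_df_3 <- tibble(
  Speech   = c("Hello Bemidji, thank you", "Thank you very much"),
  Location = c("Bemidji", "Fayetteville"),
  Date     = as.Date(c("2020-09-18", "2020-09-19"))
)

# Pass the bare column name (Speech), not combined_df_3$Speech
test_df <- combined_df_3 %>%
  unnest_tokens(word, Speech)

test_df
```

With the default settings, each row of test_df holds one lower-cased word, the Location and Date columns are carried along, and the original Speech column is dropped.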


Does that help?
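
On the second part of the question, here is one possible way to shorten the set-up. It is only a sketch, and it assumes the files are named like "Location_YYYY-MM-DD.txt"; that naming scheme, the read_speech() helper, and the two temporary example files are my own assumptions, so the splitting step would need adjusting to match the real file names.

```r
library(dplyr)
library(tidytext)

# Assumption: file names look like "<Location>_<YYYY-MM-DD>.txt".
# Two throwaway example files are written so the sketch runs on its own.
tmp_dir <- tempfile("speeches_")
dir.create(tmp_dir)
writeLines("Hello Bemidji", file.path(tmp_dir, "Bemidji_2020-09-18.txt"))
writeLines(c("Thank you", "Fayetteville"),
           file.path(tmp_dir, "Fayetteville_2020-09-19.txt"))

txt_files_ls <- list.files(tmp_dir, pattern = "\\.txt$", full.names = TRUE)

# Hypothetical helper: one row per file, with the location and date
# parsed from the file name instead of being typed in by hand
read_speech <- function(path) {
  stem  <- tools::file_path_sans_ext(basename(path))
  parts <- strsplit(stem, "_", fixed = TRUE)[[1]]
  tibble(
    Speech   = paste(readLines(path, warn = FALSE), collapse = " "),
    Location = parts[1],
    Date     = as.Date(parts[2], format = "%Y-%m-%d")
  )
}

# Read every file, stack the rows, then tokenize into one word per row
speeches_df <- bind_rows(lapply(txt_files_ls, read_speech))

tidy_speeches <- speeches_df %>%
  unnest_tokens(word, Speech)
```

Because the location and date come from each file's own name, they cannot get out of step with the order in which the files were read.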


I can’t believe I missed out on something so simple. I’ve been trying to figure it out for hours. Thank you!