Text Mining: exclude specific phrases from analysis and split text into sections


I'm working on around 160 separate responses to a survey to carry out simple analysis for an initial summary report, including a wordcloud. All responses have been submitted either in word or PDF format. I have combined into a text file (simply copy and paste into word from the word and PDF documents - I've since found code to merge the PDF documents) and ran R code to produce, after cleaning, a simple wordcloud and some sensitivity analysis.

However, the analysis includes all the text from the template which the respondents were asked to complete, such as the introductory text, instructions, name, company, address, along with all the section headings and questions presented.

As all of this template text will be repeated for each of the 160 responses, it is skewing the frequency of words in the responses.

Q1. Is there a method to exclude not just single words (as per stopwords or creating mystopwords) but full sentances or phrases from the analysis so all text in the template can be ignored and not included as part of the set of responses?

Q2. Also is there a way of segmenting the responses into sections as per the topic headings within the template. Example the structure of the template is Heading 2 - Energy, then a set of questions, then Heading 3 - Energy Transition, another set of questions then Heading 4 - Consumers and questions next. This repeats up to 12 sections. This would permit analysis within sections rather than across all topics.

Q3. Finally, I understand this depends on how I bring the file (word and PDF mix) into R but ideally I'd like to create a tibble as in this example attached, the response column lists all 160 odd response based on the file name then each column indicates the heading with each set of text responses allocated within each heading and respective response.

New enough to R so any help would be greatly appreciated.

My code to bring in the text files, after this it's cleaned up and a simple wordcloud produced.



set pathway to text files

folder<-"C:\xxxxxxxxxx\Text files"

lists all files in pathway


filters text files only

list.files(path=folder, pattern="*.txt")

set vector

filelist<-list.files(path=folder, pattern="*.txt")

assign pathways to files

paste(folder, "\", filelist)

removes separations in pathways by setting as empty

filelist<-paste(folder, "\", filelist, sep="")

apply a function to read in multiple txt files - warnings are OK

a<-lapply(filelist, FUN=readLines)

apply a function to collaspe into a single element

corpus<-lapply(a, FUN=paste, collaspe=" ")



This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.