Hello
I'm struggling to understand how to use vectors/lists/data frames, because I'm used to a language like JavaScript where I can reference key-value pairs in an object.
The first thing I'm trying to do is get a word frequency list from a text, let's say Text 1, where a word is anything separated by whitespace. So I'd expect to have something like:
Word     Freq
the      4
cat      2
follows  1
And so on. I thought I could achieve this with a list(), where each word would be the names(), but this didn't work because I couldn't insert the value at the list's [key] or [[key]]. It would also be useful to have a third piece of information, which would be the rank in terms of frequency:
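For illustration, here is a minimal sketch of the key-value style of counting described above, assuming a plain R list is used like a JavaScript object. The `words` vector is made-up sample data, and `freq` and `current` are just names chosen for the example.

```r
# Illustrative tokens only; not taken from a real text.
words <- c("the", "cat", "follows", "the", "cat", "the", "the")

freq <- list()  # empty list used like a key-value store
for (w in words) {
  # Looking up a missing key in a list with [[ ]] returns NULL,
  # so the count starts at 1 the first time a word is seen.
  current <- freq[[w]]
  freq[[w]] <- if (is.null(current)) 1 else current + 1
}

freq[["the"]]   # look up one count by key
names(freq)     # the keys, in the order they were first inserted
```

This works, but it loops over every token; `table()` does the same counting in one call, which is probably the more idiomatic route.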
(A) Text 1 Frequency List
Rank  Word     Freq
1     the      4
2     cat      2
3     follows  1
But I figure the rank should be available without listing it manually. So to achieve this, do I need something like a data frame? It seems overly complicated for something so simple.
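As a rough sketch of what table (A) could look like as a data frame, assuming the rank is simply the position after sorting by frequency (so tied words get consecutive ranks rather than a shared one). The `words` vector is made-up sample data and `freq_df` is a name chosen for the example.

```r
# Illustrative tokens only; not taken from a real text.
words <- c("the", "cat", "follows", "the", "cat", "the", "the")

counts <- table(words)                      # one count per distinct word
counts <- sort(counts, decreasing = TRUE)   # highest frequency first

freq_df <- data.frame(
  Rank = seq_along(counts),     # position in the sorted table
  Word = names(counts),         # the words themselves
  Freq = as.vector(counts),     # counts with the table attributes dropped
  stringsAsFactors = FALSE
)
print(freq_df)
```

A row can then be pulled out by word, for example `freq_df[freq_df$Word == "cat", ]`, which is the closest equivalent to a key lookup.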
The second issue is comparing Text 1's top words with Text 2. Since my attempts with lists didn't work, I couldn't manually make the frequency list the way I intended, so I followed a tutorial that works like this:
l <- strsplit(text, "\\W+")       # splits the text on runs of non-word characters; returns a list
l <- unlist(l)                    # flattens the list into a character vector
l <- table(l)                     # counts the occurrences of each distinct word
l <- sort(l, decreasing = TRUE)   # sorts by count, highest first
This table has two issues. Firstly, when I ran it the result came out in alphabetical word order, not in order of word frequency. Secondly, how can I compare it with a table based on Text 2?
What I want to do is find the top n most frequent terms in Text 1, and then find the frequency of those terms in Text 2. For example:
(B) Text 1 Frequencies
Rank  Word  Freq
1     the   4
2     cat   2
...
10    eats  1
(C) Text 2 Frequencies
Rank  Word  Freq
1     the   3
2     dog   1
...
44    eats  6
I want to get the frequency list for Text 1 (let's call it freq.text1), and sort by frequency to get the top N words (text1.topwords). Then I want a frequency list for Text 2 (freq.text2). Then I want a new list/vector/whatever that gives me the intersection of the top words in text1.topwords and freq.text2, so (B) and (C) would become:
(D) Text 1 top words' frequency in Text 2
Text1Rank  Word  Text2Freq
1          the   3
2          cat   0
...
10         eats  6
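To make the goal for (D) concrete, here is a minimal sketch under a few assumptions: the two texts are short made-up strings, `word_freq` is a hypothetical helper (not a built-in), and lowercasing the text before counting is my own assumption. Words from Text 1 that never occur in Text 2 are given a frequency of 0.

```r
# Hypothetical helper: lowercase, split on non-word characters, count, sort.
word_freq <- function(text) {
  tokens <- unlist(strsplit(tolower(text), "\\W+"))
  tokens <- tokens[tokens != ""]   # drop the empty string a leading separator produces
  sort(table(tokens), decreasing = TRUE)
}

# Made-up sample texts.
text1 <- "the cat follows the cat and the dog eats the fish"
text2 <- "the dog eats and eats and the dog eats"

freq.text1 <- word_freq(text1)
freq.text2 <- word_freq(text2)

n <- 2   # how many top words to keep from Text 1
text1.topwords <- names(freq.text1)[seq_len(min(n, length(freq.text1)))]

# match() gives the position of each top word in Text 2, or NA if absent.
idx <- match(text1.topwords, names(freq.text2))
text2.freq <- as.vector(freq.text2)[idx]
text2.freq[is.na(text2.freq)] <- 0   # absent words count as 0

result <- data.frame(
  Text1Rank = seq_along(text1.topwords),
  Word      = text1.topwords,
  Text2Freq = text2.freq,
  stringsAsFactors = FALSE
)
print(result)
```

With this sample data the top two words of Text 1 are "the" and "cat", and "cat" gets 0 because it does not appear in Text 2.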
Once I have something like this, I want to be able to do operations on the values, e.g. a word's frequency in Text1 / Text2 etc. So is the best way to achieve this to include everything in one table? For example:
Word Text1Rank Text1Freq Text2Rank Text2Freq
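A sketch of the single-table option, assuming each text already has its own data frame keyed by `Word`. The values are made up, and `combined` and `Ratio` are names chosen for the example. It relies on `merge()` with `all.x = TRUE`, which keeps every row of the first table.

```r
# Illustrative frequency tables; the values are made up for the example.
text1 <- data.frame(Word = c("the", "cat", "follows"),
                    Text1Rank = 1:3,
                    Text1Freq = c(4, 2, 1),
                    stringsAsFactors = FALSE)
text2 <- data.frame(Word = c("eats", "the", "dog"),
                    Text2Rank = 1:3,
                    Text2Freq = c(6, 3, 1),
                    stringsAsFactors = FALSE)

# all.x = TRUE keeps every Text 1 word; words absent from Text 2 get NA.
combined <- merge(text1, text2, by = "Word", all.x = TRUE)
combined$Text2Freq[is.na(combined$Text2Freq)] <- 0

# merge() sorts by the key column, so restore the Text 1 ranking.
combined <- combined[order(combined$Text1Rank), ]

# Column-wise arithmetic works directly on the combined table.
combined$Ratio <- combined$Text1Freq / combined$Text2Freq
print(combined)
```

Note that dividing by a Text 2 frequency of 0 gives `Inf` in R rather than an error, so those rows may need separate handling.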
Or, should I be storing all the words as a vector, all the ranks as a vector etc, and only comparing them by taking their index? E.g.
words=c("the", "cat", "eats");
Text1Freq= … ;
Text2Freq = … ;
So then I'd have to write a function that finds the intersection of the word list vector for each text, then output a vector that takes the index of those intersected words from text 2, then input that index vector into the frequency vector for text 2… this is where I'm getting lost!
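The parallel-vector route described above can be sketched with `intersect()` and `match()`, without writing a custom function. All values here are made up for the example, and `words1`, `words2`, `shared`, `i1`, `i2` and `ratio` are names I chose.

```r
# Illustrative parallel vectors; position i in each pair refers to the same word.
words1    <- c("the", "cat", "eats")
Text1Freq <- c(4, 2, 1)
words2    <- c("the", "dog", "eats")
Text2Freq <- c(3, 1, 6)

shared <- intersect(words1, words2)   # words present in both texts
i1 <- match(shared, words1)           # positions of the shared words in Text 1
i2 <- match(shared, words2)           # positions of the shared words in Text 2

# Index each frequency vector by position, then operate element-wise.
ratio <- Text1Freq[i1] / Text2Freq[i2]
names(ratio) <- shared
ratio
```

This works, but every vector has to be kept in the same order by hand, which is the bookkeeping a data frame would do automatically.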
To summarise, what data type should I be using that will allow me to store and sort the different variables associated with a frequency list? Can I compare one with another to find the intersection between Text 1 and Text 2?