What data type for a frequency list, and how to intersect those lists?

Hello

I'm struggling to understand how to use vectors/lists/data frames, because I'm used to a language like JavaScript where I can reference key-value pairs in an object.

The first thing I'm trying to do is get a frequency list (i.e. words separated by whitespace) from a text, let's say Text 1. So I'd expect to have something like:

Word     Freq
the      4
cat      2
follows  1

And so on. I thought I could achieve this with a list(), where each word would be the names(), but this didn't work because I couldn't insert the value at the list's [key] or [[key]]. It would also be useful to have a third piece of information, which would be the rank in terms of frequency:

(A) Text 1 Frequency List

Rank  Word     Freq
1     the      4
2     cat      2
3     follows  1

But I figure this should be accessible without manually listing it. So to achieve this, do I need something like a dataframe? It seems overly complicated for something so simple.

The second issue is comparing Text 1's top words with Text 2. Since my attempts with lists didn't work, I couldn't manually make the frequency list the way I intended, so I followed a tutorial that works like this:

l = strsplit(text, "\\W+") # Splits the text into a list, separated by non-word characters
l = unlist(l) # unlists it
l = table(l) # makes a table
l = sort(l, descending=T); # sorts it

This table has two issues. Firstly, it sorts by alphabetical word order, not word frequency. Secondly, how can I compare it with a table based on Text 2?

What I want to do is find the top n most frequent terms in Text 1, and then find the frequency of those terms in Text 2. For example:

(B) Text 1 Frequencies

Rank  Word  Freq
1     the   4
2     cat   2
...
10    eats  1

(C) Text 2 Frequencies

Rank  Word  Freq
1     the   3
2     dog   1
...
44    eats  6

I want to get the frequency list for Text 1 (let's call it freq.text1), and sort by frequency to get the top N words (text1.topwords). Then I want a frequency list for Text 2 (freq.text2). Then I want a new list/vector/whatever that gives me the intersection of the top words in text1.topwords and freq.text2, so (B) and (C) would become:

(D) Text 1 top words' frequency in Text 2

Text1Rank  Word  Text2Freq
1          the   3
2          cat   0
...
10         eats  6

Once I have something like this, I want to be able to do operations on the values, e.g. a word's frequency in Text1 / Text2 etc. So is the best way to achieve this to include everything in one table? For example:

Word Text1Rank Text1Freq Text2Rank Text2Freq

Or, should I be storing all the words as a vector, all the ranks as a vector etc, and only comparing them by taking their index? E.g.

words=c("the", "cat", "eats");
Text1Freq= … ;
Text2Freq = … ;

So then I'd have to write a function that finds the intersection of the word list vector for each text, then output a vector that takes the index of those intersected words from text 2, then input that index vector into the frequency vector for text 2… this is where I'm getting lost!

To summarise, what data type should I be using that will allow me to store and sort the different variables associated with a frequency list? And can I compare one such list with another to find the intersection between Text 1 and Text 2?

Edit: Sorry, it mangled the formatting where I had tabbed the examples

I would do all your work in data frames / tibbles. That's what most of R's functionality is meant to work with. It will give you the freedom to sort, pivot, group_by, aggregate, etc.
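A note on the base R attempt in the question first: `sort()` has no `descending` argument (the argument is `decreasing`), and a `table` can be indexed by name, which is close to the key-value lookup you are used to. A minimal sketch, where `text1` and the helper `word_freq` are names made up for illustration:

```r
# Toy text, made up for this example
text1 <- "the cat follows the dog and the cat sees the bird"

# Hypothetical helper: split on non-word characters, count, sort by frequency
word_freq <- function(text) {
  words <- unlist(strsplit(tolower(text), "\\W+"))
  words <- words[words != ""]            # drop empty strings left by leading separators
  sort(table(words), decreasing = TRUE)  # note: decreasing, not descending
}

freq1 <- word_freq(text1)

freq1[["the"]]      # look up a count by word
names(freq1)[1:2]   # the two most frequent words

# A table converts to a data frame once more columns (e.g. a rank) are needed
df1 <- as.data.frame(freq1, stringsAsFactors = FALSE)
df1$rank <- seq_len(nrow(df1))
```

This is fine for a single text, but comparing two texts means lining up the words by name, which is where a data frame and a join become easier to work with.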

library(tidyverse)

split_words <- function(s) {
  # function to split a text string into words
  s %>%    
    str_remove_all("[\n.,]") %>%
    tolower() %>%
    strsplit(" ") %>%
    unlist() 
}

# one set of lyrics
lamb <- "Mary had a little lamb, Little lamb, little lamb, Mary had a little 
lamb, Its fleece was white as snow. And everywhere that Mary went, 
Mary went, Mary went, Everywhere that Mary went, The lamb was sure to 
go. It followed her to school one day, School one day, school one 
day, It followed her to school one day, That was against the rule."

tibble(word = split_words(lamb)) %>%
  group_by(word) %>%
  summarize(ct = n()) %>%
  arrange(-ct)
#> # A tibble: 27 x 2
#>    word      ct
#>    <chr>  <int>
#>  1 mary       6
#>  2 lamb       5
#>  3 day        4
#>  4 little     4
#>  5 one        4
#>  6 school     4
#>  7 went       4
#>  8 that       3
#>  9 to         3
#> 10 was        3
#> # ... with 17 more rows

# two sets
lyrics_list <- lst(
  lamb,
  wind = "After all jacks are in their boxes
    And the clowns have all gone to bed
    You can hear happiness staggering on down the street
    Footprints dressed in red
    And the wind whispers Mary
    A broom is drearily sweeping
    Up the broken pieces of yesterday's life
    Somewhere a queen is weeping
    Somewhere a king has no wife
    And the wind, it cries Mary"
)

tibble(
  src = names(lyrics_list),
  # words is a nested tibble
  words = map(lyrics_list, ~tibble(word = split_words(.))) 
) %>%
  unnest(words) %>%
  # count across source and word
  group_by(src, word) %>%
  summarize(ct = n(), .groups = "drop") %>%
  # put source into columns
  pivot_wider(names_from = src, values_from = ct) %>%
  mutate_if(is.numeric, replace_na, 0) %>%
  arrange(-lamb)
#> # A tibble: 70 x 3
#>    word    lamb  wind
#>    <chr>  <dbl> <dbl>
#>  1 mary       6     2
#>  2 lamb       5     0
#>  3 day        4     0
#>  4 little     4     0
#>  5 one        4     0
#>  6 school     4     0
#>  7 went       4     0
#>  8 that       3     0
#>  9 to         3     1
#> 10 was        3     0
#> # ... with 60 more rows

Created on 2021-06-04 by the reprex package (v1.0.0)
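To get closer to table (D) in the question (the top n words of Text 1, with their rank and their frequency in Text 2), one option is to count each text separately and join the results. This is a sketch rather than the only way to do it: `text1`, `text2`, `count_words`, `freq1`, `freq2`, and `n_top` are names made up for the illustration, and the texts are toy examples.

```r
library(dplyr)

# Toy texts, made up for this example
text1 <- "the cat follows the dog and the cat eats the bird"
text2 <- "the dog eats and eats and the dog eats the bone"

# Hypothetical helper: one row per word, with its count and frequency rank
count_words <- function(text) {
  words <- unlist(strsplit(tolower(text), "\\W+"))
  tibble(word = words[words != ""]) %>%
    count(word, name = "freq", sort = TRUE) %>%  # most frequent first
    mutate(rank = row_number())                  # ties are ranked in row order
}

freq1 <- count_words(text1)
freq2 <- count_words(text2)

n_top <- 3  # assumed cutoff for "top words"

top_in_text2 <- freq1 %>%
  slice_head(n = n_top) %>%
  # left_join keeps every top word from Text 1, matched or not
  left_join(freq2, by = "word", suffix = c("_text1", "_text2")) %>%
  mutate(
    # words absent from Text 2 get a frequency of 0 (their rank stays NA)
    freq_text2 = coalesce(freq_text2, 0L),
    # Inf where the Text 2 frequency is 0
    ratio = freq_text1 / freq_text2
  )

top_in_text2
```

Because everything sits in one data frame, operations such as the Text 1 / Text 2 ratio are ordinary column arithmetic. If tied words should share a rank, `min_rank(desc(freq))` could replace `row_number()`.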
