Find frequency of words

Can someone please help me find the frequency of words in an Excel .csv file? I want the frequencies for each column separately, except the first column, and then I want to correlate the word frequencies with the score in the first column (which ranges from 7 to 9). Here is what the data looks like:

squads <- tibble::tribble(
            ~`Q4_1__Category-_Liking_attribute`, ~Q8__1___COMMENTS, ~Q8__2___COMMENTS,         ~Q8__3___COMMENTS,   ~Q8__4___COMMENTS,         ~Q8__5___COMMENTS,
                                             7L,     "Good flavor",       "Off color",                  "Smooth",        "Whipped ok",                    "Warm",
                                             8L,          "Smooth",          "Creamy",               "Wholesome",           "Natural",                 "Organic",
                                             9L,       "Wholesome",         "Natural",               "Delicious",           "Healthy",                   "Tasty",
                                             9L,       "Different",       "Wholesome",                 "Natural",             "Tasty",                 "Organic",
                                             8L,           "Plain",        "Potatoey",                   "Tasty",           "Natural",                "Homemade",
                                             7L,            "Good",          "Chunky",               "Flavorful",         "Authentic",                   "Tasty",
                                             7L,           "Thick",           "Tasty",               "Authentic",     "Very potatoey",                    "Good",
                                             7L,          "Purple",     "Interesting",                     "Fun",         "Different",                 "Unusual",
                                             7L,           "White",             "Hot",                  "Mashed",            "Smooth",                  "Creamy",
                                             8L,        "Colorful",          "Bright",                   "Tasty", "Real potato taste", "Worthy of Sunday dinner",
                                             8L,            "Bold",       "Flavorful", "Tastes like real potato",               "Hot",                    "Good",
                                             7L,             "Hot",          "Smooth",                  "Creamy",           "Potatey",                   "White",
                                             8L,        "Colorful",     "Interesting",                "Creative",              "Bold",                 "Unusual"
            )
head(sqauds)
#> Error in head(sqauds): object 'sqauds' not found

Code:

library(tidyverse)
library(tidytext)
library(tm)
#> Loading required package: NLP
#> 
#> Attaching package: 'NLP'
#> The following object is masked from 'package:ggplot2':
#> 
#>     annotate
library(dplyr)
library(qdap)
#> Loading required package: qdapDictionaries
#> Loading required package: qdapRegex
#> 
#> Attaching package: 'qdapRegex'
#> The following object is masked from 'package:dplyr':
#> 
#>     explain
#> The following object is masked from 'package:ggplot2':
#> 
#>     %+%
#> Loading required package: qdapTools
#> 
#> Attaching package: 'qdapTools'
#> The following object is masked from 'package:dplyr':
#> 
#>     id
#> Loading required package: RColorBrewer
#> Error: package or namespace load failed for 'qdap':
#>  .onLoad failed in loadNamespace() for 'rJava', details:
#>   call: fun(libname, pkgname)
#>   error: JAVA_HOME cannot be determined from the Registry
df <- read.csv("C:/Users/cs/Downloads/review.csv", header = TRUE)
# Rename the score column and the five comment columns in one step
colnames(df)[1:6] <- c("score", "First_response", "Second_response",
                       "Third_response", "Fourth_response", "Fifth_response")

Created on 2022-04-24 by the reprex package (v2.0.1)

Here are transformations using only the tidyverse:

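# One row per word: add a respondent id, stack the comment columns,
# then split multi-word answers into separate rows
(squads_long <- mutate(squads, rn = row_number()) %>%
  pivot_longer(cols = where(is.character),
               names_to = "question", values_to = "answer") %>% 
    separate_rows(answer))

# Frequency of each word within each question
(per_q_word_frq <- group_by(squads_long, question, answer) %>%
  summarise(frq = n(), .groups = "drop"))

# Attach the frequency to every row so it sits next to the score column
(together <- left_join(squads_long, per_q_word_frq,
  by = c("question", "answer")
))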
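
One thing to watch with the code above: the counts are case-sensitive, so "Good" and "good" are counted as different words. A minimal sketch of a variant that lowercases the words before counting is below. The `demo` tibble and its column names are made up for illustration; they only mimic the shape of `squads`.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for `squads`: a score column plus comment columns
demo <- tibble::tribble(
  ~score, ~Q1,           ~Q2,
  7L,     "Good flavor", "Smooth",
  8L,     "good",        "smooth",
  9L,     "Smooth",      "Tasty"
)

demo_counts <- demo %>%
  pivot_longer(cols = where(is.character),
               names_to = "question", values_to = "answer") %>%
  separate_rows(answer) %>%             # split multi-word answers into words
  mutate(answer = tolower(answer)) %>%  # treat "Good" and "good" as one word
  count(question, answer, name = "frq") # one row per question/word pair

demo_counts
```

Here `count()` does the same job as `group_by()` followed by `summarise(frq = n())`, and it returns an ungrouped table.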

I don't understand what you mean by correlation, so I haven't gone there. Perhaps you can say more about it.

I am getting an error when I run this code:

> Error in mutate(squads, rn = row_number()) : object 'squads' not found

Am I missing some package updates, or is it something else?

You had inconsistent spelling in your post.
The first time it is squads, the second time sqauds. I went with the first one, so please review that.

Sorry, I didn't get it. What do you mean?

Your very first line of shared code was:

squads <- tibble::tribble(

Therefore my code assumes that your starting dataset is called squads.
If your data has a different name (for example df from read.csv), use that name instead.

Thanks, I got it. It ran and I got the frequencies. Do you know how I can get a table (in the console) of the frequencies in descending or ascending order? I want to see the highest frequencies.

slice_max(together, frq, n = 10)

for the top 10.
Change max to min (slice_min) for the bottom 10.
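
If you want the whole table sorted rather than just the top 10, `arrange()` is another option. Sorting the summary table (one row per question and word) instead of `together` also avoids seeing the same word repeated once per response. A small sketch, using a made-up `word_frq` table in place of `per_q_word_frq`:

```r
library(dplyr)

# Hypothetical stand-in for `per_q_word_frq`
word_frq <- tibble::tibble(
  question = c("Q1", "Q1", "Q2", "Q2"),
  answer   = c("Smooth", "Good", "Tasty", "Creamy"),
  frq      = c(3L, 1L, 4L, 2L)
)

frq_desc <- arrange(word_frq, desc(frq))  # highest frequency first
frq_asc  <- arrange(word_frq, frq)        # lowest frequency first

frq_desc
```

With the real data this would be something like `arrange(per_q_word_frq, desc(frq))`.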

@nirgrahamuk Thanks for the help. How can I get all the data? The console only shows a few rows, like

> # ... with 24 more rows
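
On the truncated output: tibbles only print the first few rows by default. One way around that, sketched here on a made-up `demo_tbl`, is to pass `n = Inf` to `print()`, or to convert to a plain data frame.

```r
library(tibble)

# Hypothetical 30-row tibble, long enough to be truncated by default
demo_tbl <- tibble(id = 1:30,
                   word = rep(c("Tasty", "Smooth", "Creamy"), 10))

# Print every row instead of the default preview
print(demo_tbl, n = Inf)

# A plain data frame is not truncated the way a tibble is
demo_df <- as.data.frame(demo_tbl)
```

In RStudio, `View(together)` opens the full table in a viewer pane, which may be more comfortable than the console for long results.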

For correlation, I was wondering if somehow I can show correlation in a scatterplot diagram where x-axis would be like 7 to 9 and words correlating with 7 would show near the origin and other words correlating with, say 9 would show farther to 7, in the top right corner. Like this:

image
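
A rough sketch of one way to get a plot like that, assuming "correlation" here can be read as the mean score of the responses that use each word (that reading is an assumption, not something settled in this thread). The `demo` data and the column names `word`, `frq` and `mean_score` are made up for illustration.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Hypothetical stand-in: first column is the score, the rest are comments
demo <- tibble::tribble(
  ~score, ~Q1,         ~Q2,
  7L,     "Good",      "Smooth",
  7L,     "Smooth",    "Hot",
  8L,     "Tasty",     "Natural",
  9L,     "Natural",   "Tasty",
  9L,     "Wholesome", "Tasty"
)

word_scores <- demo %>%
  pivot_longer(cols = where(is.character),
               names_to = "question", values_to = "word") %>%
  separate_rows(word) %>%
  mutate(word = tolower(word)) %>%
  group_by(word) %>%
  # how often each word occurs, and the average score it occurs with
  summarise(frq = n(), mean_score = mean(score), .groups = "drop")

# Words used mostly with score 7 land on the left, words used with 9 on the right
p <- ggplot(word_scores, aes(x = mean_score, y = frq, label = word)) +
  geom_text() +
  labs(x = "Mean score of responses using the word", y = "Frequency")
```

Calling `print(p)` draws the plot. With only 13 respondents the positions will be noisy, so treat the picture as descriptive rather than as evidence of a real relationship.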
