Writing code to do word counts for a large corpus

JPinsky · October 18, 2018, 3:35pm

Thank you to anyone who can lend some guidance.
I'm looking to calculate word frequencies for a database of transcribed interviews that I access online. I only want frequencies for the words in a dictionary that I have.
I have pasted the code I have so far, followed by the dictionary, below (I would reprex but I don't know how). Let me know if you have done something similar before or know a way I could do this.

rm(list=ls())
 
library(stringr)
library(dplyr)
library(tidytext)
library(tidyverse)
library(rvest)

main.page <- read_html(x = "http://www.asapsports.com/show_player.php?category=11&letter=a")
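urls <- main.page %>% # feed `main.page` to the next step
    html_nodes("tr+ tr td+ td") %>% # select the table cells that hold the player links
    str_sub(30, 79) %>% # slice a fixed character range out of each cell's markup
    str_subset("show_player") %>% # keep only the player-page URLs
    as_tibble() # as.tibble() is deprecated in favor of as_tibble()
colnames(urls) <- "urls"
names(urls)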

links <- main.page %>% # feed `main.page` to the next step
    html_nodes("tr+ tr td+ td") %>%
    str_sub(30, 100) %>%
    str_subset("show_player") %>%
    str_sub(53, 66) %>% # slice out the player name by position
    str_remove_all("[<>\\\\/.']") %>% # strip leftover markup characters and punctuation
    str_replace_all(",", "_") %>%
    str_remove_all(" ") %>%
    as_tibble()
colnames(links) <- "links"
names(links)

sotu <- data.frame(links = links, urls = urls, stringsAsFactors = FALSE)
head(sotu)

View(sotu)

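outfilea <- tibble(value = character()) # start empty so no blank first row is added
for (i in seq_len(nrow(sotu))) {
    text <- read_html(sotu$urls[i]) %>% # load the page (html() is deprecated; use read_html())
      html_nodes("td td tr:nth-child(1) b a") %>%
      html_attr("href") %>% # extract the interview URLs
      as_tibble()

    outfilea <- bind_rows(outfilea, text)
}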

outfilea
colnames (outfilea) <- "url"

text2 <- outfilea %>%
    filter(str_detect(url, "http")) %>% # keep only absolute URLs
    mutate(id = str_sub(url, -5, -1)) # use the last five characters as an interview id

View(outfilea)

outfileb <- tibble(value = character(), id = character()) # start empty so no blank first row is added

for (i in seq_len(nrow(text2))) {
  text <- read_html(text2$url[i]) %>% # load the page
    html_nodes("tr+ tr tr td~ td+ td") %>% # isolate the text
    html_text() %>%
    as_tibble() %>%
    mutate(id = paste0(text2$id[i], "basket")) %>% # tag each row with its interview id
    filter(!is.na(value)) %>%
    filter(grepl("[a-z]", value)) %>% # keep rows that contain lowercase letters
    # drop rows that are leftover JavaScript rather than transcript text
    filter(!grepl("var cx|function|var gcse|gcse.type|gcse.async", value))

  outfileb <- bind_rows(outfileb, text)
}

head(outfileb)

carefz1 <- c(" safe", " peace", " compassion", " help", " empath", " sympath", " protect", " secur", " benefit",
           " defen", " guard", " care", " caring", " shield", " shelter", " amity", " harm", " suffer", " warl",
           " fight", " violen", " hurt", " killer", " endanger", " cruel", " brutal", " abuse", " damag", " detriment",
           " crush", " attack", " annihilate", " impair", " war", " wars", " warring", " kill", " killing", " ravage",
           " destroy", " stomp", " spurn")
           
carefz1care <- tibble(word = carefz1, category = "care", code = 1)

fairfz1 <- c( " fair-", " fairmind", " equal", " justifi", " reciproc", " impartial", " egalitar", " unbias", " balance",
                " unprejudice", " fair", " fairly", " fairness", " fairplay", " justice", " justness", " rights", " equity",
                " evenness", " equivalent", " tolerant", " equable", " homologous", " reasonable", " constant", " unfair",
                " unequal", " bias", " unjust", " injust", " bigot", " discriminat", " disproportion", " prejud", " exclud", 
                " inequitable", " dishonest", " unscrupulous", " dissociate", " preference", " favoritism", " exclusion")

fairfz1fair <- tibble(word = fairfz1, category = "fair", code = 1)

# These terms concern group loyalty, so they are labeled "ingroup".
ingroupfz1 <- c(" nation", " homeland", " patriot", " commune", " communit", " communis", " comrad", " collectiv",
                " unite", " fellow", " devot", " cliqu", " together", " family", " families", " familial", " group",
                " communal", " cadre", " joint", " unison", " guild", " solidarity", " member", " cohort", " ally",
                " insider", " foreign", " enem", " individual", " deceiv", " jilt", " terroris", " immigra",
                " imposter", " miscreant", " spy", " sequester", " renegade")

ingroupfz1ingroup <- tibble(word = ingroupfz1, category = "ingroup", code = 1)

# These terms concern obedience and hierarchy, so they are labeled "authority".
authorityfz1 <- c(" obey", " obedien", " duti", " honor", " respectful", " order", " father", " mother", " tradition",
                  " hierarch", " authorit", " status", " rank", " leader", " caste", " complian", " submi", " allegian",
                  " defere", " revere", " venerat", " duty", " law", " respect", " permit", " permission", " class",
                  " bourgeoisie", " position", " command", " supremacy", " control", " serve", " abide", " comply",
                  " defian", " rebel", " dissent", " subver", " disrespect", " disobe", " sediti", " agitat", " insubordinat",
                  " illegal", " lawless", " defy", " riot", " insurgent", " mutinous", " dissident", " unfaithful",
                  " alienate", " defector", " nonconformist", " oppose", " protest", " refuse", " denounce", " remonstrate", " obstruct")

authorityfz1authority <- tibble(word = authorityfz1, category = "authority", code = 1)

purityfz1 <- c(" pure", " clean", " steril", " sacred", " chast", " saint", " celiba", " abstinen",
                 " church", " purity", " holy", " holiness", " abstention", " virgin", " austerity", " modesty",
                 " abstemiousness", " limpid", " unadulterated", " maiden", " virtuous", " refined", " immaculate",
                 " innocent", " pristine", " disgust", " deprav", " disease", " unclean", " contagio", " sinful",
                 " sinner", " slut", " dirt", " profan", " repuls", " sick", " promiscu", " lewd", " adulter",
                 " debauche", " defile", " prostitut", " filth", " obscen", " taint", " stain", " tarnish",
                 " debase", " desecrat", " exploitat", " sin", " whore", " impiety", " impious", " gross",
                 " tramp", " unchaste", " intemperate", " wanton", " profligate", " trashy", " lax", " blemish", " pervert")

purityfz1purity <- tibble(word = purityfz1, category = "purity", code = 1)

foundationsdictionary <- rbind (purityfz1purity, ingroupfz1ingroup, authorityfz1authority,
                                  fairfz1fair,carefz1care )

mara · October 18, 2018, 3:42pm

Have you looked at the "Analyzing word and document frequency" chapter of Tidy Text Mining? It takes you through the process of getting word counts step-by-step in a really nice way.
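
To give a flavor of the approach in that chapter, here is a minimal sketch rather than the book's code verbatim. The `interviews` tibble and its `id` and `text` columns are made-up stand-ins for scraped transcripts; `unnest_tokens()` (tidytext) and `count()` (dplyr) do the actual work.

```r
library(dplyr)
library(tidytext)

# Toy data standing in for scraped interviews (invented for illustration).
interviews <- tibble(
  id   = c("int1", "int2"),
  text = c("We played hard and we played fair.",
           "The team played together.")
)

word_counts <- interviews %>%
  unnest_tokens(word, text) %>%   # one lowercase word per row, punctuation dropped
  count(id, word, sort = TRUE)    # frequency of each word within each interview

word_counts
```

Once the scraped text sits in a data frame with one row per passage, the same two steps should apply unchanged.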

Right now, you have quite a bit of code here, and it's not immediately clear which part is problematic, which is a key piece of a minimal reproducible example. (Note also that it's best not to include rm(list=ls()) in your example, as that would remove everything from the environment of anyone else trying to reproduce your issue.)
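
As a rough illustration of what "minimal" could mean here, the sketch below swaps the live web page for a tiny inline data frame and keeps only one step. The `transcript` data and the choice of the filtering step are assumptions made up for the example, not a diagnosis of the code above.

```r
library(dplyr)
library(stringr)

# Inline data instead of a scraped page, so anyone can run it as-is.
transcript <- tibble(
  id    = c("int1", "int2"),
  value = c("We played fair tonight.", "var cx = '0123';")
)

# The single step under discussion: drop rows that are script residue.
cleaned <- transcript %>%
  filter(!str_detect(value, "var cx"))

cleaned

# With a snippet like this copied to the clipboard, reprex::reprex()
# renders the code together with its output, ready to paste into a post.
```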

For learning how to make a reprex, check out the community reprex FAQ. There's also a video tutorial that will take you through the process:

JPinsky · October 18, 2018, 6:01pm

Thank you for the response. I'm new to this, so I truly appreciate the help!
I have read that chapter of Tidy Text Mining, but my corpus is online. I want to scrape the text from the website first and then run the word frequency count on what I gather. Once I can do that, I think I can follow the chapters of the book for the count itself.
As for my code, the bottom half is just the dictionary I want to use. I don't know if that makes the code I have more digestible.
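
A rough sketch of the dictionary-counting step described above, under stated assumptions: the `docs` tibble is invented, the `stems` list holds only a handful of entries from two categories, and the stems are written without the leading space. After tokenizing there are no spaces left to match, so the sketch anchors each stem at the start of a word with `^` instead.

```r
library(dplyr)
library(stringr)
library(tidytext)

# Made-up transcripts standing in for the scraped interviews.
docs <- tibble(
  id   = c("int1", "int2"),
  text = c("We want to protect the players, keep them safe, and keep the game fair.",
           "Nobody was harmed, and the referees treated both teams equally.")
)

# A few stems per category (illustrative subset, leading spaces removed).
stems <- list(
  care = c("safe", "protect", "harm"),
  fair = c("fair", "equal", "unjust")
)

# One anchored regex per category, e.g. "^(safe|protect|harm)".
patterns <- vapply(
  stems,
  function(s) paste0("^(", paste(s, collapse = "|"), ")"),
  character(1)
)

counts <- docs %>%
  unnest_tokens(word, text) %>%   # one lowercase word per row
  group_by(id) %>%
  summarise(
    care = sum(str_detect(word, patterns[["care"]])),  # words starting with a care stem
    fair = sum(str_detect(word, patterns[["fair"]]))   # words starting with a fair stem
  )

counts
```

Prefix matching is deliberately loose: a stem such as "war" would also catch "warm", so short stems may need to be matched as whole words instead.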

mara · October 18, 2018, 6:20pm

Can you isolate the part you're having trouble with? With a reprex, you can see the input and output of the code. Without it, someone else has to run what you have to figure out what works and what doesn't.

This is a really useful guide to scraping with rvest (there's also a thread here on community with some really useful tutorials):
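
For orientation, the core rvest pattern is small: select nodes with a CSS selector, then pull out either their text or an attribute. The sketch below runs on an inline HTML string whose markup, URLs, and names are invented for illustration; the selectors for a real site would differ.

```r
library(dplyr)
library(rvest)

# Inline HTML standing in for a real page (markup invented for illustration).
page <- read_html('<table>
    <tr><td><a href="http://example.com/interview?id=101">Smith, Anna</a></td></tr>
    <tr><td><a href="http://example.com/interview?id=102">Jones, Bo</a></td></tr>
  </table>
  <p class="transcript">We just tried to play together as a team.</p>')

player_links <- page %>% html_nodes("td a")   # every link inside a table cell

links <- tibble(
  name = html_text(player_links),             # visible link text
  url  = html_attr(player_links, "href")      # link target, read from the attribute
)

transcript <- page %>%
  html_nodes("p.transcript") %>%              # paragraphs with class "transcript"
  html_text()

links
transcript
```

Reading the `href` attribute directly avoids cutting URLs out of the markup by character position, which tends to break as soon as a name or URL changes length.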

JPinsky · October 18, 2018, 6:28pm

Hmmm, okay. I see what you mean. I'll look through the resources you've posted (reprex and rvest) and see if I can present my problem better.
Thank you for the help!
Speak to you soon