Reorder bar graph by descending n to show most frequent words of a corpus of TXT files

Rehlein · May 26, 2021, 4:16pm

I'm having trouble ordering my bar graph by descending n. The graph is supposed to display the most frequent words in a corpus of TXT files. Others have told me that the plotting code should work, so I suspect I may be reading the files in incorrectly.

# create minimal dataset:
# create two TXT files
# content of first TXT file: aaa bbb ccc
# content of second TXT file: aaa bbb bbb
# save both files to a folder called TXTs in current working directory

# load packages
library("tidyr")
library("dplyr")
library("purrr")
library("readr")
library("tidytext")
library("ggplot2")

# function to read all files from folder into dataframe
read_folder <- function(infolder) {
  tibble(file = dir(infolder, full.names = TRUE)) %>%
    mutate(text = map(file, read_lines)) %>%
    transmute(id = basename(file), text) %>%
    unnest(text)
}

# create corpus from folder with TXT files
raw_text <- read_folder("TXTs")
tidy_text <- raw_text %>%
  group_by(id) %>%
  unnest_tokens(word, text)

# count most frequent words
# and display in descending order
# ATTEMPT #1
tidy_text %>%
  dplyr::count(word, sort = TRUE) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip()

# count most frequent words
# and display in descending order
# ATTEMPT #2
tidy_text %>%
  dplyr::count(word, sort = TRUE) %>%
  ggplot(aes(x = reorder(factor(word), n), y = n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip()

Neither of these two attempts provides the desired output. The order in the graph should be bbb-aaa-ccc, but it is bbb-ccc-aaa. Thank you!
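
For anyone who wants to reproduce this, the two test files described in the comments at the top of the code could be created roughly like this. It is only a sketch in base R: the file names and the txt_dir variable are made up for illustration, and it writes to a temporary directory rather than the working directory, so txt_dir would be passed to read_folder() in place of "TXTs".

```r
# Create the two-file minimal corpus in a temporary folder.
# File names and txt_dir are illustrative, not from the original setup.
txt_dir <- file.path(tempdir(), "TXTs")
dir.create(txt_dir, showWarnings = FALSE)

writeLines("aaa bbb ccc", file.path(txt_dir, "file1.txt"))
writeLines("aaa bbb bbb", file.path(txt_dir, "file2.txt"))

# Confirm that both files were written.
list.files(txt_dir)
```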

Leon · May 26, 2021, 4:22pm

Try playing around with fct_reorder(word, desc(n)) from the forcats package.
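
As a rough sketch of what that looks like, assuming corpus-wide counts for the two-file example (the word_counts tibble below is typed in by hand, not produced by the code in the question), the variants order the factor levels like this:

```r
library(dplyr)
library(forcats)

# Hand-typed corpus-wide counts for the two-file example (an assumption).
word_counts <- tibble(word = c("aaa", "bbb", "ccc"), n = c(2L, 3L, 1L))

# Ascending by n: the level with the smallest count comes first.
asc_levels <- levels(fct_reorder(word_counts$word, word_counts$n))

# Two ways to get descending order: negate the values or use .desc = TRUE.
desc_levels  <- levels(fct_reorder(word_counts$word, desc(word_counts$n)))
desc_levels2 <- levels(fct_reorder(word_counts$word, word_counts$n, .desc = TRUE))

asc_levels   # "ccc" "aaa" "bbb"
desc_levels  # "bbb" "aaa" "ccc"
```

Keep in mind that coord_flip() draws the first level at the bottom of the axis, so the ascending version is the one that puts the most frequent word at the top of a flipped bar chart.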

Rehlein · May 26, 2021, 5:54pm

Thank you very much for the tip! A friendly Stack Overflow user actually found the key issue. I'm copying their answer to my question here in case anyone is interested.

The problem is that your tidy_text tibble is still grouped. I'm not sure why you are grouping at all; I think tidy_text <- raw_text %>% unnest_tokens(word, text) would work just fine. The group_by() interferes with the mutate(), so reorder() can't see all the values.
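
To make the effect of the grouping visible, here is a small sketch. The tidy_text tibble is rebuilt by hand as a stand-in for the output of unnest_tokens() on the two test files, and the file names and the names grouped_counts, fixed_counts, plot_data and p are made up for illustration. On the grouped tibble, count() tallies each word per file, so words that occur in both files end up on several rows; after ungroup() (or without the group_by() in the first place) there is one row per word with the corpus-wide total.

```r
library(dplyr)
library(ggplot2)

# Stand-in for the grouped tidy_text from the question (file names are made up).
tidy_text <- tibble(
  id   = rep(c("file1.txt", "file2.txt"), each = 3),
  word = c("aaa", "bbb", "ccc", "aaa", "bbb", "bbb")
) %>%
  group_by(id)

# On a grouped tibble, count() tallies each word within each file.
grouped_counts <- tidy_text %>% count(word, sort = TRUE)
nrow(grouped_counts)  # 5 rows, one per (id, word) pair

# Dropping the grouping first gives corpus-wide totals.
fixed_counts <- tidy_text %>%
  ungroup() %>%
  count(word, sort = TRUE)
fixed_counts  # bbb 3, aaa 2, ccc 1

# With one row per word, reorder() sorts the levels by the total count.
plot_data <- fixed_counts %>% mutate(word = reorder(word, n))

p <- ggplot(plot_data, aes(x = word, y = n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip()
p
```

With the corpus-wide totals, the flipped chart shows bbb at the top, then aaa, then ccc.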
