Reorder bar graph by descending n to show most frequent words of a corpus of TXT files

I have a problem with ordering my bar graph by descending n. The graph is supposed to display the most frequent words in a corpus of TXT files. I suspect I may be reading the files in incorrectly, because others have told me that the plotting code should work.

# create minimal dataset:
# create two TXT files
# content of first TXT file: aaa bbb ccc
# content of second TXT file: aaa bbb bbb
# save both files to a folder called TXTs in current working directory

# load packages
library("tidyr")
library("dplyr")
library("purrr")
library("readr")
library("tidytext")
library("ggplot2")

# function to read all files from folder into dataframe
read_folder <- function(infolder) {
  tibble(file = dir(infolder, full.names = TRUE)) %>%
    mutate(text = map(file, read_lines)) %>%
    transmute(id = basename(file), text) %>%
    unnest(text)
}

# create corpus from folder with TXT files
raw_text <- read_folder("TXTs")
tidy_text <- raw_text %>%
  group_by(id) %>%
  unnest_tokens(word, text)

# count most frequent words
# and display in descending order
# ATTEMPT #1
tidy_text %>%
  dplyr::count(word, sort = TRUE) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip()

# count most frequent words
# and display in descending order
# ATTEMPT #2
tidy_text %>%
  dplyr::count(word, sort = TRUE) %>%
  ggplot(aes(x = reorder(factor(word), n), y = n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip()

Neither of these two attempts provides the desired output. The order in the graph should be bbb-aaa-ccc, but it is bbb-ccc-aaa. Thank you!

Try playing around with fct_reorder(word, desc(n)) from the forcats package 🙂
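To make the fct_reorder() suggestion concrete, here is a minimal sketch. It assumes the corpus-wide counts that the two example files should produce (aaa = 2, bbb = 3, ccc = 1); the `counts` tibble and the variable names are made up for illustration and are not part of the original code.

```r
library(dplyr)    # for tibble() and desc()
library(forcats)  # for fct_reorder()

# Hypothetical corpus-wide counts, matching the example files in the question
counts <- tibble(word = c("aaa", "bbb", "ccc"), n = c(2, 3, 1))

# Levels ordered by increasing n
ascending <- fct_reorder(counts$word, counts$n)

# Levels ordered by decreasing n, as in the suggestion above
descending <- fct_reorder(counts$word, desc(counts$n))

levels(ascending)   # "ccc" "aaa" "bbb"
levels(descending)  # "bbb" "aaa" "ccc"
```

Keep in mind that ggplot2 draws the first factor level closest to the origin, so after coord_flip() the first level ends up at the bottom of the chart. Which of the two orderings you want depends on whether the longest bar should sit at the top or at the bottom.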

Thank you very much for the tip! A friendly Stack Overflow user actually found the key issue. I'm copying their answer to my question here in case anyone is interested.

The problem is that your tidy_text tibble is still grouped by id. I'm not sure why you are grouping at all; I think tidy_text <- raw_text %>% unnest_tokens(word, text) would work just fine. Because of the group_by(), count() tallies each word per file rather than across the whole corpus, and the reorder() inside mutate() runs within each group, so it can't see all the values.
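Here is a minimal sketch of that fix. To keep it self-contained, it builds the two example documents inline instead of reading them from the TXTs folder, so the `raw_text` tibble below is a stand-in for the output of read_folder(), and `word_counts` is a name introduced only for this example.

```r
library(dplyr)
library(tidytext)
library(ggplot2)

# Stand-in for read_folder("TXTs"): one row per file, contents as in the question
raw_text <- tibble(
  id   = c("1.txt", "2.txt"),
  text = c("aaa bbb ccc", "aaa bbb bbb")
)

# Tokenize WITHOUT group_by(id), so count() tallies across the whole corpus
word_counts <- raw_text %>%
  unnest_tokens(word, text) %>%
  count(word, sort = TRUE) %>%
  mutate(word = reorder(word, n))  # factor levels ordered by increasing n

# Same plot as in the question; after coord_flip() the largest n is at the top
p <- ggplot(word_counts, aes(x = word, y = n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip()

print(p)
```

If the grouping is needed elsewhere in the analysis, an alternative is to keep the group_by() and call ungroup() on tidy_text right before count().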
