HELP !! I can't figure out how to do a simple bar chart.

dumbaf · August 7, 2021, 5:01pm

Hey guys! My dissertation is due very very soon and I am s t r u g g l i n g with RStudio. I would like to create a bar chart of some categorical variables that I have gotten through quanteda. I am a total beginner, have been trying all day to make this, and don't really have the time to spare to keep looking at YouTube tutorials.

I need to make exactly this (but with my data - also I hope Christine doesn't mind my using this as a demonstration, I took it from her R query here):

My question is, how did she create the fill variables, "genhlth" ???

I have tried merging columns from the data set that I am using with the "cbind()" function, but I assume the issue is that they need to be assigned to each variable. I know it's hard to explain but my code is terrible, so I don't think it will help. Can someone just explain to me the basics of getting that kind of bar chart from a categorical dataset? Unfortunately the tutorials only explain how to create the chart, not how to create the identifying variables. I am very embarrassed as I am sure this is very simple, but will be forever grateful for any sort of help possible!!!

Sincerely,

Desperate

FJCC · August 7, 2021, 5:25pm

The data set has to have a value for both marital and genhlth in every row. Without knowing the structure of the data you have, I cannot give any guidance on how to get from where you are to where you need to be. Please post the output of

dput(head(DF))

where DF is the name of the variable containing your data. Put a line with three back ticks just before and just after your output.
```
Your output
```
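
In case it helps to see why dput() is useful: it prints R code that rebuilds the object, so anyone can paste it and get your data back. A minimal sketch, where DF is a made-up stand-in and not your real data:

```r
# Hypothetical stand-in for your own data frame
DF <- data.frame(care = c(15, 18), newspaper = c("Guardian", "Guardian"))

# dput() prints R code that recreates the object
dput(head(DF))

# Evaluating that printed text gives the same data back
rebuilt <- eval(parse(text = capture.output(dput(head(DF)))))
```

Here rebuilt should match head(DF), which is what makes the pasted output something other people can run directly.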

Riffomonas · August 7, 2021, 5:27pm

Hi there -

If you look at the code that generated their figure, the first line is...

ggplot(sleep_cleaned, aes(x=marital, fill=genhlth))

This is using a data frame called sleep_cleaned, which has a column named marital and one named genhlth. Within each marital category, the geom_bar function counts the number of times each rating shows up in the genhlth column. For your purposes you need to have a single column with all of the data you want in it. So, if you have a separate column for each level of the variable, those need to be "tidied" into one column, likely using something like pivot_longer.
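
To make the counting concrete, here is a small sketch. The survey data frame is invented for illustration (it is not the real sleep_cleaned data); it just has one row per respondent and the same two column names.

```r
library(ggplot2)

# Hypothetical data: one row per respondent, one column per variable
survey <- data.frame(
  marital = c("Married", "Married", "Single", "Single", "Single"),
  genhlth = c("Good", "Fair", "Good", "Good", "Fair")
)

# geom_bar() draws the same counts this cross-tabulation shows
counts <- table(survey$marital, survey$genhlth)

# Bar height = number of rows per marital value, split by genhlth
p <- ggplot(survey, aes(x = marital, fill = genhlth)) +
  geom_bar()
```

Printing counts shows the numbers, and print(p) draws the chart. The point is that genhlth is just an ordinary column holding the category for each row.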

I'd strongly encourage you to check out online resources for learning the tidyverse. I have a bias, but I think this series of tutorials would be of use to you in developing your R skills with the tidyverse.

Good luck and welcome to the R community!
Pat

dumbaf · August 7, 2021, 5:40pm

Thank you so much for the quick reply!!! This is what I have got:

 care = c(15, 18, 17, 11, 23, 7), fairness = c(7, 1, 9, 3, 
1, 1), loyalty = c(10, 9, 11, 9, 9, 5), authority = c(17, 2, 
22, 4, 6, 4), sanctity = c(1, 2, 11, 2, 11, 3), newspaper = c("Guardian", 
"Guardian", "Guardian", "Guardian", "Guardian", "Guardian")), row.names = c(NA, 
6L), class = "data.frame")

What I have so far for my simple barplot is this code:

ggplot(data = merged_data, aes(x= newspaper)) + geom_bar()

This gives me the total values for each newspaper in my dataset (Guardian, Daily Mail, Telegraph, Mirror).
However, each one of these has different word counts for each foundation ("care", "fairness", "loyalty", "authority", "sanctity"). That is what I would like to put in the bar chart - my foundations would replace the "genhlth" of my previous example. I'm just not sure how to make R understand this. Thank you a million trillion times for your help!!!!!

FJCC · August 7, 2021, 5:52pm

I just saw that the data need to be totaled before plotting. I'll be back in a few minutes!

There are two things you need to change. First, you need to pivot the data from the wide format that you have to a long format. I did that with the pivot_longer function from the package tidyr. Notice the difference between the data you have, which I called DF, and the pivoted data that I called DFlong. Second, you should use the geom_col() function for the plot rather than geom_bar(). The difference is that geom_bar() counts how many rows fall into each combination of variables in the data, while geom_col() is used when you already have a column of counts, as in your data.

library(ggplot2)
#> Warning: package 'ggplot2' was built under R version 4.0.5
library(tidyr)
DF <- data.frame(care = c(15, 18, 17, 11, 23, 7), 
                 fairness = c(7, 1, 9, 3, 1, 1), 
                 loyalty = c(10, 9, 11, 9, 9, 5), 
                 authority = c(17, 2, 22, 4, 6, 4), 
                 sanctity = c(1, 2, 11, 2, 11, 3), 
                 newspaper = c("Guardian", "Guardian", "Guardian", "Guardian", "Guardian", "Guardian"))
DF
#>   care fairness loyalty authority sanctity newspaper
#> 1   15        7      10        17        1  Guardian
#> 2   18        1       9         2        2  Guardian
#> 3   17        9      11        22       11  Guardian
#> 4   11        3       9         4        2  Guardian
#> 5   23        1       9         6       11  Guardian
#> 6    7        1       5         4        3  Guardian
DFlong <- pivot_longer(data = DF, cols = care:sanctity, names_to = "Foundation", values_to = "Count")
DFlong
#> # A tibble: 30 x 3
#>    newspaper Foundation Count
#>    <chr>     <chr>      <dbl>
#>  1 Guardian  care          15
#>  2 Guardian  fairness       7
#>  3 Guardian  loyalty       10
#>  4 Guardian  authority     17
#>  5 Guardian  sanctity       1
#>  6 Guardian  care          18
#>  7 Guardian  fairness       1
#>  8 Guardian  loyalty        9
#>  9 Guardian  authority      2
#> 10 Guardian  sanctity       2
#> # ... with 20 more rows
ggplot(DFlong, aes(x=newspaper, y = Count, fill = Foundation)) + geom_col(position = "dodge")

Created on 2021-08-07 by the reprex package (v0.3.0)
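
If you want to check the geom_bar() versus geom_col() difference for yourself, here is a rough sketch. The toy data frame is made up, and layer_data() is only used to peek at the bar heights ggplot2 computes.

```r
library(ggplot2)

# Hypothetical long-format data with a precomputed Count column
toy <- data.frame(
  newspaper  = c("Guardian", "Guardian", "Mirror", "Mirror"),
  Foundation = c("care", "fairness", "care", "fairness"),
  Count      = c(15, 7, 11, 3)
)

# geom_bar() ignores Count and tallies rows instead
p_bar <- ggplot(toy, aes(x = newspaper, fill = Foundation)) +
  geom_bar(position = "dodge")

# geom_col() uses the y aesthetic as the bar height
p_col <- ggplot(toy, aes(x = newspaper, y = Count, fill = Foundation)) +
  geom_col(position = "dodge")

# Heights that each plot would draw
bar_heights <- layer_data(p_bar)$count
col_heights <- layer_data(p_col)$y
```

Each newspaper/Foundation pair appears in exactly one row, so every geom_bar() bar has height 1, while the geom_col() bars take the values from Count.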

FJCC · August 7, 2021, 5:58pm

Here is a better answer.

library(ggplot2)
#> Warning: package 'ggplot2' was built under R version 4.0.5
library(tidyr)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
DF <- data.frame(care = c(15, 18, 17, 11, 23, 7), 
                 fairness = c(7, 1, 9, 3, 1, 1), 
                 loyalty = c(10, 9, 11, 9, 9, 5), 
                 authority = c(17, 2, 22, 4, 6, 4), 
                 sanctity = c(1, 2, 11, 2, 11, 3), 
                 newspaper = c("Guardian", "Guardian", "Guardian", "Guardian", "Guardian", "Guardian"))
DF
#>   care fairness loyalty authority sanctity newspaper
#> 1   15        7      10        17        1  Guardian
#> 2   18        1       9         2        2  Guardian
#> 3   17        9      11        22       11  Guardian
#> 4   11        3       9         4        2  Guardian
#> 5   23        1       9         6       11  Guardian
#> 6    7        1       5         4        3  Guardian
DFlong <- pivot_longer(data = DF, cols = care:sanctity, names_to = "Foundation", values_to = "Count")
DFlong
#> # A tibble: 30 x 3
#>    newspaper Foundation Count
#>    <chr>     <chr>      <dbl>
#>  1 Guardian  care          15
#>  2 Guardian  fairness       7
#>  3 Guardian  loyalty       10
#>  4 Guardian  authority     17
#>  5 Guardian  sanctity       1
#>  6 Guardian  care          18
#>  7 Guardian  fairness       1
#>  8 Guardian  loyalty        9
#>  9 Guardian  authority      2
#> 10 Guardian  sanctity       2
#> # ... with 20 more rows
DFlongSum <- DFlong %>% group_by(newspaper, Foundation) %>% 
  summarize(Total = sum(Count))
#> `summarise()` regrouping output by 'newspaper' (override with `.groups` argument)
DFlongSum
#> # A tibble: 5 x 3
#> # Groups:   newspaper [1]
#>   newspaper Foundation Total
#>   <chr>     <chr>      <dbl>
#> 1 Guardian  authority     55
#> 2 Guardian  care          91
#> 3 Guardian  fairness      22
#> 4 Guardian  loyalty       53
#> 5 Guardian  sanctity      30
ggplot(DFlongSum, aes(x=newspaper, y = Total, fill = Foundation)) + geom_col(position = "dodge")

Created on 2021-08-07 by the reprex package (v0.3.0)
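
A side note on the `summarise()` regrouping message above: it is informational, not an error. If you would rather get an ungrouped result and no message, one option is the .groups argument (this assumes dplyr 1.0.0 or later). A cut-down sketch with only two foundations and invented values:

```r
library(dplyr)
library(tidyr)

# Shortened stand-in data, two foundations only
DF <- data.frame(
  care      = c(15, 18),
  fairness  = c(7, 1),
  newspaper = c("Guardian", "Guardian")
)

# Pivot and total in one pipeline; .groups = "drop" returns an ungrouped tibble
DFlongSum <- DF %>%
  pivot_longer(cols = care:fairness, names_to = "Foundation", values_to = "Count") %>%
  group_by(newspaper, Foundation) %>%
  summarize(Total = sum(Count), .groups = "drop")
```

The totals are the same either way; only the grouping of the returned table changes.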

dumbaf · August 7, 2021, 6:19pm

THANK YOU SO MUCH! Really, thank you thank you thank you!

One final thing, my data set for each newspaper contains many observations. In the Guardian for example, I have analysed 188 documents (see corpus below). When creating the vectors for the new data frame with the command c() , do I have to input each value manually, or is there a more obvious way of doing it? Sorry for all these questions, you've been so much help already, I don't think I could have come up with this by myself!

guardian_mf <- dfm(guardian_corpus, dictionary = data_dictionary_MFD_GB)
guardian_mf_df <- convert(guardian_mf, to = "data.frame")

mail_mf <- dfm(mail_corpus, dictionary = data_dictionary_MFD_GB)
mail_mf_df <- convert(mail_mf, to = "data.frame")

telegraph_mf <- dfm(telegraph_corpus, dictionary = data_dictionary_MFD_GB)
telegraph_mf_df <- convert(telegraph_mf, to = "data.frame")

mirror_mf <- dfm(mirror_corpus, dictionary = data_dictionary_MFD_GB)
mirror_mf_df <- convert(mirror_mf, to = "data.frame")

No worries if it's too much of a bother, really, I am so grateful already! What a great community!

FJCC · August 7, 2021, 6:31pm

I am not familiar with the functions you are using, so I might lead you astray. It would be better to start a separate topic, which will be more likely to attract the attention of people with the right knowledge.
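
That said, the stacking step itself should not depend on quanteda, and you should not need to type values into c() by hand. Here is a rough sketch using small invented stand-ins for guardian_mf_df and mail_mf_df. I am assuming each converted data frame has a doc_id column plus one numeric column per foundation, so check that against your real output.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Invented stand-ins; assumed layout is doc_id + one column per foundation
guardian_mf_df <- data.frame(doc_id = c("text1", "text2"),
                             care = c(15, 18), fairness = c(7, 1))
mail_mf_df <- data.frame(doc_id = c("text1", "text2", "text3"),
                         care = c(4, 6, 2), fairness = c(3, 0, 5))

# Stack the frames; the list names become the newspaper column
merged <- bind_rows(
  list(Guardian = guardian_mf_df, `Daily Mail` = mail_mf_df),
  .id = "newspaper"
)

# Pivot every foundation column to long format, then total per newspaper
totals <- merged %>%
  pivot_longer(cols = -c(newspaper, doc_id),
               names_to = "Foundation", values_to = "Count") %>%
  group_by(newspaper, Foundation) %>%
  summarize(Total = sum(Count), .groups = "drop")

p <- ggplot(totals, aes(x = newspaper, y = Total, fill = Foundation)) +
  geom_col(position = "dodge")
```

With your real data you would put all four data frames in the list, and print(p) draws the chart.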
