Paired boxplot of two categorical variables, using ggplot and tidyverse

Data: I have a list of specimens that have been taxonomically identified (noted as 'Genus') and assigned as either Maastrichtian or Danian in age (noted as 'Preliminary ID'). Example: Specimen #5 is identified as Adocus, and it's from the Danian. There are about 15 different genera and two age options.

Goal: I would like to make a boxplot that pairs the percentage each genus makes up of the Maastrichtian assemblage with the percentage it makes up of the Danian assemblage. I have way more Maastrichtian specimens than Danian ones, so I need to normalize this data (i.e., Adocus makes up 10% of total Danian turtles, Aspideretoides makes up 20% of total Maastrichtian turtles, etc.).

I'm using this tutorial: "Using ggplot to create bar charts for 2 categorical variables. R programming for beginners." (YouTube)

Here is my code:

```
my_data%>%
  drop_na(`Preliminary ID`)%>%
  drop_na(Genus)%>%
  ggplot(aes(Genus,fill(`Preliminary ID`)))
  geom_bar(position="dodge",
             alpha(0.5))+
  labs(title="Maastrichtian Vs Danian Turtle Assemblage",
       x="Genus",
       y="Number")+
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
```
**Error Message:**

    > my_data%>%
    +   drop_na(`Preliminary ID`)%>%
    +   drop_na(Genus)%>%
    +   ggplot(aes(Genus,fill(`Preliminary ID`)))
    Error in `geom_blank()`:
    ! Problem while computing aesthetics.
    ℹ Error occurred in the 1st layer.
    Caused by error in `UseMethod()`:
    ! no applicable method for 'fill' applied to an object of class "character"
    Run `rlang::last_error()` to see where the error occurred.
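
For what it's worth, the error text points at `fill(...)` being called as a function inside `aes()`. That is most likely `tidyr::fill()`, which has no method for a character column. Below is a minimal sketch of what the call was probably meant to be, assuming a small made-up data frame `toy` in place of `my_data` (the toy values are hypothetical, not the real specimens). `fill` and `alpha` are passed as named arguments with `=`, and a `+` joins `ggplot()` to `geom_bar()`.

```r
library(ggplot2)

# Hypothetical stand-in for my_data; only the two columns used in the plot.
toy <- data.frame(
  Genus = c("Adocus", "Adocus", "Basilemys",
            "Compsemys", "Compsemys", "Compsemys"),
  `Preliminary ID` = c("Danian", "Maastrichtian", "Maastrichtian",
                       "Danian", "Danian", "Maastrichtian"),
  check.names = FALSE  # keep the embedded blank in the column name
)

p <- ggplot(toy, aes(x = Genus, fill = `Preliminary ID`)) +  # fill = ..., not fill(...)
  geom_bar(position = "dodge", alpha = 0.5) +                # alpha = ..., not alpha(...)
  labs(title = "Maastrichtian Vs Danian Turtle Assemblage",
       x = "Genus",
       y = "Number") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))

print(p)
```

This still plots raw counts, so it does not yet address the normalization part of the question.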

Hard to say without seeing what the data looks like. Try making `Preliminary ID` a factor, and be kind to yourself as an analyst: no embedded blanks in column names. Use short names just long enough to keep the meaning in mind, and replace them with more discursive labels when it's time to make a presentation table.
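
A rough sketch of that suggestion, assuming a tiny made-up data frame `toy` and the hypothetical short names `age` and `genus` (neither name comes from the original data):

```r
library(dplyr)

# Hypothetical stand-in data with the original column names.
toy <- data.frame(
  `Preliminary ID` = c("Maastrichtian", "Danian", NA),
  Genus = c("Adocus", "Adocus", "Basilemys"),
  check.names = FALSE
)

clean <- toy %>%
  # Short names without blanks, so no backticks are needed later.
  rename(age = `Preliminary ID`, genus = Genus) %>%
  # Explicit level order controls the order of bars and legend entries.
  mutate(age = factor(age, levels = c("Maastrichtian", "Danian")))

str(clean)
```

The longer labels can be put back at plotting time, for example with `labs(fill = "Preliminary ID")`.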

As a follow-up to technocrat's reply:

A handy way to supply some sample data is the dput() function. Just run dput(mydata), where mydata is your data, then copy the output and paste it here. For a large dataset, something like dput(head(mydata, 100)) should supply the data we need. In this case, make sure that the sample contains both Maastrichtian and Danian specimens.
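
If `head()` alone happens to return only one age, one possible way to get a sample with both is to take a few rows per group before calling `dput()`. This is only a sketch: the `mydata` below is made up, and `n = 2` is an arbitrary choice.

```r
library(dplyr)

# Hypothetical stand-in for the real data.
mydata <- data.frame(
  Genus = c("Adocus", "Basilemys", "Compsemys", "Adocus", "Compsemys"),
  `Preliminary ID` = c("Maastrichtian", "Maastrichtian", "Maastrichtian",
                       "Danian", "Danian"),
  check.names = FALSE
)

smp <- mydata %>%
  group_by(`Preliminary ID`) %>%
  slice_head(n = 2) %>%   # first 2 rows of each age group
  ungroup() %>%
  as.data.frame()         # plain data frame keeps the dput() output short

dput(smp)
```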


This is a useful tip, thank you. Unfortunately, the data includes sensitive locality information that I don't think I'm allowed to share too much publicly, so I've only provided the first 10 datapoints.
I have a filter command that drops all missing data, which I couldn't get to work with the dput command, but that's neither here nor there.
"Preliminary ID" was renamed to "InSituAge" for clarity, but otherwise everything else is the same.

structure(list(`Specimen Number` = c("MOR 10851", "MOR 089",
"MOR 1148", "MOR 1149", "MOR 3013", "MOR 3097", NA, "MOR9735",
NA, NA), `Site #` = c("HC530", NA, "HC 304", "HC250", "HC31",
"HC 743", "HC977", "HC1145", "HC996", "HC996"), `Conservative ID` = c(NA,
NA, NA, NA, NA, NA, NA, NA, "Trionychidae", "Trionychidae"),
Genus = c("Atopsemys", "Compsemys", "Emarginochelys", "Helopanoplia",
"Helopanoplia", NA, "Derrisemys", "Basilemys", "Hutchemys",
"Helopanoplia"), Species = c("superstes", "victa", NA, NA,
NA, NA, NA, NA, NA, NA), Formation = c("Hell Creek", NA,
"Hell Creek", "Hell Creek", "Hell Creek", "Hell Creek", "Hell Creek",
"Hell Creek", "Hell Creek", "Hell Creek"), DepositionalAge = c("Maastrichtian",
NA, "Maastrichtian", "Maastrichtian", "Maastrichtian", "Maastrichtian",
"Maastrichtian", "Maastrichtian", "Maastrichtian", "Maastrichtian"
), InSituAge = c("Maastrichtian", NA, "Maastrichtian", "Maastrichtian",
"Maastrichtian", "Maastrichtian", "Maastrichtian", "Maastrichtian",
"Maastrichtian", "Maastrichtian"), Institution = c("MOR",
"MOR", "MOR", "MOR", "MOR", "MOR", "MOR", "MOR", "MOR", "MOR"
), LatitudeDec = c(NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_
)), row.names = c(NA, -10L), class = c("tbl_df", "tbl", "data.frame"
))
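
Since the sample above only contains Maastrichtian rows, here is a minimal sketch of the normalization on made-up data instead. It assumes the `Genus` and `InSituAge` columns from the post; the values in `toy` are hypothetical. The idea is to count specimens per age and genus, convert each count to a percentage of its own age group, and plot the percentages with `geom_col()`.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Hypothetical stand-in data with both ages present.
toy <- data.frame(
  Genus = c("Adocus", "Adocus", "Basilemys", "Compsemys",
            "Adocus", "Compsemys", NA),
  InSituAge = c("Maastrichtian", "Maastrichtian", "Maastrichtian",
                "Maastrichtian", "Danian", "Danian", "Danian")
)

pct <- toy %>%
  drop_na(Genus, InSituAge) %>%
  count(InSituAge, Genus) %>%            # specimens per age and genus
  group_by(InSituAge) %>%
  mutate(percent = 100 * n / sum(n)) %>% # share within each age group
  ungroup()

p <- ggplot(pct, aes(x = Genus, y = percent, fill = InSituAge)) +
  geom_col(position = "dodge", alpha = 0.5) +
  labs(title = "Maastrichtian Vs Danian Turtle Assemblage",
       x = "Genus",
       y = "Percent of assemblage") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))

print(p)
```

Within each age the percentages sum to 100, so the unequal sample sizes no longer dominate the chart. A genus that is missing from one age gets a single wide bar under `"dodge"`; `position_dodge(preserve = "single")` is one option for keeping bar widths equal.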