Help With Code for BoxPlots

Hello, I was wondering if someone might be able to help me with some coding issues. I know how to make a generic boxplot from the appropriate data, but is there a way to distinguish between categorical variables within another categorical variable on boxplots? For example, my data set has Trial#, Site, and Treatment_Salinity as its categorical variables. I want to create boxplots comparing Time Trial data for each site (two sites), trial (three trials), and salinity treatment (7 per site per trial). Right now I can get the boxplots to compare only the time trial differences between salinity treatments, but I need to compare across all three variables, essentially stacking them within one another (I am sorry if this is confusing).

Another issue: right now I have a "." wherever no numerical data is available. Should I change that to NA so that R excludes it, or should I leave the Excel cell empty? 0 is a real value in my time trial data, so I can't use 0 as a placeholder.

With the boxplots, is it possible to exclude anything that has 3 or fewer replicates? I realize this is a lot of questions, but I am really struggling with this. Any and all help is appreciated.

Thank you!

I suggest you make a new data column that encodes the Site, Treatment and Trial information and use that as your x axis.
You can filter out combinations with 3 or fewer points using the group_by(), summarize() and filter() functions from the dplyr package.
I would leave the cells without data blank in Excel. They will then be imported as NA, I think.
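If the "." placeholders are already in the file, one option is to tell read.csv() which strings mean "missing" when importing. This is only a sketch: the tiny CSV and its column names below are invented for illustration, and it assumes the Excel sheet has been exported to CSV.

```r
# Hypothetical miniature of the exported CSV; the column names are made up.
csv_text <- "Site,Trial,Time_sec
A,1,12.5
A,1,.
A,2,
B,1,0
B,2,NA"

# With the defaults, the "." stops Time_sec from being read as a number.
raw <- read.csv(text = csv_text)
is.numeric(raw$Time_sec)

# Treat ".", empty cells, and the literal NA all as missing values.
clean <- read.csv(text = csv_text, na.strings = c(".", "", "NA"))
is.numeric(clean$Time_sec)
sum(is.na(clean$Time_sec))  # number of missing entries; the real 0 is kept as 0
```

With na.strings set this way, either convention (blank cell or ".") ends up as NA, and genuine zeros stay untouched.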

I invented a toy data set to illustrate the method. The Treatment and Trial labels are just numbers, but any text will work.

DF <- data.frame(Site = rep(c("A","B"), each = 20),
                 Trial = rep(rep(1:2, each=5),4),
                 Treatment = rep(rep(1:2, each = 10),2),
                 value = rnorm(40))
head(DF,11)
#>    Site Trial Treatment      value
#> 1     A     1         1  0.2348408
#> 2     A     1         1  1.9123235
#> 3     A     1         1  0.7134633
#> 4     A     1         1 -1.3733097
#> 5     A     1         1  1.3975508
#> 6     A     2         1  0.1336876
#> 7     A     2         1  0.8763127
#> 8     A     2         1  0.5592192
#> 9     A     2         1 -0.6859869
#> 10    A     2         1 -0.3325136
#> 11    A     1         2  1.1437470
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(ggplot2)
DF <- DF |> mutate(Grp = paste(Site, Treatment, Trial, sep = "_"))
head(DF,11)
#>    Site Trial Treatment      value   Grp
#> 1     A     1         1  0.2348408 A_1_1
#> 2     A     1         1  1.9123235 A_1_1
#> 3     A     1         1  0.7134633 A_1_1
#> 4     A     1         1 -1.3733097 A_1_1
#> 5     A     1         1  1.3975508 A_1_1
#> 6     A     2         1  0.1336876 A_1_2
#> 7     A     2         1  0.8763127 A_1_2
#> 8     A     2         1  0.5592192 A_1_2
#> 9     A     2         1 -0.6859869 A_1_2
#> 10    A     2         1 -0.3325136 A_1_2
#> 11    A     1         2  1.1437470 A_2_1

Counts <- DF |> group_by(Grp) |> 
  filter(!is.na(value)) |> 
  summarize(N = n()) |> 
  filter(N > 3)
Counts
#> # A tibble: 8 × 2
#>   Grp       N
#>   <chr> <int>
#> 1 A_1_1     5
#> 2 A_1_2     5
#> 3 A_2_1     5
#> 4 A_2_2     5
#> 5 B_1_1     5
#> 6 B_1_2     5
#> 7 B_2_1     5
#> 8 B_2_2     5

DF |> semi_join(Counts, by = "Grp") |> #This will drop any Grp with 3 or fewer points
  ggplot(aes(Grp, value)) + geom_boxplot() +
  labs(x = "Site_Treatment_Trial")

Created on 2022-12-29 with reprex v2.0.2
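A different way to show one categorical variable inside another, not the combined-column method used above, is to keep Treatment on the x axis and let ggplot2 split the plot into panels by Site and Trial with facet_grid(). The sketch below rebuilds the same kind of toy data under a new name (DF2, with an arbitrary seed) so that it runs on its own.

```r
library(ggplot2)

# Toy data with the same layout as DF, rebuilt so this block is self-contained.
set.seed(1)
DF2 <- data.frame(Site = rep(c("A", "B"), each = 20),
                  Trial = rep(rep(1:2, each = 5), 4),
                  Treatment = rep(rep(1:2, each = 10), 2),
                  value = rnorm(40))

# Treatment on the x axis; one panel per Trial (rows) and Site (columns).
p <- ggplot(DF2, aes(factor(Treatment), value)) +
  geom_boxplot() +
  facet_grid(Trial ~ Site, labeller = label_both) +
  labs(x = "Treatment")
p
```

Whether panels or a single combined axis reads better probably depends on which comparison matters most; panels make it easier to compare treatments within one site and trial.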


Ok I will try this and see if it works. Thank you!


So I think I got it to work; I used scale_x_discrete or something similar to plot only specific groups. Is there a way to relabel the group bins on the chart without changing the group values themselves, similar to how you can control the x and y axis titles?

In FJCC's example, I can set the labels to be X_1 through X_8:

ggplot(data=DF,aes(Grp, value)) + geom_boxplot() +
  labs(x = "Site_Treatment_Trial") +
  scale_x_discrete(labels=paste0("X_",1:8))
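The labels above are matched to groups by position, so they can end up on the wrong boxes if a group is dropped or the order changes. A possible alternative is to pass a named vector, where each name is an existing Grp value and each element is the text to display. The data frame and display labels below are invented for illustration.

```r
library(ggplot2)

# Small stand-in for DF with only Grp and value (random values, arbitrary seed).
set.seed(1)
toy <- data.frame(Grp = rep(c("A_1_1", "A_1_2", "B_1_1"), each = 5),
                  value = rnorm(15))

# Names are the existing Grp values; the right-hand sides are made-up labels.
new_labels <- c(A_1_1 = "Site A, Trt 1, Trial 1",
                A_1_2 = "Site A, Trt 1, Trial 2",
                B_1_1 = "Site B, Trt 1, Trial 1")

p <- ggplot(toy, aes(Grp, value)) +
  geom_boxplot() +
  labs(x = "Site_Treatment_Trial") +
  scale_x_discrete(labels = new_labels)
p
```

The Grp column itself is unchanged; only the text drawn under each box differs.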


That definitely worked! Thank you both so much for all of your help!

Ok, so I have gotten both of those issues figured out, but now I am having a problem with my boxplots showing up correctly.
I am using the same code for each of the boxplots (I have separate time trials that I had to run). When I do the boxplots for the Initial and Exposure Time Trials, the boxplots look perfectly fine, but for some reason when I run the exact same code for the Post Time Trials it ends up looking like this mess.
Code:
library(ggplot2)

setwd("/Users/haley/documents") #set the working directory

Ch1_Raw_Data <- read.csv("GROUPED_MASTER_CH1.csv")

boxplot6 <- ggplot(Ch1_Raw_Data, aes(x=GROUP, y=Post_AVG)) +
  geom_boxplot(alpha = 0.80) + geom_boxplot(fill="gray") +
  stat_boxplot(geom ='errorbar', width=.25) +
  labs(title="E. depressus Post Time Trial AVG by Treatment for Trial 1",
       x="Treatment", y = "Time Trial (sec)") +
  theme_classic() + theme(plot.title = element_text(hjust = 0.5)) +
  geom_boxplot(lwd=0.5) + geom_boxplot(fatten=2) +
  stat_summary(geom = "errorbar", fun.min = mean, fun = mean, fun.max = mean,
               width = .75, linetype = "dashed") +
  scale_x_discrete(limits = c("AEDSQ0", "AEDSQ0_2", "AEDSQ0_5", "AEDSQ1", "AEDSQ3", "AEDSQ5", "AEDSQ10"),
                   labels = c("0 PPT", "0.2 PPT", "0.5 PPT", "1 PPT", "3 PPT", "5 PPT", "10 PPT"))

boxplot6

Here is an image of the other one as an example, and here is the code:

library(ggplot2)

setwd("/Users/haley/documents") #set the working directory

Ch1_Raw_Data <- read.csv("GROUPED_MASTER_CH1.csv")

boxplot6 <- ggplot(Ch1_Raw_Data, aes(x=GROUP, y=Exposure_AVG)) +
  geom_boxplot(alpha = 0.80) + geom_boxplot(fill="gray") +
  stat_boxplot(geom ='errorbar', width=.25) +
  labs(title="E. depressus Exposure Time Trial AVG by Treatment for Trial 1",
       x="Treatment", y = "Time Trial (sec)") +
  theme_classic() + theme(plot.title = element_text(hjust = 0.5)) +
  geom_boxplot(lwd=0.5) + geom_boxplot(fatten=2) +
  stat_summary(geom = "errorbar", fun.min = mean, fun = mean, fun.max = mean,
               width = .75, linetype = "dashed") +
  scale_x_discrete(limits = c("AEDSQ0", "AEDSQ0_2", "AEDSQ0_5", "AEDSQ1", "AEDSQ3", "AEDSQ5", "AEDSQ10"),
                   labels = c("0 PPT", "0.2 PPT", "0.5 PPT", "1 PPT", "3 PPT", "5 PPT", "10 PPT"))

boxplot6

Check your data types; Post_AVG is probably not numeric, as you would need it to be.
Compare it to your Initial and Exposure columns.
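A quick way to run that check is str(), followed by a conversion if the column turns out to be text. The sketch below uses an invented four-row stand-in for the real file and assumes, based on the "." placeholders mentioned earlier in the thread, that they are what keeps Post_AVG from being numeric; the real cause may differ.

```r
# Hypothetical stand-in for GROUPED_MASTER_CH1.csv; the values are invented.
csv_text <- "GROUP,Exposure_AVG,Post_AVG
AEDSQ0,10.25,11.5
AEDSQ0,9.75,.
AEDSQ1,12.5,13.25
AEDSQ1,11,."

dat <- read.csv(text = csv_text)
str(dat)  # compare the type reported for Post_AVG with Exposure_AVG

# One possible fix: convert the column; non-numeric entries such as "." become NA.
dat$Post_AVG <- suppressWarnings(as.numeric(as.character(dat$Post_AVG)))
is.numeric(dat$Post_AVG)
```

When the y variable is text, ggplot2 treats it as discrete and prints every distinct value as its own axis label, which would be consistent with a crowded y axis.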

The y-axis scale needs adjustment. It is hard to believe so many significant digits are needed, and there are many more labels than needed to show the approximate values of the data on the right.