Help With Code for BoxPlots

Hello, I was wondering if someone might be able to help me with some coding issues. I know how to make a generic boxplot using the appropriate data, but I was wondering if there is a way to distinguish between categorical variables within another categorical variable on boxplots. For example, my data set has Trial#, Site, and Treatment_Salinity as its categorical variables. I want to create boxplots comparing Time Trial data for each site (two different sites), trial # (three different trials), and salinity treatment (7 per site per trial). Right now I am able to get the boxplots to compare just the salinity time trial differences, but I need to be able to compare across all three variables, essentially stacking them into one another (I am sorry if this is confusing).

Another issue I am having is that right now I have a "." in the cells where no numerical data is available. Should I change that to NA so that R excludes it, or should I leave the Excel cell empty? Zeros are real values in my time trial data, so I can't use 0 as a placeholder.

With the boxplots, is it possible to exclude anything that has 3 or fewer replicates? I realize this is a lot of questions, but I am really struggling with this. Any and all help is appreciated.

Thank you!

I suggest you make a new data column that encodes the Site, Treatment and Trial information and use that as your x axis.
You can filter out combinations with 3 or fewer points using the group_by(), summarize() and filter() functions from the dplyr package.
I would leave the cells without data blank in Excel. They will then be imported as NA, I think.
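
To make the import behaviour concrete, here is a minimal sketch, assuming the sheet is saved as CSV and read with read.csv(). The column names Treatment and Time are made up for illustration. A "." placeholder stops the column from being read as numeric, while blank cells in an otherwise numeric column come in as NA. If the "." entries are already in the file, the na.strings argument can turn them into NA at import.

```r
# Hypothetical CSV contents: one "." placeholder and one blank cell
csv_text <- "Treatment,Time
T1,12.5
T1,.
T2,0
T2,
"

# Default import: the "." prevents the column from being read as numeric
raw <- read.csv(text = csv_text)
is.numeric(raw$Time)

# Treat ".", blank cells, and "NA" as missing values
clean <- read.csv(text = csv_text, na.strings = c("", ".", "NA"))
is.numeric(clean$Time)
sum(is.na(clean$Time))   # number of missing values; the real 0 is kept
```

Either approach (blank cells or na.strings) keeps the genuine 0 values separate from the missing ones.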

I invented a toy data set to illustrate the method. The Treatment and Trial labels are just numbers, but any text will work.

DF <- data.frame(Site = rep(c("A","B"), each = 20),
                 Trial = rep(rep(1:2, each=5),4),
                 Treatment = rep(rep(1:2, each = 10),2),
                 value = rnorm(40))
head(DF,11)
#>    Site Trial Treatment      value
#> 1     A     1         1  0.2348408
#> 2     A     1         1  1.9123235
#> 3     A     1         1  0.7134633
#> 4     A     1         1 -1.3733097
#> 5     A     1         1  1.3975508
#> 6     A     2         1  0.1336876
#> 7     A     2         1  0.8763127
#> 8     A     2         1  0.5592192
#> 9     A     2         1 -0.6859869
#> 10    A     2         1 -0.3325136
#> 11    A     1         2  1.1437470
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(ggplot2)
DF <- DF |> mutate(Grp = paste(Site, Treatment, Trial, sep = "_"))
head(DF,11)
#>    Site Trial Treatment      value   Grp
#> 1     A     1         1  0.2348408 A_1_1
#> 2     A     1         1  1.9123235 A_1_1
#> 3     A     1         1  0.7134633 A_1_1
#> 4     A     1         1 -1.3733097 A_1_1
#> 5     A     1         1  1.3975508 A_1_1
#> 6     A     2         1  0.1336876 A_1_2
#> 7     A     2         1  0.8763127 A_1_2
#> 8     A     2         1  0.5592192 A_1_2
#> 9     A     2         1 -0.6859869 A_1_2
#> 10    A     2         1 -0.3325136 A_1_2
#> 11    A     1         2  1.1437470 A_2_1

Counts <- DF |> group_by(Grp) |> 
  filter(!is.na(value)) |> 
  summarize(N = n()) |> 
  filter(N > 3)
Counts
#> # A tibble: 8 × 2
#>   Grp       N
#>   <chr> <int>
#> 1 A_1_1     5
#> 2 A_1_2     5
#> 3 A_2_1     5
#> 4 A_2_2     5
#> 5 B_1_1     5
#> 6 B_1_2     5
#> 7 B_2_1     5
#> 8 B_2_2     5

DF |> semi_join(Counts, by = "Grp") |> #This will drop any Grp with 3 or fewer points
ggplot(aes(Grp, value)) + geom_boxplot() +
  labs(x = "Site_Treatment_Trial")

Created on 2022-12-29 with reprex v2.0.2


Ok I will try this and see if it works. Thank you!


So I think I got it to work and I used scale_x_discrete or something similar to just plot specific groups. Is there a way to relabel the group bins without changing the group label itself (on the chart I mean)? Similar to how you can control the x-title and y-title?

In FJCC's example, I can set the labels to be X_1 through X_8:

ggplot(data=DF,aes(Grp, value)) + geom_boxplot() +
  labs(x = "Site_Treatment_Trial") +
  scale_x_discrete(labels=paste0("X_",1:8))
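
An unnamed labels vector is matched to the groups by position, which can get out of step if some groups are dropped or reordered. As a hedged alternative, here is a sketch that rebuilds the toy DF from above and passes a named vector, so each label is tied to a specific Grp value. The names grp_levels and grp_labels, and the label wording, are just for illustration.

```r
library(ggplot2)

# Toy data in the same shape as the example above (values are random)
DF <- data.frame(Site = rep(c("A", "B"), each = 20),
                 Trial = rep(rep(1:2, each = 5), 4),
                 Treatment = rep(rep(1:2, each = 10), 2),
                 value = rnorm(40))
DF$Grp <- paste(DF$Site, DF$Treatment, DF$Trial, sep = "_")

# Named vector: names are the existing Grp values, values are the labels
# shown on the axis. The underlying Grp column is not changed.
grp_levels <- sort(unique(DF$Grp))
grp_labels <- setNames(paste0("X_", seq_along(grp_levels)), grp_levels)

p <- ggplot(DF, aes(Grp, value)) +
  geom_boxplot() +
  labs(x = "Site_Treatment_Trial") +
  scale_x_discrete(labels = grp_labels)
p
```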


That definitely worked! Thank you both so much for all of your help!

Ok, so I have gotten both of those issues figured out, but now I am having a problem getting my boxplots to show up correctly.
I am using the same code for each of the boxplots (I had to run separate time trials). When I make the boxplots for the Initial and Exposure Time Trials they look perfectly fine, but for some reason when I run the exact same code for the Post Time Trials it ends up looking like this mess.
Code:
library(ggplot2)

setwd("/Users/haley/documents") #set the working directory

Ch1_Raw_Data <- read.csv("GROUPED_MASTER_CH1.csv")

boxplot6 <- ggplot(Ch1_Raw_Data, aes(x = GROUP, y = Post_AVG)) +
  geom_boxplot(alpha = 0.80) +
  geom_boxplot(fill = "gray") +
  stat_boxplot(geom = "errorbar", width = .25) +
  labs(title = "E. depressus Post Time Trial AVG by Treatment for Trial 1",
       x = "Treatment", y = "Time Trial (sec)") +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5)) +
  geom_boxplot(lwd = 0.5) +
  geom_boxplot(fatten = 2) +
  stat_summary(geom = "errorbar", fun.min = mean, fun = mean, fun.max = mean,
               width = .75, linetype = "dashed") +
  scale_x_discrete(limits = c("AEDSQ0", "AEDSQ0_2", "AEDSQ0_5", "AEDSQ1", "AEDSQ3", "AEDSQ5", "AEDSQ10"),
                   labels = c("0 PPT", "0.2 PPT", "0.5 PPT", "1 PPT", "3 PPT", "5 PPT", "10 PPT"))

boxplot6

Here is an image of the other one as an example, and here is the code:

library(ggplot2)

setwd("/Users/haley/documents") #set the working directory

Ch1_Raw_Data <- read.csv("GROUPED_MASTER_CH1.csv")

boxplot6 <- ggplot(Ch1_Raw_Data, aes(x = GROUP, y = Exposure_AVG)) +
  geom_boxplot(alpha = 0.80) +
  geom_boxplot(fill = "gray") +
  stat_boxplot(geom = "errorbar", width = .25) +
  labs(title = "E. depressus Exposure Time Trial AVG by Treatment for Trial 1",
       x = "Treatment", y = "Time Trial (sec)") +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5)) +
  geom_boxplot(lwd = 0.5) +
  geom_boxplot(fatten = 2) +
  stat_summary(geom = "errorbar", fun.min = mean, fun = mean, fun.max = mean,
               width = .75, linetype = "dashed") +
  scale_x_discrete(limits = c("AEDSQ0", "AEDSQ0_2", "AEDSQ0_5", "AEDSQ1", "AEDSQ3", "AEDSQ5", "AEDSQ10"),
                   labels = c("0 PPT", "0.2 PPT", "0.5 PPT", "1 PPT", "3 PPT", "5 PPT", "10 PPT"))

boxplot6

Check your data types; Post_AVG is probably not numeric, as you would need it to be.
Compare it to your initial and exposure time columns.
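
One way to run that check, sketched on a made-up data frame rather than the real GROUPED_MASTER_CH1.csv (the values and the "." entry below are assumptions for illustration): look at the class of each column, list any entries that fail to convert to a number, then convert the column.

```r
# Hypothetical stand-in for the imported data; Post_AVG has one
# non-numeric placeholder, so the whole column is character
Ch1 <- data.frame(GROUP = c("AEDSQ0", "AEDSQ0", "AEDSQ1", "AEDSQ1"),
                  Exposure_AVG = c(10.2, 11.8, 9.4, 12.1),
                  Post_AVG = c("10.5", ".", "0", "11.25"))

# Compare column types: Exposure_AVG is numeric, Post_AVG is not
sapply(Ch1, class)

# Find entries that are present but cannot be converted to a number
converted <- suppressWarnings(as.numeric(Ch1$Post_AVG))
bad_entries <- unique(Ch1$Post_AVG[is.na(converted) & !is.na(Ch1$Post_AVG)])
bad_entries

# Replace the column with its numeric version; unconvertible entries become NA
Ch1$Post_AVG <- converted
is.numeric(Ch1$Post_AVG)
```

With a numeric y column, geom_boxplot() draws a continuous axis instead of treating every distinct value as its own category.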

The y-axis scale needs adjustment. It is hard to believe so many significant digits are needed, and there are many more than are needed to show the approximate values of the data on the right.
