Hello, I was wondering if someone might be able to help me with some coding issues. I know how to make a basic boxplot, but is there a way to distinguish between the levels of one categorical variable nested within another on a boxplot? For example, my data set has three categorical variables: Trial#, Site, and Treatment_Salinity. I want to create boxplots comparing the time trial data for each site (two sites), trial (three trials), and salinity treatment (7 per site per trial). Right now I can get boxplots that compare the time trial data across salinity treatments only, but I need to compare across all three variables, essentially nesting them within one another (I am sorry if this is confusing).
Another issue I am having is that right now I have a "." in the cells where no numerical data are available. Should I change that to NA so that R excludes it, or should I leave the Excel cell empty? 0 is a real value in my time trial data, so I can't use 0 as a placeholder.
With the boxplots, is it possible to exclude any group that has 3 or fewer replicates? I realize this is a lot of questions but I am really struggling with this. Any and all help is appreciated.
I suggest you make a new data column that encodes the Site, Treatment and Trial information and use that as your x axis.
You can filter out combinations with 3 or fewer points using the group_by(), summarize() and filter() functions from the dplyr package.
I would leave the cells without data blank in Excel. They will then be imported as NA, I think.
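If it helps, here is a minimal sketch of that import step. It assumes the sheet has been saved as CSV, and the column names and values are made up for illustration. The `na.strings` argument of base R's `read.csv()` lists the strings to treat as missing, so both "." and blank cells become NA while real 0 values are kept. When reading the workbook directly, `read_excel()` from the readxl package has a similar `na` argument.

```r
# Hypothetical CSV text standing in for the exported spreadsheet
csv_text <- "Site,Trial,Treatment_Salinity,Time
A,1,5,12.5
A,1,10,.
A,1,15,
A,1,20,0"

# Treat both "." and empty cells as missing values
dat <- read.csv(text = csv_text, na.strings = c(".", "", "NA"))

str(dat$Time)         # numeric column, with NA where data were missing
sum(is.na(dat$Time))  # number of missing time values
```

Because the placeholders are converted on import, the column stays numeric and the 0 values are still treated as data.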
I invented a toy data set to illustrate the method. The Treatment and Trial labels are just numbers, but any text will work.
DF <- data.frame(Site = rep(c("A","B"), each = 20),
                 Trial = rep(rep(1:2, each=5),4),
                 Treatment = rep(rep(1:2, each = 10),2),
                 value = rnorm(40))
head(DF,11)
#>    Site Trial Treatment      value
#> 1     A     1         1  0.2348408
#> 2     A     1         1  1.9123235
#> 3     A     1         1  0.7134633
#> 4     A     1         1 -1.3733097
#> 5     A     1         1  1.3975508
#> 6     A     2         1  0.1336876
#> 7     A     2         1  0.8763127
#> 8     A     2         1  0.5592192
#> 9     A     2         1 -0.6859869
#> 10    A     2         1 -0.3325136
#> 11    A     1         2  1.1437470
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(ggplot2)
DF <- DF |> mutate(Grp = paste(Site, Treatment, Trial, sep = "_"))
head(DF,11)
#>    Site Trial Treatment      value   Grp
#> 1     A     1         1  0.2348408 A_1_1
#> 2     A     1         1  1.9123235 A_1_1
#> 3     A     1         1  0.7134633 A_1_1
#> 4     A     1         1 -1.3733097 A_1_1
#> 5     A     1         1  1.3975508 A_1_1
#> 6     A     2         1  0.1336876 A_1_2
#> 7     A     2         1  0.8763127 A_1_2
#> 8     A     2         1  0.5592192 A_1_2
#> 9     A     2         1 -0.6859869 A_1_2
#> 10    A     2         1 -0.3325136 A_1_2
#> 11    A     1         2  1.1437470 A_2_1
Counts <- DF |> group_by(Grp) |>
  filter(!is.na(value)) |>
  summarize(N = n()) |>
  filter(N > 3)
Counts
#> # A tibble: 8 × 2
#>   Grp       N
#>   <chr> <int>
#> 1 A_1_1     5
#> 2 A_1_2     5
#> 3 A_2_1     5
#> 4 A_2_2     5
#> 5 B_1_1     5
#> 6 B_1_2     5
#> 7 B_2_1     5
#> 8 B_2_2     5
DF |> semi_join(Counts, by = "Grp") |> # This will drop any Grp with 3 or fewer points
  ggplot(aes(Grp, value)) + geom_boxplot() +
  labs(x = "Site_Treatment_Trial")
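A possible alternative to one combined x-axis label, sketched here on the same kind of toy data, is to keep Treatment on the x axis and split Site and Trial into panels with `facet_grid()`. This is only a suggestion, not part of the method above, and whether it reads better with 7 treatments, 2 sites and 3 trials would need to be checked on the real data.

```r
library(ggplot2)

# Toy data with the same columns as the example above (values are random)
DF <- data.frame(Site = rep(c("A", "B"), each = 20),
                 Trial = rep(rep(1:2, each = 5), 4),
                 Treatment = rep(rep(1:2, each = 10), 2),
                 value = rnorm(40))

# Treatment on the x axis, one panel per Site (rows) and Trial (columns)
p <- ggplot(DF, aes(factor(Treatment), value)) +
  geom_boxplot() +
  facet_grid(Site ~ Trial, labeller = label_both) +
  labs(x = "Treatment", y = "value")
p
```

`label_both` prints the variable name together with its level in each panel strip (for example "Site: A"), so the three variables stay distinguishable without a combined label.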
So I think I got it to work and I used scale_x_discrete or something similar to just plot specific groups. Is there a way to relabel the group bins without changing the group label itself (on the chart I mean)? Similar to how you can control the x-title and y-title?
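One way to do this, as a sketch: the `labels` argument of `scale_x_discrete()` accepts a named character vector that maps the existing group values to the text shown on the axis, so the data themselves are not changed. The display labels below are hypothetical and only illustrate the mapping.

```r
library(ggplot2)

# Toy data; Grp holds Site_Treatment_Trial codes as in the example above
DF <- data.frame(Grp = rep(c("A_1_1", "A_1_2"), each = 5),
                 value = rnorm(10))

# Hypothetical display labels, keyed by the existing Grp values
new_labels <- c(A_1_1 = "Site A, Treatment 1, Trial 1",
                A_1_2 = "Site A, Treatment 1, Trial 2")

p <- ggplot(DF, aes(Grp, value)) +
  geom_boxplot() +
  scale_x_discrete(labels = new_labels) +  # changes the tick text only
  labs(x = "Site, treatment, trial")
p
```

The Grp column still contains the original codes, so any filtering or joining on Grp keeps working.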
Ok, so I have gotten both of those issues figured out, but now I am having a problem with my boxplots not showing up correctly.
I am using the same code for each of the boxplots (I have separate time trials that I had to run). When I do the boxplots for the Initial and Exposure Time Trials, they look perfectly fine, but for some reason when I run the exact same code for the Post Time Trials it ends up looking like this mess.
Code:
library(ggplot2)
setwd("/Users/haley/documents") #set the working directory
The y-axis scale needs adjustment. It is hard to believe that so many significant digits are needed, and there are many more than are needed to show the approximate values of the data on the right.
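One possible cause, offered as a guess because the data and plotting code are not shown: if the Post time column still contains "." placeholders, R reads the whole column as text, and ggplot2 then treats every distinct value as its own category on the y axis. A quick check and conversion, sketched with a made-up vector in place of the real column:

```r
# Hypothetical column as it might be read in when "." marks missing data
post_time <- c("12.531", "8.25", ".", "0", "15.0625")

class(post_time)  # "character", so a plot would treat it as discrete

# Turn the placeholder into NA first, then convert to numbers
post_time[post_time == "."] <- NA
post_time_num <- as.numeric(post_time)

class(post_time_num)  # "numeric"
summary(post_time_num)
```

If the real column turns out to be numeric already, this is not the cause and the axis would need to be adjusted some other way.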