Bug? Violin plot removes points

Hi!

I am running into an awkward problem when producing violin plots together with position_jitterdodge().
(Reproducible code illustrating the problem follows at the end of the post.)

If you compare the resulting figures and the annotated areas, you will see that several points are not drawn on the violin plot (probably more than I have found). This seems problematic to me, although it is a minor issue.

Do you see any fixes to this issue or is it a feature to reduce overplotting? I imagine it might be a bigger problem in some charts than in the example provided here.

Thanks for any comment on this =)

-Alex

library(tidyverse)

# Setting up example

id <- rep(1:30, 3)
type <- rep( 
    c(rep("sca", 8), rep("fca", 22)), 3)
data <- c( 
    rep("Gene 1", 30), 
    rep("Gene 2", 30), 
    rep("Gene 3", 30))

set.seed(123)
value1 <- c(rnorm(8, 0.48, sd=0.41), rnorm(22, 2.5, sd=2.))
value2 <- c(rnorm(8, 0.14, sd=0.14), rnorm(22, 2.6, sd=1.8))
value3 <- c(rnorm(8, 0.3, sd=0.2), rnorm(22, 1.39, sd=1.2))

tidydata2 <- tibble(id=id, type=type, data=data, value=c(value1, value2, value3))

pl <- ggplot(tidydata2, aes(x=data, y=value, color=type), alpha=1)

###################
# boxplot m/points + y_log10
###################
pl+ geom_boxplot()+
    geom_point(position = position_jitterdodge(jitter.width = 0.18, jitter.height = 0, seed = 1234))+
    ylab("Relative gene expression")+
    xlab("")+
    scale_y_log10()+
    theme_minimal()+
    annotate("rect", xmin=0.7, xmax=1, ymin=0.18, ymax=0.28, alpha=0.2)+
    annotate("rect", xmin=2, xmax=2.3, ymin=0.085, ymax=0.12, alpha=0.2)
ggsave("boxplot.png")

###################
# Violin m/points + y_log10
###################
pl+ geom_violin(scale="width")+
    geom_point(position = position_jitterdodge(jitter.width = 0.18, jitter.height = 0, seed = 1234), alpha=0.9)+
    ylab("Relative gene expression")+
    xlab("")+
    scale_y_log10()+
    theme_minimal()+
    annotate("rect", xmin=0.7, xmax=1, ymin=0.18, ymax=0.28, alpha=0.2)+
    annotate("rect", xmin=2, xmax=2.3, ymin=0.085, ymax=0.12, alpha=0.2)
ggsave("violinplot.png")

I'm not certain, but it could be that the violin plot is being drawn over the top of the points, and the lines of the violin outline are obscuring the "missing" points.

Does the problem persist if you make the points larger (i.e. set size = [something large] inside geom_point())?

I agree. I'm not 100% sure, as I had to make a few changes to your code to make it reproducible, @aleeie (there's no tidydata2 in the snippet you shared), but I think you're plotting over some of those points. Also note that you have a different jitter.width set for the violin plot and the boxplot.

library(tidyverse)
id <- rep(1:30, 3)
type <- rep( 
  c(rep("sca", 8), rep("fca", 22)), 3)
data <- c( 
  rep("Gene 1", 30), 
  rep("Gene 2", 30), 
  rep("Gene 3", 30))

set.seed(123)
value1 <- c(rnorm(8, 0.48, sd=0.41), rnorm(22, 2.5, sd=2.))
value2 <- c(rnorm(8, 0.14, sd=0.14), rnorm(22, 2.6, sd=1.8))
value3 <- c(rnorm(8, 0.3, sd=0.2), rnorm(22, 1.39, sd=1.2))

tidydata2 <- tibble(data, id, type, value = c(value1, value2, value3))

pl <- ggplot(tidydata2, aes(x=data, y=value, color=type), alpha=1)

###################
# boxplot m/points + y_log10
###################
pl+ geom_boxplot()+
  geom_point(position = position_jitterdodge(jitter.width = 0.13, jitter.height = 0, seed = 1234))+
  ylab("Relative gene expression")+ xlab("")+
  scale_y_log10()+
  theme_minimal()+
  annotate("rect", xmin=0.7, xmax=1, ymin=0.18, ymax=0.28, alpha=0.2)+
  annotate("rect", xmin=2, xmax=2.3, ymin=0.085, ymax=0.12, alpha=0.2)
#> Warning in self$trans$transform(x): NaNs produced
#> Warning: Transformation introduced infinite values in continuous y-axis
#> Warning in self$trans$transform(x): NaNs produced
#> Warning: Transformation introduced infinite values in continuous y-axis
#> Warning: Removed 6 rows containing non-finite values (stat_boxplot).
#> Warning: Removed 6 rows containing missing values (geom_point).


###################
# Violin m/points + y_log10
###################
pl+ geom_violin(scale="width")+
  geom_point(position = position_jitterdodge(jitter.width = 0.18, jitter.height = 0, seed = 1234), alpha=0.9)+
  ylab("Relative gene expression")+ xlab("")+
  scale_y_log10()+
  theme_minimal()+
  annotate("rect", xmin=0.7, xmax=1, ymin=0.18, ymax=0.28, alpha=0.2)+
  annotate("rect", xmin=2, xmax=2.3, ymin=0.085, ymax=0.12, alpha=0.2)
#> Warning in self$trans$transform(x): NaNs produced
#> Warning: Transformation introduced infinite values in continuous y-axis
#> Warning in self$trans$transform(x): NaNs produced
#> Warning: Transformation introduced infinite values in continuous y-axis
#> Warning: Removed 6 rows containing non-finite values (stat_ydensity).
#> Warning: Removed 6 rows containing missing values (geom_point).

Created on 2018-09-17 by the reprex package (v0.2.1.9000)

If anything, it looks like the multiple points in the boxplot are off:

pl+
  geom_point(position = position_jitterdodge(jitter.width = 0.13, jitter.height = 0, seed = 1234))+
  ylab("Relative gene expression")+ xlab("")+
  scale_y_log10()+
  theme_minimal()+
  annotate("rect", xmin=0.7, xmax=1, ymin=0.18, ymax=0.28, alpha=0.2)+
  annotate("rect", xmin=2, xmax=2.3, ymin=0.085, ymax=0.12, alpha=0.2)
#> Warning in self$trans$transform(x): NaNs produced
#> Warning: Transformation introduced infinite values in continuous y-axis
#> Warning: Removed 6 rows containing missing values (geom_point).


Is it possible that the "extra" points are outliers that a boxplot would normally display, and the fact that they're not visible in the violin plot is not because they're missing, but because the violin plot doesn't draw outliers as a boxplot does?

Here I'm comparing it to the geom_point() plot without another geometry (bottom).

I think the outlier explanation is right. geom_boxplot() plots the outliers itself, without jitter. Then geom_point() plots them again, with jitter, so you appear to have pairs of points in the boxplot but not in the violin plot.

Messing about with the outlier.colour argument of geom_boxplot() and including/excluding geom_boxplot()/geom_violin() points the same way. I'm not sure I'm getting the right number of points when I count them on the plot, though (that might be my eyes or a bit of overplotting).
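Rather than counting points by eye, the outliers that the boxplot layer draws can be read from the built plot. The following is a minimal sketch, not the code from this thread: the data frame `df` and its forced outlier are made up for illustration, and it relies on `layer_data()` exposing the `outliers` list column computed by `stat_boxplot()`.

```r
library(ggplot2)

# Hypothetical toy data: two groups, with one value forced far outside
# the whiskers so the boxplot is guaranteed to flag an outlier.
set.seed(123)
df <- data.frame(
  group = rep(c("a", "b"), each = 30),
  value = c(rnorm(30), rnorm(30, mean = 2))
)
df$value[1] <- 10

p <- ggplot(df, aes(x = group, y = value)) +
  geom_boxplot() +
  geom_point(position = position_jitter(width = 0.1, height = 0, seed = 1))

# Layer 1 is the boxplot; `outliers` holds one vector per box.
box <- layer_data(p, 1)
n_outliers <- sum(lengths(box$outliers))

# geom_point() draws every row once; geom_boxplot() draws the outliers again.
n_glyphs <- nrow(df) + n_outliers
n_outliers
n_glyphs
```

If `n_outliers` is greater than zero, the boxplot version of the figure shows that many extra, unjittered points compared with the violin version.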

Thanks for replies, @jim89 and @mara!

I fixed the post according to your comments, but the problem still persisted.

  • Setting size=2 in geom_point() doesn't affect the missing points.
  • The points that are removed (ref. mara's comment) are removed in both plots, so that shouldn't explain why they are missing only in the violin plot.
  • Also, I am using the seed argument of position_jitterdodge(), so the points shouldn't be overplotted, or plotted in a different spot, in either of the plots?
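The claim about the seed in the last bullet can be checked directly by comparing the jittered x positions of the point layer in two plots. This is a sketch under assumptions: the toy data frame `df` and the helper `pos()` are hypothetical, and the comparison should come out TRUE as long as both plots use the same data, dodge grouping and seed.

```r
library(ggplot2)

# Hypothetical toy data with two genes and two types per gene.
set.seed(1)
df <- data.frame(
  gene  = rep(c("Gene 1", "Gene 2"), each = 20),
  type  = rep(c("sca", "fca"), times = 20),
  value = exp(rnorm(40))  # strictly positive values
)

# Helper (illustrative name) so both plots get an identical position spec.
pos <- function() {
  position_jitterdodge(jitter.width = 0.18, jitter.height = 0, seed = 1234)
}

base <- ggplot(df, aes(x = gene, y = value, colour = type))
p_points <- base + geom_point(position = pos())
p_violin <- base + geom_violin(scale = "width") + geom_point(position = pos())

# The point layer is layer 1 in the first plot and layer 2 in the second.
x_points <- layer_data(p_points, 1)$x
x_violin <- layer_data(p_violin, 2)$x

same_positions <- isTRUE(all.equal(as.numeric(x_points), as.numeric(x_violin)))
same_positions
```

If the positions match, any point that appears in only one of the two figures must come from a different layer, not from the jitter.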

UPDATE:
I think I have identified the issue, and you are correct that the boxplot was the cause.
I needed to remove the boxplot outliers: geom_boxplot() was plotting the outlier points separately, so they were drawn twice when combined with geom_point(). geom_boxplot(outlier.shape=NA) fixed the issue!

Functional script as follows:


library(tidyverse)

# Setting up example

id <- rep(1:30, 3)
type <- rep( 
    c(rep("sca", 8), rep("fca", 22)), 3)
data <- c( 
    rep("Gene 1", 30), 
    rep("Gene 2", 30), 
    rep("Gene 3", 30))

set.seed(123)
value1 <- c(rnorm(8, 0.48, sd=0.41), rnorm(22, 2.5, sd=2.))
value2 <- c(rnorm(8, 0.14, sd=0.14), rnorm(22, 2.6, sd=1.8))
value3 <- c(rnorm(8, 0.3, sd=0.2), rnorm(22, 1.39, sd=1.2))

tidydata2 <- tibble(id=id, type=type, data=data, value=c(value1, value2, value3))

pl <- ggplot(tidydata2, aes(x=data, y=value, color=type), alpha=1)

# boxplot m/points + y_log10
pl+ geom_boxplot(outlier.shape=NA)+
    geom_point(position = position_jitterdodge(jitter.width = 0.18, jitter.height = 0, seed = 1234))+
    ylab("Relative gene expression")+
    xlab("")+
    scale_y_log10()+
    theme_minimal()+
    annotate("rect", xmin=0.7, xmax=1, ymin=0.18, ymax=0.28, alpha=0.2)+
    annotate("rect", xmin=2, xmax=2.3, ymin=0.085, ymax=0.12, alpha=0.2)
ggsave("results\\figures\\boxplot.png")

# Violin m/points + y_log10
pl+ geom_violin(scale="width")+
    geom_point(position = position_jitterdodge(jitter.width = 0.18, jitter.height = 0, seed = 1234))+
    ylab("Relative gene expression")+
    xlab("")+
    scale_y_log10()+
    theme_minimal()+
    annotate("rect", xmin=0.7, xmax=1, ymin=0.18, ymax=0.28, alpha=0.2)+
    annotate("rect", xmin=2, xmax=2.3, ymin=0.085, ymax=0.12, alpha=0.2)
ggsave("results\\figures\\violinplot.png")

If your question's been answered (even if by you), would you mind choosing a solution? (See FAQ below for how).

Having questions checked as resolved makes it a bit easier to navigate the site visually and see which threads still need help.

Thanks