Histograms drawn with graphics or ggplot2 omit some bins, depending on the data and the size of the graphics device

I find that histograms drawn with graphics or ggplot2 sometimes omit columns that ought to be shown.

Is this a bug in the graphics device?
Is there a way to work around it?

I found that the problem is reproducible when

  • bins with zero counts lie between bins with non-zero counts, and
  • the graphics device is not large enough.

set.seed(1)
x <- rpois(1e4, 1000) * 2
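To make the two conditions concrete, here is a rough sketch of the arithmetic for the 300-pixel case. It assumes, for simplicity, that the bins are spread over the full device width (the real plotting panel is narrower), and `device_width_px`, `n_bins` and `px_per_bin` are names made up for this illustration. Whether sub-pixel bars are really what makes a device drop columns is an assumption, not something established here.

```r
set.seed(1)
x <- rpois(1e4, 1000) * 2

# Every value is even, so with binwidth = 1 every odd-valued bin is empty.
all_even <- all(x %% 2 == 0)

# Assumption: bins are spread over the whole device width.
device_width_px <- 300
n_bins <- diff(range(x)) + 1          # unit-width bins spanning the data
px_per_bin <- device_width_px / n_bins

all_even
px_per_bin
```

If `px_per_bin` comes out below 1, each bar is narrower than a pixel, and the device has to decide which pixel columns to paint.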

By ggplot2::geom_histogram

library(ggplot2)
ggplot(data.frame(x = x), aes(x)) +
  geom_histogram(binwidth = 1)

300 x 300 pixels

image

600 x 300 pixels

image

By graphics::hist

hist(x, breaks = 1700:2300, col = "black", border = "transparent")

500 x 300 pixels

image

1000 x 300 pixels

This should not occur with continuous data. And, as far as I understand, a histogram is not appropriate for discrete data.

For this type of data, people usually use a bar chart (using graphics::barplot or ggplot2::geom_bar).

My personal preference is to use a column diagram (I'm not sure whether this is a standard term), since the widths of the bars carry no meaning for me.

You can see the illustration below. Here, I've used a smaller value of lambda, only to make the plots less cluttered.

# for reproducibility
set.seed(seed = 29842)

# dummy data
u <- rpois(n = 1e+4,
           lambda = 10) * 2

# graphics::barplot
barplot(height = table(u))

# ggplot2::geom_bar
library(ggplot2)
#> Registered S3 methods overwritten by 'ggplot2':
#>   method         from 
#>   [.quosures     rlang
#>   c.quosures     rlang
#>   print.quosures rlang

# map the data frame column `x`, not the global vector `u`
ggplot2::ggplot(data = data.frame(x = u),
                mapping = aes(x = x)) +
  geom_bar()


# my personal preference
plot(x = table(u),
     type = "h")

Created on 2019-05-02 by the reprex package (v0.2.1)

You are right that a bar plot is more appropriate for discrete data. However, the problem may occur even with a bar plot.

In the case of ggplot2, this is because both geom_histogram and geom_bar use GeomBar to draw bins/bars.

set.seed(1)
x <- rpois(1e4, 1000) * 2
library(ggplot2)
ggplot(data.frame(x = x), aes(x)) +
  geom_histogram(binwidth = 1)

300 x 300 pixels

image
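The claim about GeomBar can be checked by inspecting the layers that the two functions return. This is a minimal sketch, assuming a ggplot2 version in which layers expose their geom as `$geom` (as in 3.1.1, the version used in this thread); `hist_geom` and `bar_geom` are names made up for this illustration.

```r
library(ggplot2)

# `$geom` is the ggproto object a layer uses for drawing
hist_geom <- geom_histogram()$geom
bar_geom  <- geom_bar()$geom

# Both should inherit from GeomBar if the two layers share a drawing routine
inherits(hist_geom, "GeomBar")
inherits(bar_geom, "GeomBar")
```

If both checks return TRUE, any rendering issue tied to how GeomBar draws rectangles would affect both kinds of plot.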

I'm not comfortable with ggplot2, and hence I'm not confident about what I'm going to say below.

I think that if you use a binwidth argument in geom_bar, it actually uses geom_histogram. I'm guessing this because of the following:

binwidth geom_bar() no longer has a binwidth argument - if you use it you'll get an warning telling to you use geom_histogram() instead.

I hope others with more expertise in ggplot2 will chime in and confirm/discard this claim.

If I use geom_bar() instead, and even if I use 300x300, I do not see this problem.

set.seed(1)

x <- rpois(1e4, 1000) * 2

library(ggplot2)

png(filename = "bar_300_300.png",
    width = 300,
    height = 300)
ggplot(data.frame(x = x), aes(x)) +
  geom_bar()
dev.off()

bar_300_300


Edit

Actually, I don't face your problem even with your code. See below:

set.seed(1)
x <- rpois(1e4, 1000) * 2
library(ggplot2)
png("histogram_300_300.png",300,300)
ggplot(data.frame(x = x), aes(x)) +
  geom_histogram(binwidth = 1)
dev.off()

histogram_300_300
(I don't like this plot because of the sudden end, and also because I think it's wrong, but my point is that I don't get those awkward gaps.)

Thank you for your information.

So the problem seems to be system dependent.

I made the plots in the original post by resizing them in RStudio.

Viewer -> Export -> Copy to Clipboard...
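For comparing results across systems, it may help to take the interactive resizing out of the picture and write the plot to a file of fixed size. The following is only a sketch of one way to do that with ggsave(): 3 x 3 inches at 100 dpi gives a 300 x 300 pixel PNG. The object name `p` and the file name are made up for this illustration, and this does not claim to avoid the missing bins.

```r
library(ggplot2)

set.seed(1)
x <- rpois(1e4, 1000) * 2

p <- ggplot(data.frame(x = x), aes(x)) +
  geom_histogram(binwidth = 1)

# Hypothetical output location; 3 in x 3 in at 100 dpi = 300 x 300 px
out_file <- file.path(tempdir(), "histogram_300_300.png")
ggsave(out_file, plot = p, width = 3, height = 3, dpi = 100)
```

Anyone running this can then compare the resulting file directly, regardless of IDE or viewer.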

I attach the sessionInfo() at the end of this reply.

May I ask yours too?

BTW, your code gives much stranger results in my environment...

By geom_bar

image

By geom_histogram

image

sessionInfo()

The following environment is created by a docker image, atusy/atusyverse, which is primarily based on rocker/verse.

R version 3.5.3 (2019-03-11)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 9 (stretch)

Matrix products: default
BLAS: /usr/lib/openblas-base/libblas.so.3
LAPACK: /usr/lib/libopenblasp-r0.2.19.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=C
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] ggplot2_3.1.1

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.1       withr_2.1.2      assertthat_0.2.1 crayon_1.3.4
 [5] dplyr_0.8.0.1    grid_3.5.3       plyr_1.8.4       R6_2.4.0
 [9] gtable_0.3.0     magrittr_1.5     scales_1.0.0     pillar_1.3.1
[13] rlang_0.3.4      lazyeval_0.2.2   rstudioapi_0.10  glue_1.3.1
[17] purrr_0.3.2      munsell_0.5.0    compiler_3.5.3   pkgconfig_2.0.2
[21] colorspace_1.4-1 tidyselect_0.2.5 tibble_2.1.1

Sure, here it is:

> sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 17763)

Matrix products: default

locale:
[1] LC_COLLATE=English_India.1252  LC_CTYPE=English_India.1252   
[3] LC_MONETARY=English_India.1252 LC_NUMERIC=C                  
[5] LC_TIME=English_India.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] ggplot2_3.1.1

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.1       withr_2.1.2      crayon_1.3.4     dplyr_0.8.0.1   
 [5] assertthat_0.2.1 grid_3.6.0       plyr_1.8.4       R6_2.4.0        
 [9] gtable_0.3.0     magrittr_1.5     scales_1.0.0     pillar_1.3.1    
[13] rlang_0.3.4      lazyeval_0.2.2   glue_1.3.1       purrr_0.3.2     
[17] munsell_0.5.0    compiler_3.6.0   pkgconfig_2.0.2  colorspace_1.4-1
[21] tidyselect_0.2.5 tibble_2.1.1 

The plots you've shown are really strange. I can only suggest that you try again in R itself, instead of RStudio, to check whether you get the same result.

But this is beyond my understanding and I can't help with this problem. Sorry, and good luck!

Just a small side note that I think is relevant for people testing these examples.

As of R 3.6, the random number generator will use a different default algorithm. As a result, functions like sample() will give different results with the same seed.

Thank you @Yarnabrina anyway for helping me.

I tried the code from your post #4 on R 3.6.0 on Manjaro Linux and still get the same results as in my post #5...

Hope someone will help.

@andresrcs: I guess the R version does not affect the examples here, because the change in sample() concerns sampling from a discrete uniform distribution, not the Poisson distribution. https://cran.r-project.org/doc/manuals/r-devel/NEWS.html
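This can be checked directly on R 3.6.0 or later, where RNGkind() has a sample.kind argument. The sketch below draws Poisson values under both settings with the same seed; `draws_rounding` and `draws_rejection` are names made up for this illustration.

```r
# Requires R >= 3.6.0 (the sample.kind argument does not exist before that)

# Pre-3.6 behaviour of sample(); R warns about the non-uniform sampler
suppressWarnings(RNGkind(sample.kind = "Rounding"))
set.seed(1)
draws_rounding <- rpois(1e4, 1000) * 2

# Default behaviour of sample() since 3.6.0
RNGkind(sample.kind = "Rejection")
set.seed(1)
draws_rejection <- rpois(1e4, 1000) * 2

identical(draws_rounding, draws_rejection)
```

If the two vectors are identical, the data in the examples above do not depend on which side of the 3.6.0 change one is on.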
