Histogram and density distribution in R by ggplot2

ggplot2

#1

Hello experts,

I have sales data with values from 1 to 3,000,000. Most points fall in the interval [1, 1800], so the distribution has a very long tail.

If I use the following code to create a histogram, the graph does not look good. Can anyone help? I guess it is caused by the x-axis values being too spread out. Could I create bins with different widths in the same graph?

If I plot a density distribution, ggplot2 does not seem to produce a usable graph either.

ggplot(hist_data) +
  geom_histogram(aes(x = sales), fill = "grey", color = "black")


#2

Hi!

You could use dplyr::filter to filter out the extreme x values. It's difficult for me to see what a good value to filter by would be from this graph, but because you say most are between 0 and 1800, you could try this:

# tidyverse attaches both dplyr and ggplot2
library(tidyverse)
hist_data %>%
  filter(sales <= 1800) %>%
  ggplot() +
  geom_histogram(aes(x = sales), fill = "grey", color = "black")

Let me know if that works!
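
If it is hard to read a cutoff off the graph, one option is to look at the quantiles directly. This is only a rough sketch in base R: the simulated `hist_data` below is a stand-in I made up (assuming a numeric `sales` column), and 1800 is just the candidate cutoff mentioned above.

```r
# Simulated stand-in for hist_data (assumption: a numeric `sales` column)
set.seed(1)
hist_data <- data.frame(
  sales = c(sample(1:1800, 1000, replace = TRUE),
            sample(1:3e6, 100, replace = TRUE))
)

# Share of rows at or below the candidate cutoff
share_below <- mean(hist_data$sales <= 1800)

# Upper quantiles hint at where the long tail starts
upper_q <- quantile(hist_data$sales, probs = c(0.5, 0.9, 0.95, 0.99))

print(share_below)
print(upper_q)
```

If `share_below` is close to 1, filtering at that value drops only a small part of the data; the quantiles show how quickly the values grow beyond it.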


#3

Thanks, Fran. However, what can I do if I still want to plot all the data in the histogram rather than filtering it? Yes, most of the points are below 1800, but there are also points between 1800 and 3,000,000 (although their density is low).


#4

No problem. You could also try adjusting the binwidth argument in geom_histogram():

ggplot(hist_data) +
  geom_histogram(aes(x = sales), fill = "grey", color = "black", binwidth = 5000)

Let me know if that works!


#5

You could also try adjusting the bins argument, which sets the number of bins rather than their width:

https://ggplot2.tidyverse.org/reference/geom_histogram.html
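
On the question of bins with different widths in one graph, here are two possible sketches. Both the simulated `hist_data` and the break points in `my_breaks` are made up for illustration, so treat them as assumptions to adapt. Option A passes unequal `breaks` to geom_histogram(); option B pre-bins the values with cut() and draws a plain bar chart, which is no longer a true histogram.

```r
library(ggplot2)

# Simulated stand-in for hist_data (assumption: a numeric `sales` column)
set.seed(1)
hist_data <- data.frame(
  sales = c(sample(1:1800, 1000, replace = TRUE),
            sample(1:3e6, 100, replace = TRUE))
)

# Hypothetical break points: narrow bins where the data are dense, wide bins in the tail
my_breaks <- c(0, 200, 400, 800, 1800, 1e4, 1e5, 1e6, 3e6)

# Option A: a true histogram with unequal bins; density on the y axis keeps
# bar areas comparable across bins of different width
p_breaks <- ggplot(hist_data, aes(x = sales)) +
  geom_histogram(aes(y = after_stat(density)), breaks = my_breaks,
                 fill = "grey", color = "black")

# Option B: label each row with its bin and draw one bar per bin,
# so every bin gets the same width on screen
hist_data$sales_bin <- cut(hist_data$sales, breaks = my_breaks, dig.lab = 7)
p_bars <- ggplot(hist_data, aes(x = sales_bin)) +
  geom_bar(fill = "grey", color = "black") +
  labs(x = "sales range", y = "count")
```

Printing `p_breaks` or `p_bars` draws the plot. With option A the x axis is still linear, so the narrow bins stay squeezed against the left edge; option B gives up the numeric axis in exchange for readable bars.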

Best,
fran


#6

Thanks, I will try it


#7

Another option is a log transformation of the x axis; in ggplot2 this is done with scale_x_log10().


#8

Here is a little inspiration:

library(tidyverse)

# Simulated data: 1000 values in 1..1800 plus 100 values spread over 1..3e6
set.seed(860867)
my_dat <- tibble(obs = c(sample(1:1800, 1000), sample(1:3e6, 100)))

# Default of 30 bins: nearly all observations land in the first bin
my_dat %>%
  ggplot(aes(x=obs)) +
  geom_histogram() +
  theme_bw()

# Very wide bins (binwidth = 1e6): only a handful of bars
my_dat %>%
  ggplot(aes(x=obs)) +
  geom_histogram(binwidth = 1e6) +
  theme_bw()

# Log10 x axis: bins have equal width on the log scale
my_dat %>%
  ggplot(aes(x=obs)) +
  geom_histogram() +
  scale_x_continuous(trans="log10") +
  theme_bw()
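
The same idea may help with the density plot from the original question. A minimal sketch, assuming the simulated `my_dat` from above (rebuilt here with data.frame() so the block runs on its own with only ggplot2 loaded):

```r
library(ggplot2)

# Same simulated data as above, rebuilt so this block is self-contained
set.seed(860867)
my_dat <- data.frame(obs = c(sample(1:1800, 1000), sample(1:3e6, 100)))

# The density is estimated on log10(obs), because the scale transformation
# is applied before the statistic is computed
p_density <- ggplot(my_dat, aes(x = obs)) +
  geom_density(fill = "grey", alpha = 0.5) +
  scale_x_log10() +
  theme_bw()
```

Printing `p_density` draws the plot. Note that the curve describes the distribution of log10(obs), not of obs itself, and that a log axis only works for strictly positive values.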