Role of Bins in stat_density2d

Hello colleagues,
I am currently trying to produce a two-dimensional map of Chicago crimes using a heat-map style plot, with longitude on the x axis and latitude on the y axis.
I am trying to understand the role of the bins argument in the stat_density2d function. In a regular one-dimensional histogram, a larger number of bins gives skinnier bars, each holding fewer values.
Now that I am working with a two-dimensional histogram, though, I am having difficulty grasping the concept. To investigate, I ran the same code with two different values for bins. Here is my original code:

library(ggplot2)
library(dplyr)  # for %>% and filter()

gg <- ggplot(data = crime_2003 %>%
               filter(Location.Description == "STREET"),
             aes(x = Longitude, y = Latitude)) +
  # geom_point(colour = "red") +
  # ..level.. is the computed contour level; ggplot2 >= 3.3.0 spells it after_stat(level)
  stat_density2d(aes(fill = ..level..), geom = "polygon", bins = 12) +
  scale_fill_gradient(low = "skyblue2", high = "firebrick1", name = "Distribution") +
  labs(title = "Distribution of Street Related Crimes in Chicago in 2003",
       subtitle = "Western Chicago seems to have majority of Street Related Crimes in 2003",
       caption = "Source: Chicago City")
gg

Here bins=12, and I obtain the plot labelled bins_12. Then I changed the number of bins to 5 and got the plot labelled bins_5. It seems that bins=12 gives fewer breaks. Can I kindly get some advice on how to interpret this difference? Thanks.

I don't know much about ggplot2 (in fact, next to nothing), but let me try to provide an interpretation of bins from another perspective. Also, I couldn't find the bins argument in the documentation.

When you use bins in a one-dimensional histogram, their purpose is to divide the entire range of the available data (an interval on \mathbb{R}) into parts. Then you find out what proportion of the observations falls in each part. So if you increase the number of bins, the bars get narrower (most probably that's what you meant by skinny) and each holds fewer points. The height of a bar on the density scale does not necessarily drop on average, though, because the proportion in each bin is divided by a bin width that has shrunk as well.
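To make the one-dimensional argument concrete, here is a small sketch in base R. The data are simulated (an assumption purely for illustration), and summarise_bins is a hypothetical helper name, not something from ggplot2.

```r
# Toy data: 1,000 draws from a standard normal (illustrative assumption)
set.seed(1)
x <- rnorm(1000)

# Hypothetical helper: bin the same data into n_bins equal-width bins
summarise_bins <- function(x, n_bins) {
  breaks <- seq(min(x), max(x), length.out = n_bins + 1)
  h <- hist(x, breaks = breaks, plot = FALSE)
  list(
    width        = diff(breaks)[1],               # width of one bar
    mean_count   = mean(h$counts),                # average points per bar
    total_prop   = sum(h$counts) / length(x),     # proportions sum to 1
    mean_density = mean(h$density),               # average bar height (density scale)
    density_area = sum(h$density * diff(h$breaks))  # total area under the bars
  )
}

coarse <- summarise_bins(x, 5)
fine   <- summarise_bins(x, 20)

# Compare bar width, points per bar, and average density side by side
rbind(coarse = unlist(coarse), fine = unlist(fine))
```

With more bins the width and the average count per bar both go down, while the average height on the density scale stays the same, since fewer points are divided by a narrower width.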

Now let's consider two dimensions. Here the range of the available data is a two-dimensional shape (a subset of \mathbb{R}^2). Generalising from the previous argument, here too you note the proportion of observations (in effect, the density) in each of the parts.

The more bins you use, the finer the parts are, and hence the more precisely you can infer. In the two pictures here, note that the one with 5 bins divides the region crudely, whereas the one with 12 bins makes finer partitions (regions that share one shade of red or blue with 5 bins are split into different shades with 12), so it becomes possible to distinguish between neighbouring regions.
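A rough way to try the two-dimensional idea outside ggplot2 is to estimate a smooth density with MASS::kde2d and cut it at evenly spaced levels with contourLines. The coordinates below are simulated stand-ins for longitude and latitude, and the evenly spaced levels are an assumption about how the levels might be chosen, not ggplot2's exact rule. The helper count_drawn_levels is a hypothetical name.

```r
library(MASS)  # kde2d(); MASS ships with standard R installations

set.seed(42)
# Hypothetical stand-in for longitude/latitude: two clusters of points
x <- c(rnorm(300, -87.70, 0.03), rnorm(200, -87.62, 0.02))
y <- c(rnorm(300,  41.88, 0.03), rnorm(200,  41.78, 0.02))

# Smooth 2D density estimate on a 100 x 100 grid
dens <- kde2d(x, y, n = 100)

# Hypothetical helper: cut the density surface at n evenly spaced
# levels and count how many of them actually produce a contour
count_drawn_levels <- function(dens, n) {
  lv <- seq(min(dens$z), max(dens$z), length.out = n)
  lines <- contourLines(dens$x, dens$y, dens$z, levels = lv)
  length(unique(vapply(lines, function(l) l$level, numeric(1))))
}

few  <- count_drawn_levels(dens, 5)
many <- count_drawn_levels(dens, 12)
c(few = few, many = many)
```

Asking for more levels slices the same density surface into thinner bands, which is the finer partition of shades described above.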

I'm not sure whether this makes any sense at all.

In general, I think that the parameters for stat_density2d() are incompletely documented (there is an issue and pull request open in ggplot2 to fix the documentation).

Are you sure that more bins = fewer lines? I get the following with the mpg dataset:

library(ggplot2)

ggplot(mpg, aes(hwy, displ)) +
  geom_point() +
  stat_density2d(bins = 20)


ggplot(mpg, aes(hwy, displ)) +
  geom_point() +
  stat_density2d(bins = 10)


ggplot(mpg, aes(hwy, displ)) +
  geom_point() +
  stat_density2d(bins = 5)

Created on 2019-03-08 by the reprex package (v0.2.1)


Thanks for posting the GitHub link. I am not able to understand what is meant by more bins = fewer lines. In the example I provided, there is no mention of lines anywhere. Can you kindly clarify your point?
And I feel that the argument provided by the sustainer makes sense logically: more bins implies a higher degree of granularity.

Thanks for the healthy discussion.

You are right! I thought you were implying the opposite in your original post. More bins means a higher degree of granularity (there should be [number of bins] - 1 contour lines on the plot).
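One way to check this for yourself, rather than counting lines by eye, is to pull the computed layer data out of the plot and count the distinct contour levels. This is a sketch that assumes the computed data has a level column, which it does in the ggplot2 versions I know. The helper n_levels is a hypothetical name, and the exact counts may vary by version.

```r
library(ggplot2)

# Hypothetical helper: count the distinct contour levels drawn for a given `bins`
n_levels <- function(bins) {
  p <- ggplot(mpg, aes(hwy, displ)) +
    stat_density2d(bins = bins)
  length(unique(layer_data(p)$level))
}

# Same three settings as in the plots earlier in the thread
counts <- sapply(c(5, 10, 20), n_levels)
counts
```

The counts should rise with bins, which matches the plots above.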
