Summing specific ranges in a dataframe

I am looking for a solution to sum values of specific ranges (e.g. 0.0-1.0, 1.1-2.0, 2.1-3.0, etc.) in a DF that then can be analyzed in ggplot.. I have the bins properly labeled in the viz the way I want them.

Code is below and resulting ggplot viz which is not correct and displayed sums individually:

WAR21_labels = WAR21 %>% 
  count(WAR)```

How can I sum specific ranges instead of just totals? I need to fix this issue in the code chunk above before executing any ggplot viz. My viz is close but has unique instances listed all over the place. 




```getwd()
library(ggplot2)
library(dplyr)
library(base)
library(dplyr)
library(tidyverse)
WAR <- read.csv("WAR.csv")
View(WAR)
## Only Pitchers displayed in a DF
pitchers <- filter(WAR, Type == "Pitcher")
View(pitchers)
##2020 pitchers only
pitchers20 <- filter(pitchers, year == 2020)
View(pitchers20)
##2021 pitchers only
pitchers21 <- filter(pitchers, year == 2021)
View(pitchers21)
##Only Hitters displayed in a DF
hitters <- filter(WAR, Type == "Hitter")
##2020 hitters only
hitters20 <- filter(hitters, year == 2020)
##2021 hitters only
hitters21 <- filter(hitters, year == 2021)
##2020 all WAR
WAR20 <- filter (WAR, year ==2020)
#2021 all WAR
WAR21 <- filter( WAR, year == 2021)

#2021 WAR histogram with custom bins and scales
ggplot(WAR21, aes(x=WAR))+
  geom_histogram(fill='steelblue', col='black', bins=12)+
  labs(title = "2021 MLB fWAR Distribution, No PA/IP Minimums")+
  scale_x_continuous(breaks = seq(-2.,8.5, by = 1.00))+
  ylab ("NA dont want a Y axis")+
  xlab("fWAR")+
  theme(axis.title.y = element_text(color="#993333", size=13, face="bold"))+
  theme(axis.title.x = element_text(color="#993333", size=13, face="bold"))+
  theme(plot.title = element_text(color="Dark Red", size=14, face="bold.italic"))+
  theme(axis.text.x = element_text(color = "dark red", size = 9, face ="bold"))

##editing this and testing below

#Summary counts of the datasets
WAR21_labels = WAR21 %>% 
  count(WAR)

ggplot(WAR21, aes(x=WAR))+
  geom_histogram(fill='steelblue', col='black', bins=10)+
  geom_text(data = WAR21_labels, aes(x = WAR, y = n, label = n), vjust =-0.8, size = 3) +
  labs(title = "2021 MLB fWAR Distribution, No PA/IP Minimums")+
  scale_x_continuous(breaks = seq(-2.0, 8.5, by = 1.0),
                     # updating bin labels (same length as breaks)
                     labels = c('-2 to -1', '-1 to 0', '0 to 1', '1 to 2', '2 to 3', '3 to 4', '4 to 5', '5 to 6', '6 to 7', '7 to 8', '8 to 9'))+ 
  ylab ("")+
  xlab("fWAR")+
  # updating to removes y-axis counts and ticks
  theme(axis.text.y = element_blank()) + 
  theme(axis.ticks.y = element_blank()) + 
  theme(axis.title.y = element_text(color="#993333", size=13, face="bold"))+
  theme(axis.title.x = element_text(color="#993333", size=13, face="bold"))+
  theme(plot.title = element_text(color="Dark Red", size=14, face="bold.italic"))+
  theme(axis.text.x = element_text(color = "dark red", size = 9, face ="bold"))

Hi @Dan0o0o,

What if you created a subset of the WAR21 dataframe to have just the two variables you need?

I'm not sure what your WAR21 dataframe looks like, but I would imagine based on your plot that you want a column called fWAR that has your categories in it (-2 to -1 all the way to 8 to 9).

Then another variable that is the count(WAR) part of your code? Again not sure what your WAR21 dataframe looks like so can't assume there.

So you would probably do something like this?

WAR21_subset <- WAR21 %>%
  mutate(fWAR = case_when(fWAR >= -2 | fWAR <= -1 ~ "-2 to -1",
                          fWAR >-1 | fWAR <= 0 ~ "-1 to 0",
                          etc...)) %>% 
  select(fWAR, WAR) %>% 
  group_by(fWAR) %>% 
  summarize(WAR = sum(WAR))

Than using a similar plot from there, but maybe just geombar as opposed to a histogram? Since you've specified your "bins" in the categorical variable fWAR of the new WAR21_subset?

Hope that at least gets the creative juices flowing!

thanks for this suggestion. I am going to stew on this. What would be the difference/upside of a geombar as opposed to a histogram? What is the difference between the two?

That's a great question! And I was just throwing things out there so geombar is actually geom_bar, and you can read about that here

But the difference between a histogram and a bar chart is a histogram takes a single numerical variable and plots its count across the bins. Whereas a bar plot takes both a numerical variable with a categorical variable. So in this case you have your categorical variable as the fWAR and your numerical variable of WAR or sum(WAR).

Looking at this further you would probably want to use geom_col() as this will plot the height of your columns based on the value (in this case the sum(WAR) counts.

If you set up your dataframe as I suggested in the previous response, try running something like this instead of the histogram.

# Example dataframe based on my last comment
WAR21_subset <- tibble(sum_of_WAR = c(10, 11, 9, 4, 15, 12, 24, 29, 42, 10, 13),
                       fWAR = c("-2 to -1", "-1 to 0", "0 to 1", "1 to 2", 
                                "2 to 3", "3 to 4", "4 to 5", "5 to 6",
                                "6 to 7", "7 to 8", "8 to 9"))

# A ggplot2 barplot
ggplot(WAR21_subset, aes(fWAR, sum_of_WAR)) +
  geom_col()

Good luck!

Two small comments that might be useful: you can save a lot of time using cut() instead of that very long case_when()! In that case, you prepare a vector of cut-points and directly use it:

library(tidyverse)

# Create fake data
WAR21 <- tibble(WAR = runif(100, min = 0, max = 10),
                player = sample(LETTERS, size = 100, replace = TRUE))


my_bins <- c(0,2,4,6,8,10)

WAR21 %>%
  mutate(WAR_binned = cut(WAR, my_bins)) %>%
  group_by(WAR_binned) %>%
  count() %>%
  ggplot() +
  geom_col(aes(x = WAR_binned, y = n))

Created on 2022-09-25 by the reprex package (v2.0.1)

Second remark, on the difference between geom_col(), geom_bar(), and histogram: I would say that here it's actually the same thing. The difference is that the histogram functions usually try to automatically compute the bins, but if you give them manually (as here), you can actually get the exact same result with these three functions.

# fully manual
WAR21 %>%
  mutate(WAR_binned = cut(WAR, my_bins)) %>%
  group_by(WAR_binned) %>%
  count() %>%
  ggplot() +
  geom_col(aes(x = WAR_binned, y = n))

# geom_bar automatically takes the sum
WAR21 %>%
  mutate(WAR_binned = cut(WAR, my_bins)) %>%
  ggplot() +
  geom_bar(aes(x = WAR_binned))

# geom_histogram automatically chooses breaks (here I force them) and automatically sums
ggplot(WAR21) +
  geom_histogram(aes(x = WAR), breaks = my_bins, color = "white")
1 Like

@AlexisW very cool! Thanks for sharing the cut function, as well as the added information regarding the differences and similarities between the barplot and histogram.