Summing specific ranges in a dataframe

Dan0o0o · September 20, 2022, 9:34pm

I am looking for a solution to sum values of specific ranges (e.g. 0.0-1.0, 1.1-2.0, 2.1-3.0, etc.) in a DF that then can be analyzed in ggplot.. I have the bins properly labeled in the viz the way I want them.

Code is below and resulting ggplot viz which is not correct and displayed sums individually:

WAR21_labels = WAR21 %>% 
  count(WAR)```

How can I sum specific ranges instead of just totals? I need to fix this issue in the code chunk above before executing any ggplot viz. My viz is close but has unique instances listed all over the place. 




```getwd()
library(ggplot2)
library(dplyr)
library(base)
library(dplyr)
library(tidyverse)
WAR <- read.csv("WAR.csv")
View(WAR)
## Only Pitchers displayed in a DF
pitchers <- filter(WAR, Type == "Pitcher")
View(pitchers)
##2020 pitchers only
pitchers20 <- filter(pitchers, year == 2020)
View(pitchers20)
##2021 pitchers only
pitchers21 <- filter(pitchers, year == 2021)
View(pitchers21)
##Only Hitters displayed in a DF
hitters <- filter(WAR, Type == "Hitter")
##2020 hitters only
hitters20 <- filter(hitters, year == 2020)
##2021 hitters only
hitters21 <- filter(hitters, year == 2021)
##2020 all WAR
WAR20 <- filter (WAR, year ==2020)
#2021 all WAR
WAR21 <- filter( WAR, year == 2021)

#2021 WAR histogram with custom bins and scales
ggplot(WAR21, aes(x=WAR))+
  geom_histogram(fill='steelblue', col='black', bins=12)+
  labs(title = "2021 MLB fWAR Distribution, No PA/IP Minimums")+
  scale_x_continuous(breaks = seq(-2.,8.5, by = 1.00))+
  ylab ("NA dont want a Y axis")+
  xlab("fWAR")+
  theme(axis.title.y = element_text(color="#993333", size=13, face="bold"))+
  theme(axis.title.x = element_text(color="#993333", size=13, face="bold"))+
  theme(plot.title = element_text(color="Dark Red", size=14, face="bold.italic"))+
  theme(axis.text.x = element_text(color = "dark red", size = 9, face ="bold"))

##editing this and testing below

#Summary counts of the datasets
WAR21_labels = WAR21 %>% 
  count(WAR)

ggplot(WAR21, aes(x=WAR))+
  geom_histogram(fill='steelblue', col='black', bins=10)+
  geom_text(data = WAR21_labels, aes(x = WAR, y = n, label = n), vjust =-0.8, size = 3) +
  labs(title = "2021 MLB fWAR Distribution, No PA/IP Minimums")+
  scale_x_continuous(breaks = seq(-2.0, 8.5, by = 1.0),
                     # updating bin labels (same length as breaks)
                     labels = c('-2 to -1', '-1 to 0', '0 to 1', '1 to 2', '2 to 3', '3 to 4', '4 to 5', '5 to 6', '6 to 7', '7 to 8', '8 to 9'))+ 
  ylab ("")+
  xlab("fWAR")+
  # updating to removes y-axis counts and ticks
  theme(axis.text.y = element_blank()) + 
  theme(axis.ticks.y = element_blank()) + 
  theme(axis.title.y = element_text(color="#993333", size=13, face="bold"))+
  theme(axis.title.x = element_text(color="#993333", size=13, face="bold"))+
  theme(plot.title = element_text(color="Dark Red", size=14, face="bold.italic"))+
  theme(axis.text.x = element_text(color = "dark red", size = 9, face ="bold"))

jonesey441 · September 20, 2022, 10:55pm

Hi @Dan0o0o,

What if you created a subset of the WAR21 dataframe to have just the two variables you need?

I'm not sure what your WAR21 dataframe looks like, but I would imagine based on your plot that you want a column called fWAR that has your categories in it (-2 to -1 all the way to 8 to 9).

Then another variable that is the count(WAR) part of your code? Again not sure what your WAR21 dataframe looks like so can't assume there.

So you would probably do something like this?

WAR21_subset <- WAR21 %>%
  mutate(fWAR = case_when(fWAR >= -2 | fWAR <= -1 ~ "-2 to -1",
                          fWAR >-1 | fWAR <= 0 ~ "-1 to 0",
                          etc...)) %>% 
  select(fWAR, WAR) %>% 
  group_by(fWAR) %>% 
  summarize(WAR = sum(WAR))

Than using a similar plot from there, but maybe just geombar as opposed to a histogram? Since you've specified your "bins" in the categorical variable fWAR of the new WAR21_subset?

Hope that at least gets the creative juices flowing!

Dan0o0o · September 20, 2022, 11:38pm

thanks for this suggestion. I am going to stew on this. What would be the difference/upside of a geombar as opposed to a histogram? What is the difference between the two?

jonesey441 · September 21, 2022, 12:15am

That's a great question! And I was just throwing things out there so geombar is actually geom_bar, and you can read about that here

But the difference between a histogram and a bar chart is a histogram takes a single numerical variable and plots its count across the bins. Whereas a bar plot takes both a numerical variable with a categorical variable. So in this case you have your categorical variable as the fWAR and your numerical variable of WAR or sum(WAR).

Looking at this further you would probably want to use geom_col() as this will plot the height of your columns based on the value (in this case the sum(WAR) counts.

If you set up your dataframe as I suggested in the previous response, try running something like this instead of the histogram.

# Example dataframe based on my last comment
WAR21_subset <- tibble(sum_of_WAR = c(10, 11, 9, 4, 15, 12, 24, 29, 42, 10, 13),
                       fWAR = c("-2 to -1", "-1 to 0", "0 to 1", "1 to 2", 
                                "2 to 3", "3 to 4", "4 to 5", "5 to 6",
                                "6 to 7", "7 to 8", "8 to 9"))

# A ggplot2 barplot
ggplot(WAR21_subset, aes(fWAR, sum_of_WAR)) +
  geom_col()

Good luck!

AlexisW · September 25, 2022, 8:55pm

Two small comments that might be useful: you can save a lot of time using cut() instead of that very long case_when()! In that case, you prepare a vector of cut-points and directly use it:

library(tidyverse)

# Create fake data
WAR21 <- tibble(WAR = runif(100, min = 0, max = 10),
                player = sample(LETTERS, size = 100, replace = TRUE))


my_bins <- c(0,2,4,6,8,10)

WAR21 %>%
  mutate(WAR_binned = cut(WAR, my_bins)) %>%
  group_by(WAR_binned) %>%
  count() %>%
  ggplot() +
  geom_col(aes(x = WAR_binned, y = n))

^{Created on 2022-09-25 by the reprex package (v2.0.1)}

Second remark, on the difference between geom_col(), geom_bar(), and histogram: I would say that here it's actually the same thing. The difference is that the histogram functions usually try to automatically compute the bins, but if you give them manually (as here), you can actually get the exact same result with these three functions.

# fully manual
WAR21 %>%
  mutate(WAR_binned = cut(WAR, my_bins)) %>%
  group_by(WAR_binned) %>%
  count() %>%
  ggplot() +
  geom_col(aes(x = WAR_binned, y = n))

# geom_bar automatically takes the sum
WAR21 %>%
  mutate(WAR_binned = cut(WAR, my_bins)) %>%
  ggplot() +
  geom_bar(aes(x = WAR_binned))

# geom_histogram automatically chooses breaks (here I force them) and automatically sums
ggplot(WAR21) +
  geom_histogram(aes(x = WAR), breaks = my_bins, color = "white")

jonesey441 · September 25, 2022, 9:43pm

@AlexisW very cool! Thanks for sharing the cut function, as well as the added information regarding the differences and similarities between the barplot and histogram.

Dan0o0o · September 29, 2022, 9:57pm

Thanks for this suggestion. I tried this but was having issues getting it to work with it saying my X was not numeric. I am so far down the other path using case_when that I don't think it's be ideal to start all over.

Dan0o0o · September 29, 2022, 10:09pm

I am close on fixing this issue but still having a heck of a time to finish it off.

How can I sum specific ranges instead of just totals in my bins? I need to fix this issue in the code chunk below before executing a final, complete ggplot viz. My viz is close but has unique instances listed all over the place inside the bars of the viz below.

What do you suggest? I've gotten close with case_when but the totals are off. Any thoughts?

code is below...

library(ggplot2)
library(dplyr)
library(base)
library(dplyr)
library(tidyverse)
WAR <- read.csv("WAR.csv")
View(WAR)
## Only Pitchers displayed in a DF
pitchers <- filter(WAR, Type == "Pitcher")
View(pitchers)
##2020 pitchers only
pitchers20 <- filter(pitchers, year == 2020)
View(pitchers20)
##2021 pitchers only
pitchers21 <- filter(pitchers, year == 2021)
View(pitchers21)
##Only Hitters displayed in a DF
hitters <- filter(WAR, Type == "Hitter")
##2020 hitters only
hitters20 <- filter(hitters, year == 2020)
##2021 hitters only
hitters21 <- filter(hitters, year == 2021)
##2020 all WAR
WAR20 <- filter (WAR, year ==2020)
#2021 all WAR
WAR21 <- filter( WAR, year == 2021)

#2021 WAR histogram with custom bins and scales, not complete and need to remove Y axis
ggplot(WAR21, aes(x=WAR))+
  geom_histogram(fill='steelblue', col='black', bins=12)+
  labs(title = "2021 MLB fWAR Distribution, No PA/IP Minimums")+
  scale_x_continuous(breaks = seq(-2.,8.5, by = 1.00))+
  ylab ("NA dont want a Y axis")+
  xlab("fWAR")+
  theme(axis.title.y = element_text(color="#993333", size=13, face="bold"))+
  theme(axis.title.x = element_text(color="#993333", size=13, face="bold"))+
  theme(plot.title = element_text(color="Dark Red", size=14, face="bold.italic"))+
  theme(axis.text.x = element_text(color = "dark red", size = 9, face ="bold"))

View(WAR21)

#Summary counts of the datasets
WAR21_labels = WAR21 %>% 
  count(WAR)

View(WAR21_labels)

##Using case_when to create WAR bins

WAR21_subset <- WAR21 %>%
  mutate(fWAR = case_when(WAR >= -2 |WAR <= -1 ~ "-2 to -1",
                          WAR >= -1 |WAR <= 0 ~ "-1 to 0",
                          WAR >= 0 | WAR <= 1 ~ "0 to 1",
                          WAR >= 1 | WAR <= 2 ~ "1 to 2",
                          WAR >= 2 | WAR <= 3 ~ "2 to 3",
                          WAR >= 3 | WAR <= 4 ~ "3 to 4",
                          WAR >= 4 | WAR <= 5 ~ "4 to 5",
                          WAR >= 5 | WAR <= 6 ~ "5 to 6",
                          WAR >= 6 | WAR <= 7 ~ "6 to 7",
                          WAR >= 7 | WAR <= 8 ~ "7 to 8",)) %>% 
  select(fWAR, WAR) %>% 
  group_by(fWAR) %>% 
  summarize(fWAR = sum(WAR))

View(WAR21_subset)

##this is close but the unique values are each listed within each bar, not summed correctly, I want totals in each bar (e.g. 0-1 bin....787+212+164,etc)
ggplot(WAR21, aes(x=WAR))+
  geom_histogram(fill='steelblue', col='black', bins=10)+
  geom_text(data = WAR21_labels, aes(x = WAR, y = n, label = n), vjust =-0.8, size = 3) +
  labs(title = "2021 MLB fWAR Distribution, No PA/IP Minimums")+
  scale_x_continuous(breaks = seq(-2.0, 8.5, by = 1.0),
                     # updating bin labels (same length as breaks)
                     labels = c('-2 to -1', '-1 to 0', '0 to 1', '1 to 2', '2 to 3', '3 to 4', '4 to 5', '5 to 6', '6 to 7', '7 to 8', '8 to 9'))+ 
  ylab ("")+
  xlab("fWAR")+
  # updating to removes y-axis counts and ticks
  theme(axis.text.y = element_blank()) + 
  theme(axis.ticks.y = element_blank()) + 
  theme(axis.title.y = element_text(color="#993333", size=13, face="bold"))+
  theme(axis.title.x = element_text(color="#993333", size=13, face="bold"))+
  theme(plot.title = element_text(color="Dark Red", size=14, face="bold.italic"))+
  theme(axis.text.x = element_text(color = "dark red", size = 9, face ="bold"))```

jonesey441 · September 29, 2022, 10:22pm

@Dan0o0o it looks like you're creating the fWAR variable twice now. You do it once in the case_when() function to create the bins, and then again in the summarize() function which might be overwriting something?

And also your ggplot() call is still using the old WAR21 variable instead of your subset, which basically isn't using your new dataframe.

system · October 20, 2022, 10:23pm

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.