Analyzing Baseball Team Data with R using the tidyverse

dplyr
tidyverse

#1

I'm in the process of analyzing baseball team data in R. I have historical team stats going back a while. All of the stat categories make up the columns, with the season and teams residing in rows (data is broken up by year).

I'm looking to create 2 data.frames from these stats.

  1. Summary of Stats to include: Weighted Average of each column based on a weighting of (.4 to 2018, .35 to 2017, and .25 to 2016), Most current 3 year average (in this case we are talking 2016-2018), and All Year Average
  2. Slope of the results of the data.frame in 1.

I've started my code, but got stuck. Here's what I have so far. I'm probably better off using summarize_at so that I can skip over the team names, and season column.

library(tidyverse)

tibble::tribble(
                  ~season,       ~team,  ~hr,  ~sb,   ~w, ~sv,
                    2015L, "Blue Jays", 232L,  88L,  93L, 34L,
                    2016L,   "Red Sox", 208L,  83L,  93L, 43L,
                    2017L,    "Astros", 238L,  98L, 101L, 45L,
                    2018L, "Athletics", 227L,  35L,  97L, 44L,
                    2018L,   "Red Sox", 208L, 125L, 108L, 46L,
                    2018L,   "Yankees", 267L,  63L, 100L, 49L,
                    2017L,   "Yankees", 241L,  90L,  91L, 36L,
                    2018L,   "Indians", 216L, 135L,  91L, 41L,
                    2017L,   "Indians", 212L,  88L, 102L, 37L,
                    2018L,    "Astros", 205L,  71L, 103L, 46L
                  )
#> # A tibble: 10 x 6
#>    season team         hr    sb     w    sv
#>     <int> <chr>     <int> <int> <int> <int>
#>  1   2015 Blue Jays   232    88    93    34
#>  2   2016 Red Sox     208    83    93    43
#>  3   2017 Astros      238    98   101    45
#>  4   2018 Athletics   227    35    97    44
#>  5   2018 Red Sox     208   125   108    46
#>  6   2018 Yankees     267    63   100    49
#>  7   2017 Yankees     241    90    91    36
#>  8   2018 Indians     216   135    91    41
#>  9   2017 Indians     212    88   102    37
#> 10   2018 Astros      205    71   103    46

summary_table <-  league_stats %>%
  mutate(weight = case_when(
    season == 2018 ~ .4,
    season == 2017 ~ .35,
    season == 2016 ~ .25,
    TRUE ~ NA_real_
  )) %>%
  summarize_all(funs(mean))

The last step is that I'd like to use the results of the 2nd Table in a number table. For example, multipy the slope of column HR times HR for a projection. The new data.frame would be projections data with the same columns.


#2

Just checking in to see if my question was clear enough. I'd be welcome to further dialogue if it isn't. Thanks in advance!


#3

What is your grouping variable for this?

It'll be easier to help you accomplish what you're describing if you give us the code you're currently trying to use to accomplish what you're describing, because it's a little unclear what exactly you're trying to summarize in that last table in your reprex (i.e. I don't think what I've done below is what you're going for, but I'm confused as to what the first and second tables you're describing are.

What do you mean by a number table? Is the "slope of column HR" something you've calculated already, or si that from an earlier step?

library(tidyverse)

league_stats <- tibble::tribble(
  ~season,       ~team,  ~hr,  ~sb,   ~w, ~sv,
  2015L, "Blue Jays", 232L,  88L,  93L, 34L,
  2016L,   "Red Sox", 208L,  83L,  93L, 43L,
  2017L,    "Astros", 238L,  98L, 101L, 45L,
  2018L, "Athletics", 227L,  35L,  97L, 44L,
  2018L,   "Red Sox", 208L, 125L, 108L, 46L,
  2018L,   "Yankees", 267L,  63L, 100L, 49L,
  2017L,   "Yankees", 241L,  90L,  91L, 36L,
  2018L,   "Indians", 216L, 135L,  91L, 41L,
  2017L,   "Indians", 212L,  88L, 102L, 37L,
  2018L,    "Astros", 205L,  71L, 103L, 46L
)

summary_table <-  league_stats %>%
  mutate(weight = case_when(
    season == 2018 ~ .4,
    season == 2017 ~ .35,
    season == 2016 ~ .25,
    TRUE ~ NA_real_
  )) %>%
  summarize_at(c("hr", "sb", "w", "sv", "weight"), funs(mean), na.rm = TRUE)

summary_table
#> # A tibble: 1 x 5
#>      hr    sb     w    sv weight
#>   <dbl> <dbl> <dbl> <dbl>  <dbl>
#> 1  225.  87.6  97.9  42.1  0.367

Created on 2018-12-17 by the reprex package (v0.2.1.9000)

Take a stab at the code, and it'll be easier to diagnose problems with it. :slightly_smiling_face:


#4

I didn't realize how unclear my initial question was until your response! Sorry about that. Let me try to take another stab at this. I've pasted some reformatted code below.

library(tidyverse)

league_stats <- tibble::tribble(
  ~season,       ~team,  ~hr,  ~sb, ~rbi,   ~w, ~sv, ~era,
  2016L,   "Red Sox", 208L,  83L, 836L,  93L, 43L,    4,
  2017L,    "Astros", 238L,  98L, 854L, 101L, 45L, 4.12,
  2018L, "Athletics", 227L,  35L, 778L,  97L, 44L, 3.82,
  2018L,   "Red Sox", 208L, 125L, 829L, 108L, 46L, 3.75,
  2018L,   "Yankees", 267L,  63L, 821L, 100L, 49L, 3.78,
  2017L,   "Yankees", 241L,  90L, 821L,  91L, 36L, 3.75,
  2018L,   "Indians", 216L, 135L, 786L,  91L, 41L, 3.77,
  2017L,   "Indians", 212L,  88L, 780L, 102L, 37L,  3.3,
  2018L,    "Astros", 205L,  71L, 763L, 103L, 46L, 3.11,
  2015L,    "Astros", 230L, 121L, 691L,  86L, 39L, 3.57,
  2016L,   "Indians", 185L, 134L, 733L,  94L, 37L, 3.86,
  2016L,    "Astros", 198L, 102L, 689L,  84L, 44L, 4.06,
  2015L,   "Yankees", 212L,  63L, 737L,  87L, 48L, 4.03,
  2015L,   "Indians", 141L,  86L, 640L,  81L, 38L, 3.68,
  2015L,   "Red Sox", 161L,  71L, 706L,  78L, 40L, 4.34,
  2017L,   "Red Sox", 168L, 106L, 735L,  93L, 39L, 3.73,
  2017L, "Athletics", 234L,  57L, 708L,  75L, 35L, 4.67,
  2016L,   "Yankees", 183L,  72L, 647L,  84L, 48L, 4.16,
  2015L, "Athletics", 146L,  78L, 661L,  68L, 28L, 4.16,
  2016L, "Athletics", 169L,  50L, 634L,  69L, 42L, 4.51
)

league_stats
#> # A tibble: 20 x 8
#>    season team         hr    sb   rbi     w    sv   era
#>     <int> <chr>     <int> <int> <int> <int> <int> <dbl>
#>  1   2016 Red Sox     208    83   836    93    43  4   
#>  2   2017 Astros      238    98   854   101    45  4.12
#>  3   2018 Athletics   227    35   778    97    44  3.82
#>  4   2018 Red Sox     208   125   829   108    46  3.75
#>  5   2018 Yankees     267    63   821   100    49  3.78
#>  6   2017 Yankees     241    90   821    91    36  3.75
#>  7   2018 Indians     216   135   786    91    41  3.77
#>  8   2017 Indians     212    88   780   102    37  3.3 
#>  9   2018 Astros      205    71   763   103    46  3.11
#> 10   2015 Astros      230   121   691    86    39  3.57
#> 11   2016 Indians     185   134   733    94    37  3.86
#> 12   2016 Astros      198   102   689    84    44  4.06
#> 13   2015 Yankees     212    63   737    87    48  4.03
#> 14   2015 Indians     141    86   640    81    38  3.68
#> 15   2015 Red Sox     161    71   706    78    40  4.34
#> 16   2017 Red Sox     168   106   735    93    39  3.73
#> 17   2017 Athletics   234    57   708    75    35  4.67
#> 18   2016 Yankees     183    72   647    84    48  4.16
#> 19   2015 Athletics   146    78   661    68    28  4.16
#> 20   2016 Athletics   169    50   634    69    42  4.51

long_league_stats <- league_stats %>%
  select(team, season, hr, sb, rbi, w, sv, era)  %>%
  gather(key = stat, value = "value", 3:8 ) %>%
  mutate(weight = case_when(
    season == 2018 ~ .5,
    season == 2017 ~ .3,
    season == 2016 ~ .2,
    TRUE ~ NA_real_
  )) %>%
  group_by(season, stat) %>%
  mutate(rank = if_else(stat == "era", rank(value, ties.method = "first"), rank(-value, ties.method = "first")),
         points = if_else(stat =="era", rank(-value, ties.method = "first"), rank(value, ties.method = "first")),
         wtd_value = value * weight) %>%
  ungroup() %>%
  group_by(stat, rank) %>%
  summarize(avg_value = mean(value, na.rm = TRUE),
            wtd_value = sum(wtd_value, na.rm = TRUE)) %>%
  ungroup() 

long_league_stats
#> # A tibble: 30 x 4
#>    stat   rank avg_value wtd_value
#>    <chr> <int>     <dbl>     <dbl>
#>  1 era       1      3.46      3.32
#>  2 era       2      3.79      3.79
#>  3 era       3      3.90      3.82
#>  4 era       4      4.06      3.96
#>  5 era       5      4.34      4.21
#>  6 hr        1    236.      247.  
#>  7 hr        2    219.      224.  
#>  8 hr        3    199       215.  
#>  9 hr        4    187.      204.  
#> 10 hr        5    171.      187.  
#> # ... with 20 more rows

Created on 2018-12-17 by the reprex package (v0.2.1)

I'm basically looking to create 2 df. The first one's code is listed above. It is taking the aggregated baseball stats by season, and ranking the values by statistic category. I am then creating my summary stats (weighted average, mean, etc). You'll notice that ERA is ranked differently than the rest of the stats. That's intentional (for those non baseball fans).

  1. First off, is a long df the right way to go vs. a wide df?
  2. How would I make this code more flexible so that I don't have to type the years in every year? The weighted average would be the last 3 years (season that just finished plus prior 2)?
  3. How can I add a last 3 year average the same way?
  4. How do I get the slope of the avg_value and wtd_value columns?
  5. How do I apply the results of 4 to a separate df. Let's say the named columns for 4 are slope_category_name. I.E. slope_HR. How could I use that result in a separate calculation? Ex. HR * slope_HR ?

Hope this makes more sense now, and thanks again!


#5

I don't have time to run through all of the Qs right now, but a couple thoughts:

It totally depends, but, given it sounds like you're going to want to use those stats inside calculated variables later, you might want to have them as headers (like you did in your initial example)

If you're thinking of "last" as being the highest three numbers you could use max(year), (max(year) - 1, etc. Something like

  mutate(weight = case_when(
    season == max(season) ~ .5,
    season == (max(season) - 1) ~ .3,
    season == (max(season) - 2) ~ .2,
    TRUE ~ NA_real_
  )

If you're thinking of "last" as relative to the current year, you can calculate that relative to the year portion of Sys.Date(), or something to that effect.

Filter on year, and calculate the weighted average as you just did.

I'm still not clear what you mean by slope. I assume you're referencing a model you're fitting, but I'm not clear on what your outcome variable is. As for looking at model fit using a variety of vars, you might want to check out this chapter of R4DS:


#6

I don't know if I just repeat what @mara had said but here goes.

First off, is a long df the right way to go vs. a wide df ?

Doesn't matter as we have tidyr package anyway to reshape it. But I think league_stats is already in a tidy format rather than long_leagus_stats

How would I make this code more flexible so that I don't have to type the years in every year? The weighted average would be the last 3 years (season that just finished plus prior 2).

In your data, I take that you only want to do a weighted mean on year 2018, 2017 and 2016.

league_stats %>% 
  arrange(team, season) %>% # just so that we can see it sorted
  filter(season > max(season) - 3) %>% # take the last season and 2 seasons prior
  mutate(weighted_mean = weighted.mean(x = sv, w = w)) # weighted mean

But, let us know if you want to do a moving average. By that case, you are doing 2 average: for (2018, 2017, 2016) and (2017, 2016, 2015).

How do I get the slope of the avg_value and wtd_value columns?

What slope? more detail is appreciated.

How do I apply the results of 4 to a separate df . Let's say the named columns for 4 are slope_category_name . I.E. slope_HR . How could I use that result in a separate calculation? Ex. HR * slope_HR ?

broom package will help in this problem. But, if you only need the slope number, I think it would be better to store the slope number in an object and then multiply them to your desired column.


#7

@rexevan and @mara- Thanks for your responses.

@mara's suggestion to use max should solve that issue. I want this code to be usable next year without having to manually type in the years for the dataset.

Yes, I'm looking to do a weighted mean based on defined weights for max season = .5, etc. As @mara listed above. The other thing I'm looking to see is a straight 3 year average.

My plan is to then get the slope of the weighted average, average, and 3 year average for all categories. Something like this:

slope <- league_stats %>%
  group_by(stat) %>%
  do(tidy(lm(avg_value~ points, data = .))) %>%
  filter(term == "points") %>% 
  select(stat, estimate) %>%
  rename(average_slope = estimate)

I want to use the value that comes from there in each calculation for stats for a table. For example.

slope_stats_table <- league_stats %>%
  mutate_at(HR = slope_hr * hr)

Hope that makes sense.

Thanks again!


#8

Why don't you take a whack at these, and, if you get stuck, we can look at a reprex and help you modify? The un-weighted average would just be the average without using the weights. If you're sticking with the wide format (which is still what I think I'd recommend in this case), you might take a look at scoped summarize_*() variants.

As @rexevan mentioned, broom is definitely your friend here, and in the linked R4DS sections from above, there are some good examples of how to use broom and modelr to get the kind of data you're looking for.


#9

@mara and @rexevan. Here's the code I've come up with. Is there a more efficient way to get to this?

Also, say I want to use the variables in the slope table in a separate calculations. (The value for the slope of weighted and HR). Is there an effective way to do this?

library(tidyverse)
library(broom)

league_stats <- tibble::tribble(
  ~season,       ~team,  ~hr,  ~sb, ~rbi,   ~w, ~sv, ~era,
  2016L,   "Red Sox", 208L,  83L, 836L,  93L, 43L,    4,
  2017L,    "Astros", 238L,  98L, 854L, 101L, 45L, 4.12,
  2018L, "Athletics", 227L,  35L, 778L,  97L, 44L, 3.82,
  2018L,   "Red Sox", 208L, 125L, 829L, 108L, 46L, 3.75,
  2018L,   "Yankees", 267L,  63L, 821L, 100L, 49L, 3.78,
  2017L,   "Yankees", 241L,  90L, 821L,  91L, 36L, 3.75,
  2018L,   "Indians", 216L, 135L, 786L,  91L, 41L, 3.77,
  2017L,   "Indians", 212L,  88L, 780L, 102L, 37L,  3.3,
  2018L,    "Astros", 205L,  71L, 763L, 103L, 46L, 3.11,
  2015L,    "Astros", 230L, 121L, 691L,  86L, 39L, 3.57,
  2016L,   "Indians", 185L, 134L, 733L,  94L, 37L, 3.86,
  2016L,    "Astros", 198L, 102L, 689L,  84L, 44L, 4.06,
  2015L,   "Yankees", 212L,  63L, 737L,  87L, 48L, 4.03,
  2015L,   "Indians", 141L,  86L, 640L,  81L, 38L, 3.68,
  2015L,   "Red Sox", 161L,  71L, 706L,  78L, 40L, 4.34,
  2017L,   "Red Sox", 168L, 106L, 735L,  93L, 39L, 3.73,
  2017L, "Athletics", 234L,  57L, 708L,  75L, 35L, 4.67,
  2016L,   "Yankees", 183L,  72L, 647L,  84L, 48L, 4.16,
  2015L, "Athletics", 146L,  78L, 661L,  68L, 28L, 4.16,
  2016L, "Athletics", 169L,  50L, 634L,  69L, 42L, 4.51
)

long_league_stats <- league_stats %>%
  select(team, season, hr, sb, rbi, w, sv, era)  %>%
  gather(key = stat, value = "value", 3:8 ) %>%
  mutate(weight = case_when(
    season == max(season) ~ .45,
    season == (max(season) - 1) ~ .35,
    season == (max(season) - 2) ~ .20,
    season <=  (max(season) - 3) ~ .00,
    TRUE ~ NA_real_
  )) 


#Ranks by Season Long
ranks_by_season_long <- long_league_stats %>%
  group_by(season, stat) %>%
  mutate(rank = if_else(stat == "era", rank(value, ties.method = "first"), rank(-value, ties.method = "first")),
         points = if_else(stat == "era", rank(-rank, ties.method = "first"), rank(-rank, ties.method = "first")),
         wtd_value = value * weight) %>%
  ungroup()

ranks_by_season <- ranks_by_season_long %>%
  select(season, stat, rank, value, points) %>%
  spread(stat, value) %>%
  mutate(season = as.character(as.numeric(season)))

weighted <- ranks_by_season_long %>%
  group_by(stat, rank, points) %>%
  summarize(weighted = sum(wtd_value)) %>%
  spread(stat, weighted) %>%
  mutate(season = "weighted")

average_3yr <- ranks_by_season_long %>%
  filter(season > max(season) - 3) %>%
  group_by(stat, rank, points) %>%
  summarize(avg_3yr = sum(value / 3)) %>%
  spread(stat, avg_3yr) %>%
  mutate(season = "avg_3yr")


stats_table_long <- bind_rows(ranks_by_season, weighted, average_3yr) %>%
  gather(key = stat, value = value, era: w)

stats_table_long
#> # A tibble: 180 x 5
#>    season  rank points stat  value
#>    <chr>  <int>  <int> <chr> <dbl>
#>  1 2015       1      5 era    3.57
#>  2 2015       2      4 era    3.68
#>  3 2015       3      3 era    4.03
#>  4 2015       4      2 era    4.16
#>  5 2015       5      1 era    4.34
#>  6 2016       1      5 era    3.86
#>  7 2016       2      4 era    4   
#>  8 2016       3      3 era    4.06
#>  9 2016       4      2 era    4.16
#> 10 2016       5      1 era    4.51
#> # ... with 170 more rows

slope <- stats_table_long %>%
  group_by(stat, season) %>%
  do(tidy(lm(value~ points, data = .))) %>%
  filter(term == "points") %>% 
  select(season, stat, estimate) %>%
  rename(slope = estimate) %>%
  spread(stat, slope) %>%
  select(season, hr, sb, rbi, w, sv, era)
  
slope
#> # A tibble: 6 x 7
#> # Groups:   season [6]
#>   season      hr    sb   rbi     w    sv    era
#>   <chr>    <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>
#> 1 2015     24.4   13.1  23.9  4.6   4.20 -0.202
#> 2 2016      9.30  19.8  49.0  5.9   2.40 -0.146
#> 3 2017     17.2   10.8  37.8  6.4   2.30 -0.313
#> 4 2018     14.3   26.2  17.5  4.00  1.80 -0.145
#> 5 avg_3yr  13.6   18.9  34.8  5.43  2.17 -0.201
#> 6 weighted 14.3   19.5  30.9  5.22  2.09 -0.204

#10

Great! It would be better if you could add comments in the script describing what each chuck of code does. It's nice that it works, but be aware that the future you -- 1 year from now -- will read that chuck of script again.


#11

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.