Model that predicts Statcast Strikes

Hi Guys,

I was recently asked to develop a plot to help communicate a model of what predicts strikes in the NY Mets 2021 Statcast data set.

My professor recommended trying something such as a series of pie or bar charts that express conditional probabilities, or something such as a coefficient/"ladder" plot.

He also noted that there are R packages that will get me started, such as arm, and suggested using geom_smooth() with a method that works for logistic regression.

So where can I start? I'm assuming I have to download a data package from baseball savant but from there I am somewhat lost.

In more formal terms, a strike is a response or outcome variable (sometimes called a dependent variable), recorded as 1 for a strike or 0 for a non-strike.

The first thing to do is to decide what, besides a ball, constitutes a non-strike. A hit? Hitting the batter? Maybe the data set has already made that decision for you.

To "predict" a strike, we consider a number of treatment variables (also called independent variables). Home/away, W/L record to date for the season, ERA, batter's bases on balls, etc. are candidates. There's a host of other variables, some of which might actually show an association, such as the shortstop's batting average, but don't go there without some deep thinking about causal analysis.

Conventionally, the response variable is called Y and the treatments X_1, \dots, X_n, and the goal is to determine the conditional probability of Y given X_1, \dots, X_n. Let strikes be Y and the X variables be X1 = ERA, X2 = inning, and X3 = home/away.
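Written out for those three example predictors, the logistic regression model for that conditional probability is shown below. The \beta coefficients are notation introduced here for illustration; they are what glm() estimates.

```latex
\Pr(Y = 1 \mid X_1, X_2, X_3)
  = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3)}}
\quad\Longleftrightarrow\quad
\log\frac{\Pr(Y = 1 \mid X_1, X_2, X_3)}{1 - \Pr(Y = 1 \mid X_1, X_2, X_3)}
  = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3
```

The right-hand form is why the coefficients are read on the log-odds scale rather than as changes in probability.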

mod <- glm(Y ~ X1 + X2 + X3, data = YOUR_DATA, family="binomial")
summary(mod)

Or, if you've subsetted the data so that it includes just those four variables:

mod <- glm(Y ~ ., data = YOUR_DATA, family="binomial")
summary(mod)

where . is everything else besides Y.
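To see both forms run end to end without the real download, here is a minimal sketch on simulated data. Every column and value is made up (the ERA-like and inning-like columns are only stand-ins), so the numbers mean nothing about baseball.

```r
# Simulated stand-in for the real data; every column and value here is made up.
set.seed(42)
n <- 500
sim <- data.frame(
  X1 = rnorm(n, mean = 4, sd = 1),                # e.g. an ERA-like number
  X2 = sample(1:9, n, replace = TRUE),            # e.g. inning
  X3 = factor(sample(c("home", "away"), n, replace = TRUE))
)
# Generate a 0/1 outcome whose log-odds depend on X1 only (an arbitrary choice).
sim$Y <- rbinom(n, size = 1, prob = plogis(-2 + 0.5 * sim$X1))

mod_named <- glm(Y ~ X1 + X2 + X3, data = sim, family = "binomial")
mod_dot   <- glm(Y ~ ., data = sim, family = "binomial")

# Both formulas describe the same model, so the coefficients agree.
all.equal(coef(mod_named), coef(mod_dot))

# Coefficients are on the log-odds scale; exponentiate them for odds ratios.
exp(coef(mod_named))
```

The `Y ~ .` shortcut is only safe once the data frame holds nothing but the outcome and the intended predictors; any leftover ID or comment column would be pulled into the model too.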

From there it can become tough sledding. As explained here, logistic regression doesn't work the way you might expect if you are coming from linear regression. For example, geom_smooth() really wants something continuous to work with.
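With a continuous predictor on the x axis and the 0/1 outcome on the y axis, geom_smooth() can draw a logistic fit if it is told to use glm() with a binomial family. A minimal sketch on made-up data (the `speed` and `strike` columns are hypothetical, not Statcast fields):

```r
library(ggplot2)

# Made-up data: one continuous predictor and a 0/1 outcome.
set.seed(1)
toy <- data.frame(speed = runif(300, min = 70, max = 100))
toy$strike <- rbinom(300, size = 1, prob = plogis(-8 + 0.09 * toy$speed))

p <- ggplot(toy, aes(x = speed, y = strike)) +
  # Jitter vertically so the 0s and 1s do not overplot.
  geom_jitter(height = 0.05, width = 0, alpha = 0.3) +
  # Fit a logistic curve instead of the default smoother.
  geom_smooth(method = "glm", method.args = list(family = "binomial")) +
  labs(x = "speed (hypothetical units)", y = "P(strike)")

print(p)
```

The jitter only affects how the points are drawn; the curve is fitted to the original 0/1 values.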

The first thing to do after downloading is to understand what all the variables measure and whether they are continuous, binary, logical, categorical or, perhaps, just comments. Then comes exploratory data analysis, where well-selected plots can help.
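A few base R calls cover most of that first pass. The sketch below uses a tiny made-up data frame; the column names are hypothetical placeholders for whatever the download actually contains.

```r
# Tiny made-up data frame standing in for the download; column names are hypothetical.
toy <- data.frame(
  outcome = c("ball", "called_strike", "foul", "ball", "hit_into_play"),
  speed   = c(91.2, 88.5, 94.1, NA, 86.0),
  home    = c(TRUE, TRUE, FALSE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

str(toy)                             # type of every column at a glance
sapply(toy, class)                   # the same information as a named vector
table(toy$outcome, useNA = "ifany")  # counts for a categorical column
summary(toy$speed)                   # range, quartiles and NA count for a numeric column
colSums(is.na(toy))                  # missing values per column
```

Tabulating the outcome column is also a quick way to see which events the data set already separates from strikes.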

It's hard to suggest more without knowing where you are at in terms of statistical and R experience.

Thank you for this, I appreciate the lengthy response. If you have the time, can I send you the code we worked on in class? Perhaps that will give you a better idea of how it's currently structured.

#homework

my current work:

library(tidyverse)

# Organize your directory structure so you don't have to use file.choose()
mets <- read_csv("Mets2021.csv")

## How do we know if a pitch was called a strike or not?
##
## We need this for a logistic regression, where the outcome is binary

colnames(mets)
summary(mets$description)
unique(mets$description)
unique(mets$type)

mets %>% select(description, type)

## You have options:
##
## - the description variable allows you to be more specific, e.g. choosing to only analyze "swinging strikes"
## - the type variable is easier to code, and gives a more general answer
##
## You have to choose which is better for your purposes.

types_of_strikes <- c(
  "swinging_strike", "called_strike", "swinging_strike_blocked"
)
mets$swinging_strike <- ifelse(mets$description == "swinging_strike", 1, 0)
mets$any_strike <- ifelse(mets$description %in% types_of_strikes, 1, 0)
mets %>%
  select(description, type, swinging_strike, any_strike) %>%
  print(n = 20)

m1 <- glm(
## formula
any_strike ~ effective_speed,
## which data?
data = mets,
## which probability model?
family = "binomial"
)

m2 <- glm(
## formula
any_strike ~ effective_speed + spin_axis + release_spin_rate,
## which data?
data = mets,
## which probability model?
family = "binomial"
)

m3 <- update(m2, . ~ . + pitch_type)

summary(m1)
summary(m2)
summary(m3)

AIC(m1, m2, m3)

sort(unique(mets$pitch_type))

mets %>% select(player_name, p_throws)

## How do we compare pitchers to each other?
##
## You want a "split-apply(-combine)" workflow

## split
dim(mets)
gmets <- mets %>% group_by(player_name)

## apply
gmets %>% summarize(mean_speed = mean(release_speed, na.rm = TRUE))
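The same split-apply-combine pattern can report a strike rate per pitcher alongside the mean speed. A minimal sketch on made-up rows; the pitcher names and values are purely illustrative, and `any_strike` is assumed to be a 0/1 column like the one built earlier:

```r
library(dplyr)

# Made-up pitches; names and values are purely illustrative.
toy <- tibble(
  player_name   = c("A", "A", "A", "B", "B"),
  any_strike    = c(1, 0, 1, 0, 0),
  release_speed = c(95, 94, NA, 88, 89)
)

by_pitcher <- toy %>%
  group_by(player_name) %>%        # split
  summarize(                       # apply + combine
    n_pitches   = n(),
    strike_rate = mean(any_strike),
    mean_speed  = mean(release_speed, na.rm = TRUE)
  )

by_pitcher
```

Because the outcome is coded 0/1, its group mean is the share of pitches that were strikes, which is an empirical conditional probability that could feed a bar chart.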

Regarding statistical and R experience, I would say I am a few months in, but I am still not sure what I am doing.

I can't check this without data, but from eyeballing it, it looks OK.
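For the coefficient ("ladder") plot your professor mentioned, one possible approach is to pull the estimates and standard errors out of the fitted model and draw them with ggplot2. The sketch below uses simulated data with invented predictors `x1` to `x3`, and approximate 95% Wald intervals (estimate plus or minus 1.96 standard errors), which is a simplifying assumption.

```r
library(ggplot2)

# Simulated stand-in data; the predictors are invented and do not come from Statcast.
set.seed(7)
n <- 1000
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
sim$y <- rbinom(n, 1, plogis(-0.5 + 0.8 * sim$x1 - 0.4 * sim$x2))

fit <- glm(y ~ x1 + x2 + x3, data = sim, family = "binomial")

# Estimates and standard errors from the summary table, dropping the intercept.
est <- coef(summary(fit))[-1, , drop = FALSE]
coefs <- data.frame(
  term     = rownames(est),
  estimate = est[, "Estimate"],
  lower    = est[, "Estimate"] - 1.96 * est[, "Std. Error"],  # approximate 95% Wald interval
  upper    = est[, "Estimate"] + 1.96 * est[, "Std. Error"]
)

p <- ggplot(coefs, aes(x = estimate, y = term)) +
  geom_vline(xintercept = 0, linetype = "dashed") +   # zero means no association
  geom_pointrange(aes(xmin = lower, xmax = upper)) +
  labs(x = "coefficient (log-odds scale)", y = NULL)

print(p)
```

With your own models, `fit` would be replaced by m1, m2 or m3. The arm package's coefplot() draws a similar figure directly from a fitted model. Predictors on very different scales (speed versus spin rate, say) are hard to compare on one axis unless they are standardized first.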
