Understanding Partial Dependence Plots

Hi,

I'm currently trying to wrap my head around partial dependence plots, and I have a few questions I was hoping someone could help me with.
I have found a number of definitions...

  • "Suppose you would like to understand importance of variable p_i in the model, PDP builds the model averaging other predictor variable except one choosen predictor variable p_i and measures change in response yhat and y, change in response can help identify how a variable is affecting the model" 1

  • "The idea is that the function f(x) tells us how the value of the variable x_j influences the model predictions \hat y after we have “averaged out” the influence of all other variables" 2

  • "The partial dependence plot (short PDP or PD plot) shows the marginal effect one or two features have on the predicted outcome of a machine learning model"3

I think I am getting bogged down in the technical speak about distributions, so I was hoping to get a layman's description of what is happening.

From my understanding, we take a predictor and change its value across a range of values for each row of data. These values are determined by a grid. Depending on the implementation, all other variables are held constant (at a median or average, etc.). At each grid value, the mean of the predictions over all rows gives a single point on the plot; if you plot the individual rows themselves you get an ICE plot. So we see how the model output changes, averaged over all rows in the dataset, as the predictor moves across a range of pre-determined values. If it's a regression problem like the cost of a house, the predicted house price should vary 3 (if you get negative values here, it's because you are getting a result below the average house price). If it's a classification problem, the predicted probability of an instance belonging to a particular class should vary 3. (See the sketch below.)
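To make that concrete, here is a rough sketch in R of what I think is happening, using a plain lm() on mtcars just to check the mechanics (the model and data are stand-ins, not my actual example):

# what I think the PDP/ICE computation looks like, mechanically
mod  <- lm(mpg ~ wt + hp, data = mtcars)
grid <- seq(min(mtcars$wt), max(mtcars$wt), length.out = 20)

# one row per observation, one column per grid value
preds <- sapply(grid, function(w) {
  tmp    <- mtcars
  tmp$wt <- w          # set wt to this grid value for EVERY row
  predict(mod, tmp)    # the other columns keep their own values
})

matplot(grid, t(preds), type = "l", lty = 1, col = "grey",
        xlab = "wt", ylab = "predicted mpg")   # individual rows = ICE curves
lines(grid, colMeans(preds), lwd = 2)          # their mean = the PDP line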

My questions might seem a bit basic, but I hope you can help. Taking the example from 4, I replicated the PDP graphs for the iris dataset, trying to predict whether a flower is setosa or not. I plotted the results and have a point x = 5.041176, y = 0.35345465 on the graph.

  • Is my understanding of PDPs above accurate?

  • Can someone explain to me what "averaged out" and "marginal distributions" actually mean?

  • In relation to interpretation: if my sepal length is 5.041176, then the probability that I have a setosa on my hands is on average 35%, all other attributes being held constant. Is that right?

  • At the point x = 5.041176, y = 0.35345465, what are all my other attributes held at (for example Sepal.Width, Petal.Length and Petal.Width)? Can someone show me a quick calculation in R based on the data?

Thank you very much for your time.

library(tidyverse)
library(e1071)
library(pdp)

mydf <- iris %>% 
  # binary target: setosa vs. everything else
  mutate(tgt = factor(ifelse(Species == 'setosa', "yes", "no"))) %>% 
  select(-Species)

# prediction wrapper for pdp::partial(): returns the mean predicted
# probability of the "yes" (setosa) class over the supplied rows
pred.prob <- function(object, newdata) {
  pred <- predict(object, newdata, probability = TRUE)
  prob.setosa <- attr(pred, which = "probabilities")[, "yes"]
  mean(prob.setosa)
}

# radial-basis SVM with probability estimates enabled
svm_mod <- svm(tgt ~ ., data = mydf, kernel = "radial", gamma = 0.75,
               cost = 0.25, probability = TRUE)

# pred.fun already returns probabilities, so partial()'s prob argument is
# not needed (and probs sets quantile grid points, not a probability
# switch); train = mydf makes the training data explicit
pdp_graph <- partial(svm_mod, pred.var = "Sepal.Length",
                     pred.fun = pred.prob, train = mydf)

ggplot(pdp_graph, aes(x = Sepal.Length, y = yhat)) +
  geom_line() +
  annotate("point", x = 5.041176, y = 0.35345465, size = 4) +
  theme_minimal()

Created on 2019-03-27 by the reprex package (v0.2.1)

Reading through the code in your first reference (dpmartin42), this is my understanding of Partial Dependence. Imagine you have a data set with the outcome Y, 10 predictors and 100 observations. We will call the predictors V1, V2, ..., V10. You have fit a model M for predicting Y from V1:V10.
You now want to characterize the partial dependence of V1 over the range V1 = 1 to V1 = 5 in steps of 0.1, so you have 41 values of V1. Take the values of V2:V10 and make 41 copies. To each of these copies, append a single value of V1, so that the given value of V1 is matched with all 100 available rows of V2:V10. Use model M to predict Y for each of the 100 rows. The average of these 100 predictions is yhat for this value of V1; this averaging over the observed (marginal) values of V2:V10 is what "averaged out" means. Repeating this for the other 40 values of V1 builds up the partial dependence plot.
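Applied to your iris reprex, the exact version at your single grid point looks something like this (a sketch, assuming the mydf and svm_mod objects from your code):

# exact partial dependence at one grid value of Sepal.Length: every row
# keeps its OWN Sepal.Width, Petal.Length and Petal.Width -- nothing is
# fixed at a mean or median
x_val <- 5.041176
tmp <- mydf
tmp$Sepal.Length <- x_val

pred <- predict(svm_mod, tmp, probability = TRUE)
mean(attr(pred, "probabilities")[, "yes"])  # should match the yhat at that point

So, to answer your last question: at x = 5.041176 the other attributes are not held at any single value; each of the 150 rows contributes its own Sepal.Width, Petal.Length and Petal.Width, and the 150 predictions are averaged.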

A quick-and-dirty way to approximate this is to set each variable in V2:V10 to its median or mean. Then you only have to do the prediction 41 times, one for each desired value of V1, instead of 4100 times.
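In the iris example, the shortcut would look roughly like this (again a sketch, assuming mydf and svm_mod from the reprex above; the grid is hypothetical):

# approximate version: fix the other columns at their medians, so each
# grid value needs only ONE prediction instead of 150
grid <- seq(min(mydf$Sepal.Length), max(mydf$Sepal.Length), length.out = 41)
newdat <- data.frame(
  Sepal.Length = grid,
  Sepal.Width  = median(mydf$Sepal.Width),
  Petal.Length = median(mydf$Petal.Length),
  Petal.Width  = median(mydf$Petal.Width)
)
pred <- predict(svm_mod, newdat, probability = TRUE)
plot(grid, attr(pred, "probabilities")[, "yes"], type = "l",
     xlab = "Sepal.Length", ylab = "approximate yhat")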


Ah, I see.
It was the approximation with the mean or median of the other variables that was throwing me.

So in essence, if you have the computing power you should make the copies of the dataset for each grid value, but if you are short on processing power you can use the approximation with the mean or median of the other columns?

Yes, that is my understanding also. It is easy to see that the computational requirements can be huge, especially if you want to look at a combination of variables.
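For example, with the objects from the reprex above, a two-variable plot crosses the two grids, and every grid cell is still averaged over all 150 rows, so the cost multiplies. A sketch (plotPartial() is pdp's lattice-based plotting helper):

# two-variable partial dependence: the grid is the cross product of both
# variables' grids, so the number of predictions grows multiplicatively
pd2 <- partial(svm_mod, pred.var = c("Sepal.Length", "Petal.Length"),
               pred.fun = pred.prob, train = mydf)
plotPartial(pd2)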


Thank you very much for your time. I have been struggling with this for a few days, and it was nice to bounce it off someone.
