Linear regression line looks wrong

I made a plot for this soccer blog I've started. It's measuring the number of saves goalkeepers make against the number of shots they've faced.

See the two R charts and a table of data here:

The plot points look right but the line doesn't. The average save-to-shot ratio is 0.68, so I thought the line would run through that value. For example, Jaakkola in the top right is right on the line even though his numbers are above average. Likewise, Dawson is well left of the line when he's about bang on average.

Here's the code. Can anyone see any likely causes?

```
ggplot(EFL_keepers_Save_per_game_ratios,
       mapping = aes(x = Saves_per_game, y = Shots_per_game)) +
  geom_point(aes(size = Shots)) +
  stat_smooth(method = "lm", se = FALSE) +
  labs(x = "Saves per game", y = "Shots faced per game") +
  geom_text(aes(label = Name), hjust = 0, vjust = 0)
```

Apologies, I'm not sure how to present that code chunk.

I could potentially provide the data as well for a reprex.

Antony

Yes, a reprex would help.

Re. formatting, see the FAQ, below.
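One common way to share a small dataset for a reprex is `dput()`, which prints R code that rebuilds the object. The sketch below uses a made-up three-row data frame; the column names follow the plot code above, but the keeper names and values are hypothetical.

```r
# Hypothetical stand-in for the real keeper data
keepers <- data.frame(
  Name = c("Keeper A", "Keeper B", "Keeper C"),
  Shots_per_game = c(3.0, 4.5, 5.5),
  Saves_per_game = c(2.1, 2.9, 4.2)
)

# Prints R code that recreates the data frame.
# Paste that printed output into the post so others can run it.
dput(head(keepers, 20))
```

With the real data, replacing `keepers` with the actual data frame name should be all that is needed.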

what are the shots and saves for Jaakkola?

There are a few things going on which may be causing this to "not look right"

  1. The X and Y axis ranges are different, so a line at 45 degrees does not mean a 50% ratio.
  2. lm is an OLS regression. OLS of y ~ x minimizes the error measured along the y axis, i.e. perpendicular to the independent (x) axis. So you get a different answer if you reverse the x and y. That's the nature of OLS. You may be expecting Total Least Squares (TLS), which is more like Principal Component Analysis (PCA).
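
To illustrate point 2, here is a minimal sketch with made-up numbers (the `shots` and `saves` vectors are hypothetical, not the EFL data). It fits the regression both ways round and expresses both slopes on the same axes, so the difference is visible.

```r
# Hypothetical per-game figures for six keepers (not the real data)
shots <- c(3.0, 3.5, 4.0, 4.5, 5.0, 5.5)
saves <- c(2.1, 2.2, 2.9, 2.9, 3.6, 3.5)

fit_yx <- lm(shots ~ saves)  # what stat_smooth fits when x = saves, y = shots
fit_xy <- lm(saves ~ shots)  # the reverse regression

slope_yx <- unname(coef(fit_yx)["saves"])
# Slope of the reverse fit, re-expressed as shots against saves
slope_xy_same_axes <- 1 / unname(coef(fit_xy)["shots"])

# The two slopes only agree when the points are perfectly correlated
c(shots_on_saves = slope_yx, saves_on_shots_inverted = slope_xy_same_axes)
```

The ratio of the two slopes equals the squared correlation, so the noisier the data, the further apart the two lines sit.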

Since I'm not sure of your preconceived notions, it's hard to guess what might be causing it to not look right to you. Maybe you could share examples of what you would expect vs. what you see.

I can describe what I expect to see.

The mean save %age for all keepers is about 0.68, so I would expect the line to trace along that value.

Jaakkola is 0.76. I'd expect him to be to the right of the line, but he's right on it. Meanwhile, at the other end of the line, but also on the line, is Archer who rates about 0.61.

So, the line goes from 0.61 to 0.76, not along 0.68 as I'd expect.

Here's that bit of code again, but it falls short of a reprex as I can't make sense of how to put my dataset on here:

```
ggplot(EFL_keepers_Save_per_game_ratios,
       mapping = aes(x = Saves_per_game, y = Shots_per_game)) +
  geom_point(aes(size = Shots)) +
  stat_smooth(method = "lm", se = FALSE) +
  labs(x = "Saves per game", y = "Shots faced per game") +
  geom_text(aes(label = Name), hjust = 0, vjust = 0)
```

A regression of x ~ y (in your case Saves_per_game ~ Shots_per_game) is not guaranteed to have a slope equal to x/y, which seems to be your intuition. The reason is that the regression has an intercept that is not zero. If you force the line through (0, 0), the slope becomes a weighted average of the keepers' individual ratios, which should sit much closer to the overall ratio.

In your regression the coefficient of y can be interpreted as "for each additional shot taken, how many additional saves would be expected?"
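
As a rough sketch of the intercept point, the snippet below compares a fit with an intercept to one forced through the origin. The data are made up for illustration (hypothetical `shots` and `saves` vectors), so only the pattern matters, not the numbers.

```r
# Hypothetical per-game figures for six keepers (not the real data)
shots <- c(3.0, 3.5, 4.0, 4.5, 5.0, 5.5)
saves <- c(2.1, 2.2, 2.9, 2.9, 3.6, 3.5)

with_intercept <- lm(saves ~ shots)      # default: intercept is estimated
through_origin <- lm(saves ~ 0 + shots)  # "0 +" drops the intercept

coef(with_intercept)  # intercept and slope; the slope is not the save ratio

slope_origin <- unname(coef(through_origin)["shots"])
# Closed form for a through-origin OLS slope
manual_slope <- sum(shots * saves) / sum(shots^2)
# The plain average of each keeper's save ratio, for comparison
mean_ratio <- mean(saves / shots)

c(slope_origin = slope_origin, manual_slope = manual_slope, mean_ratio = mean_ratio)
```

The through-origin slope matches the closed form exactly and lands near the mean ratio, though the two are not guaranteed to be identical.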

If you really want to see what a standout Jaakkola is, you need to build a model without Jaakkola in the sample and then show what he would have been expected to do based on that model compared to what he actually did. You're currently building a model where Jaakkola is part of the "expected", i.e. the training set.
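
A minimal sketch of that leave-one-out idea, assuming a data frame with the same column names as the plot code. The keeper names and values here are hypothetical, with "Keeper F" standing in for the keeper of interest.

```r
# Hypothetical stand-in for the real keeper data
keepers <- data.frame(
  Name = c("Keeper A", "Keeper B", "Keeper C", "Keeper D", "Keeper E", "Keeper F"),
  Shots_per_game = c(3.0, 3.5, 4.0, 4.5, 5.0, 5.5),
  Saves_per_game = c(2.1, 2.2, 2.9, 2.9, 3.6, 4.2)
)

target <- "Keeper F"
train <- keepers[keepers$Name != target, ]     # everyone except the target
held_out <- keepers[keepers$Name == target, ]  # the target keeper only

# Fit on the other keepers, then predict the target's saves from his shots
fit <- lm(Saves_per_game ~ Shots_per_game, data = train)
expected <- unname(predict(fit, newdata = held_out))
actual <- held_out$Saves_per_game

c(expected = expected, actual = actual, difference = actual - expected)
```

A positive difference would suggest the keeper saved more than the other keepers' pattern predicts.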

That makes sense. Thank you. So, how do I force the intercept to go through 0,0 in R?

(PS, Jaakkola is not my focus here. He was an illustration of someone I thought was in a strange position relative to what I thought the line should do)

I think this would do it. It's a little tricky because the modeling built into ggplot does not make it easy to omit the intercept.

A simplified way of doing this, however, would be to calculate the desired slope of the line outside of ggplot then use geom_abline to just plot the line on the graph.
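
A sketch of the `geom_abline` approach, assuming the same axes as the original plot (saves on x, shots on y) and made-up data. Because shots are on the y axis, a keeper with save ratio r sits on the line shots = saves / r, so the slope to draw is 1 / r rather than r. The `keepers` data frame and its values are hypothetical.

```r
library(ggplot2)

# Hypothetical stand-in for the real keeper data
keepers <- data.frame(
  Name = c("Keeper A", "Keeper B", "Keeper C", "Keeper D", "Keeper E", "Keeper F"),
  Shots_per_game = c(3.0, 3.5, 4.0, 4.5, 5.0, 5.5),
  Saves_per_game = c(2.1, 2.2, 2.9, 2.9, 3.6, 4.2)
)

# Mean save ratio across keepers (the thread quotes about 0.68 for the real data)
mean_ratio <- mean(keepers$Saves_per_game / keepers$Shots_per_game)

p <- ggplot(keepers, aes(x = Saves_per_game, y = Shots_per_game)) +
  geom_point() +
  # Line through the origin marking the mean save ratio
  geom_abline(intercept = 0, slope = 1 / mean_ratio) +
  geom_text(aes(label = Name), hjust = 0, vjust = 0) +
  labs(x = "Saves per game", y = "Shots faced per game")

p
```

On this plot, keepers to the right of (below) the line have an above-average save ratio, and keepers to the left of (above) it are below average.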
