August 25, 2003
Correlation and DIPS
Dr. Manhattan at Blissful Knowledge writes a post about the recent study by Tom Tippett at Diamond Mind Baseball on Voros McCracken's DIPS theory. He's not sure about something:
McCracken’s theory had value beyond what it said about pitchers’ performance. By largely removing the influence of pitchers from the results of balls put into play, it also provided the justification for the foundation of Bill James’ and Baseball Prospectus’ methods of measuring fielding performance. But now, it seems like we have taken several steps backward. To use an example cited in the Diamond Mind study, in measuring the defensive performance of the Seattle Mariners over the last several years, don’t you have to adjust for the influence of Jamie Moyer? And again, doesn’t that merely reopen the “Hibernian Problem” of distinguishing pitching from defense?
Maybe I’m missing something here, but I’m not sure what it is.
I meant to talk about this at the time of the Diamond Mind article, but something else took my attention and I never got back to it. I think the most interesting part of the Tippett article is this section entitled, "Year-to-year variations, part two."
It goes without saying that one cannot prove or disprove the idea that "there is little correlation between what a pitcher does one year in the stat and what he will do the next" by examining only ten or twelve careers.
To get a better handle on this phenomenon, I compiled a database consisting of all pairs of consecutive seasons in which a pitcher faced at least 400 batters in each season. Using this sample of 7,486 season-pairs, I computed the correlation coefficient for the net HBP rate, BB rate, K rate, HR rate, and in-play hit rate.
I found the highest correlation (.73) for strikeout rates. Walk rates (.66) were also highly correlated. The correlation coefficients dropped to .36 for hit batsmen, .29 for homeruns, and .16 for in-play batting average relative to the league. The lowest correlation (.09) was seen for in-play batting average relative to the team.
It may appear to be contradictory to say that certain pitchers appear to be consistently good while the overall correlation rate is quite low. But that's not necessarily so.
If McCracken is right, the difference between a pitcher's IPAvg and that of his team should vary randomly around zero as he moves through his career, and the correlation would be quite weak.
But if pitchers do have some influence over these outcomes, they could still exhibit a weak correlation by varying around some value other than zero that reflects the ability of the pitcher.
(Emphasis added by me.)
What Tippett is saying here is that you can predict strikeout rates pretty well just by looking at the pitcher's previous season, but you can't predict in-play batting average relative to the team well at all. That's what correlation means. Correlation goes on a scale of -1 to 1, where 1 is perfect correlation (the best at one will be the best at the other), -1 is perfect opposite correlation (the best at one will be the worst at the other) and 0 means no correlation at all; in other words, being the best at one will tell us nothing about how you do at the other. The statistician I learned from used to tell me that if he sees .5 correlation, he assumes the data is random. Seeing a .09 correlation tells me the data is very random. It's not 0, but it's very close to 0.
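To make that scale concrete, here is a rough sketch in Python. The `pearson` function is the standard correlation coefficient. The `simulate_season_pairs` helper is my own invention for illustration, not Tippett's method or his data: it gives each pitcher a small fixed "skill" and adds a much larger dose of random luck each season. The spread values (0.006 for skill, 0.019 for luck) are made-up numbers, picked only so the skill share of the total variance comes out near .09.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def simulate_season_pairs(n_pairs, skill_sd, noise_sd, seed=0):
    """Hypothetical model: each pitcher has a fixed skill (his true
    in-play average relative to his team) and every season adds
    independent random luck on top of it. All numbers are assumptions."""
    rng = random.Random(seed)
    year1, year2 = [], []
    for _ in range(n_pairs):
        skill = rng.gauss(0.0, skill_sd)
        year1.append(skill + rng.gauss(0.0, noise_sd))
        year2.append(skill + rng.gauss(0.0, noise_sd))
    return year1, year2


# The two ends of the scale.
print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))   # perfect correlation
print(pearson([1, 2, 3, 4], [8, 6, 4, 2]))   # perfect opposite correlation

# A small real skill buried under a lot of luck (made-up spreads).
y1, y2 = simulate_season_pairs(7486, skill_sd=0.006, noise_sd=0.019)
print(pearson(y1, y2))

# No skill at all: every pitcher varies randomly around zero.
z1, z2 = simulate_season_pairs(7486, skill_sd=0.0, noise_sd=0.019)
print(pearson(z1, z2))
```

In this toy model the expected year-to-year correlation is the skill variance divided by the total variance, about 36 / (36 + 361), or roughly .09. So a real but small ability can sit behind a very weak correlation, which is the possibility Tippett raises, and it also shows how small that ability has to be compared with the luck.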
So, as to Dr. Manhattan's question: yes, you are missing something. The effect Tippett is showing is small, so small that DIPS is still valid. Bill James knew about this when he wrote Win Shares, but for the aggregate I think it works really well. We don't have to reopen the “Hibernian Problem”; we just have to understand that the solution is just an approximation.
Correction: The Hibernian Problem is a typo in the original Blissful Knowledge entry. Here's what the problem is:
I've just fixed my post. Sorry - my BP2K is in storage so I couldn't double-check it. The correct term is "Hibert" problems. Those problems were a list of 23 fundamental mathematical problems propounded around the turn of the century by a mathematician named Hibert. In BP2K, Keith Woolner tried his hand at setting out a list of parallel questions, and a primary one was the distinction of pitching and defense. The piece helped inspire Voros McCracken.
Correction II: The name of the mathematician is Hilbert, not Hibert. Thanks to Mike Malloy for catching this.
Posted by David Pinto at 09:03 AM | Statistics