Baseball Musings
April 24, 2005
Learning Power

Michael Lewis, the author of Moneyball, has an excellent piece in today's NY Times magazine. It's about how the quest for power can corrupt players, moving them from what they do well (getting on base, hitting to the opposite field), to what they do poorly (small players trying to hit for power).

It's one of the best-written articles on baseball I've read all year. It's long, but it's more than worth the read. At the heart of the story is the idea that power hitting can be learned. I have a copy of a study by Sig Mejdal, a NASA biomathematician. Sig was able to discern the effects of age and experience on offensive stats. Sig's work shows that players reach peak HR age at 27. But experience (how long you've been in the majors) keeps adding to a player's ability to hit home runs, even after ten years. So maybe we shouldn't be so surprised about Roberts and Belliard. Maybe baseball should just let good hitters develop, and see if the home runs come with time.

Make sure you look at the picture of Steve Stanley. Take a good look at the size of his arms. Here's a person who's never used steroids, and look at the size of his muscles. He's a 5' 7" monster. Looks to me like you can grow a lot of muscle with a good exercise regimen.


Posted by David Pinto at 12:43 PM | Management | TrackBack (1)
Comments

Lewis is always a fun read :-)

As he did in his Moneyball book, Lewis asserts that a point of OBP is much more valuable than a point of SLG: "On-base percentage is between two and three times more valuable than slugging percentage..." I don't buy this. My guess is that whoever made this calculation (did Lewis attribute it to DePodesta in Moneyball?) made the assumption that, given the batter's hitting statistics, each plate appearance is independent of all other PAs. If this were true, we'd almost never see shut-outs or 20-run games (results would be more closely clustered around the expected value). Seeing a team "bat around" in an inning would be extremely rare. In short, correlation is everywhere; if you assume independence when it's not there, you'll draw incorrect conclusions.

I thought it would be fun to test this hypothesis. So, I took the 2004 season team statistics from Yahoo Sports and ran a linear least squares regression on runs using OBP and SLG as inputs. The results contradict Lewis' assertion. The regression minimized the squared difference with the true numbers of runs by giving OBP a weight of -0.64 and SLG a weight of 2.33, i.e., each point of SLG is worth 2.33 runs and each point of OBP is worth -0.64 runs. Does that mean that teams that draw walks score fewer runs? No, not necessarily. Since the model only gets two "knobs" to control, and since SLG and OBP are intertwined, the model could be using a negative weight on OBP to compensate for a lack of input information.
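A minimal sketch of this kind of two-input fit, in plain Python. The team rows are synthetic, generated from made-up "true" weights, and are not the 2004 Yahoo Sports figures. The function name `fit_two_inputs` is illustrative, and the sketch assumes the model includes an intercept, which the comment doesn't specify. The closed form also hints at why intertwined inputs are trouble: the denominator `det` shrinks toward zero as the two inputs become more correlated, so small changes in the data can push the two weights in opposite directions.

```python
import random
from statistics import fmean

def fit_two_inputs(x1, x2, y):
    """Least-squares fit of y ~ b0 + b1*x1 + b2*x2 (closed form on centered data)."""
    m1, m2, my = fmean(x1), fmean(x2), fmean(y)
    d1 = [v - m1 for v in x1]
    d2 = [v - m2 for v in x2]
    dy = [v - my for v in y]
    s11 = sum(a * a for a in d1)
    s22 = sum(b * b for b in d2)
    s12 = sum(a * b for a, b in zip(d1, d2))
    s1y = sum(a * c for a, c in zip(d1, dy))
    s2y = sum(b * c for b, c in zip(d2, dy))
    det = s11 * s22 - s12 * s12          # near zero when x1 and x2 are collinear
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    b0 = my - b1 * m1 - b2 * m2
    return b0, b1, b2

# Synthetic stand-in for 30 team-seasons, in "points" (.333 -> 333).
# The generating weights (2.0 and 2.0) are invented for illustration only.
rng = random.Random(0)
obp = [rng.gauss(333, 12) for _ in range(30)]
slg = [428 + 1.2 * (o - 333) + rng.gauss(0, 15) for o in obp]   # moves with OBP
runs = [-800 + 2.0 * o + 2.0 * s + rng.gauss(0, 20) for o, s in zip(obp, slg)]

b0, b_obp, b_slg = fit_two_inputs(obp, slg, runs)
print(f"intercept={b0:.1f}  OBP weight={b_obp:.2f}  SLG weight={b_slg:.2f}")
```

With only 30 noisy rows and two correlated inputs, the fitted weights can land some distance from the values used to generate the data.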

To dig deeper, I ran the same type of regression using walks, singles, doubles, triples and home runs as inputs (calculated from Yahoo Sports 2004 team data). The data doesn't have walks, but you can calculate walks from OBP, AVG and hits. The weights that come out of this will tell us the value, in terms of runs, the model assigns to each event. Also, since the model has more information, it should fit the data better. It does. For the OBP/SLG regression, the model was off by a total of 237.6 runs. The walks/singles/doubles/triples/hr model was off by 185.9 runs. The result again contradicts the hypothesis that OBP is 2-3 times as valuable as SLG. In fact, it says that home runs are far-and-away the best way to score runs. They're more than 5 times more valuable than a single and more than 7 times more valuable than a walk. Here is how many runs the model assigns to each input:

walk: 0.21
single: 0.28
double: 0.31
triple: 0.42
home run: 1.57
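The walk count used as an input above can be backed out roughly as follows. This is a sketch under a simplifying assumption the comment doesn't spell out: OBP is treated as (H + BB) / (AB + BB), ignoring hit-by-pitch and sacrifice flies, so the result is only approximate for real team lines. The function name and the sample numbers are illustrative.

```python
def walks_from_rates(hits, avg, obp):
    """Estimate walks from hits, batting average and on-base percentage.

    Assumes AVG = H / AB and OBP = (H + BB) / (AB + BB); hit-by-pitch and
    sacrifice flies are ignored, so this is an approximation.
    """
    at_bats = hits / avg
    # Solve OBP * (AB + BB) = H + BB for BB.
    return (obp * at_bats - hits) / (1.0 - obp)

# Made-up team line: 1485 hits in 5500 at-bats, 550 walks.
avg = 1485 / 5500
obp = (1485 + 550) / (5500 + 550)
print(round(walks_from_rates(1485, avg, obp)))  # recovers the 550 walks
```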

You can see why the OBP/SLG model assigned such disparate weights to OBP and SLG: SLG includes home runs, OBP doesn't; and home runs are by far the biggest factor in determining the number of runs scored.

Could the 2004 season have been a fluke? Are HRs really *that* valuable? Are doubles and triples really not worth much more than a single? I grabbed data for the '02 and '03 years and ran the same regression. Here are the results for 2002:

walk: 0.02
single: 0.34
double: 0.42
triple: 0.42
home run: 1.65

This time, walks are almost worthless; again home runs are far better than any other event. Here are the results for 2003:

walk: 0.10
single: 0.13
double: 1.14
triple: 0.90
home run: 1.25

This time the model finds that extra-base hits are all about equally valuable and all worth a lot more than a walk or single. That year (2003), the team that scored the most runs (Red Sox) hit the most doubles (371) and the 2nd largest number of home runs (238). The third-highest-scoring team (Blue Jays) hit the 2nd largest number of doubles (357) and relatively few home runs (190, 10th largest). The Detroit Tigers were second-to-last in runs (591), but 21st in home runs (153) and last in doubles (201). For whatever reason, doubles were better correlated with runs than in other years.

There are a lot of ways to score runs, but OBP doesn't appear to be nearly as valuable as Lewis is claiming...

Posted by: Jason at April 24, 2005 04:56 PM

Good stuff, Jason. If you do the math on the runs created formula, you find that you need to increase your slugging percentage 1.1 points for every 1 point you lower your OBA (that was the old advanced formula). I agree that OBA is not worth 2 to 3 times slugging in terms of scoring runs. It may be, however, that it's much more valuable in predicting who will be a good ballplayer.

Posted by: David Pinto at April 24, 2005 05:25 PM

"It may be, however, that it's much more valuable in predicting who will be a good ballplayer."

That's my thinking too. :-)

Posted by: Jason at April 24, 2005 05:35 PM

Yes, Jason, good stuff. I'm just curious: what p-values do your coefficients have?

Thanks for the information!

Posted by: Robert Tagorda at April 24, 2005 06:26 PM

" SLG includes home runs, OBP doesn't"

It doesn't? I thought it does.

Posted by: wilson at April 24, 2005 06:40 PM

Wilson, you are correct. I missed that reading through the comment the first time. OBP doesn't include the three extra bases.

Posted by: David Pinto at April 24, 2005 06:44 PM

Isn't it clear that SLG is more important than OBP? Just imagine three players: 1) one who walks every time (OBP 1.000); 2) one who hits a single every time (OBP 1.000, SLG 1.000); 3) one who hits only homers, once every four times up (SLG 1.000). I'm not familiar with the probability of scoring a run from first base, but I would guess it's less than 25%. So it's clear the guy with the home run power has the best chance of producing the most runs.

Am I being naive or ignorant?

Posted by: wilson at April 24, 2005 07:34 PM

what's with the insisting on making someone who is good at certain things that are valuable try to be good at something else instead???

good leadoff hitters are HARD to find...

and i don't see anything wrong with a third baseman hitting like a second baseman as long as the second baseman hits like the first baseman was supposed to

Posted by: lisa gray at April 24, 2005 07:41 PM

"SLG includes home runs, OBP doesn't"

I misspoke. As we all know, SLG includes 4*(# of HR), OBP includes 1*(# of HR) (silly me!) I should have said something like: the difference between SLG and OBP serves as a rough proxy for # of HR.
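To make the correction concrete, here is a small sketch of how each event feeds the two rates. It uses a simplified OBP (no hit-by-pitch or sacrifice flies), and the function name and the sample batting line are made up for illustration.

```python
def obp_slg(walks, singles, doubles, triples, homers, at_bats):
    """Return (OBP, SLG) from event counts, using a simplified OBP."""
    hits = singles + doubles + triples + homers
    total_bases = singles + 2 * doubles + 3 * triples + 4 * homers
    obp = (hits + walks) / (at_bats + walks)   # a homer counts once here
    slg = total_bases / at_bats                # ...and four times here
    return obp, slg

# Made-up line: swapping 10 singles for 10 homers leaves OBP unchanged
# and adds 30 total bases, i.e. 30 / 500 = .060 of SLG.
before = obp_slg(walks=50, singles=100, doubles=25, triples=5, homers=10, at_bats=500)
after = obp_slg(walks=50, singles=90, doubles=25, triples=5, homers=20, at_bats=500)
print(before)
print(after)
```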

Posted by: Jason at April 24, 2005 10:47 PM

Wilson- Not necessarily naive or ignorant, but the nature of the stats causes a couple of things to be overlooked in your formulation--

Although 1. gets no slugging percentage because he never hits, those walks do have advancement value; some of them will come with one or more runners on base to be pushed over either by the walk or what the next guy does.

Although 3 drives in a run every fourth time up all by himself, his homer becomes three times as valuable if 1 and 2 are hitting in front of him, because they always get on.

And conversely 3 is never on base for the guys behind him-- if they homer or hit back to back doubles, he doesn't add any value to their achievements.

And finally 1 and 2 deserve a share of 3's achievement because they are giving him a number of extra appearances. The 27 outs are being split among 7 guys instead of 9. I'm not sure I have the math to figure it-- I'm getting 26 extra PA's if all other things are equal and we assume normal 1 & 2 hitters with .350 OBPs... 6 hr and over 18 rbis.

So it looks simpler than it is-- kinda like the game itself... I think the homer is more valuable, but I can't prove it from this example...

Posted by: john swinney at April 24, 2005 11:03 PM

I believe it was Tangotiger who found the best formula to be 1.56 OBP + 1.00SLG. And by "best", I mean best at determining a team's runs scored.

Also, regarding linear weights, these are about what I have seen most frequently (sorry if memory is a little off). All numbers in terms of runs:

Walk: .3
Single: .41
Double: .75
Triple: 1.03
Home Run: 1.41
Stolen Base: .28

Posted by: Mike at April 25, 2005 08:27 AM

Robert: I don't know how I'd calculate a p-value here. I know what it means to calculate the p-value for an event wrt a distribution (I see how David could calculate p-values for his list of (non-)clutch hitters; I know what p-value means wrt a significance test), but I don't see how I'd calculate p-values for linear regression... Disclaimer: My formal statistical training is limited & spotty :)

Posted by: Jason at April 25, 2005 09:36 AM

Wilson, John: IIRC, the study Lewis talks about in the Moneyball book (DePodesta's study, I think) assumes that there is a lineup of identical hitters. My educated guess is that the study also assumes that plate-appearances are independent. In that case, OBP is extremely important. Note that if the lineup has an OBP of 1.000, it scores infinite runs (all in the first inning, btw :). If the lineup is .250 AVG, .250 OBP, 1.000 SLG, they score an average of one run per inning, nine runs per game. I just whipped up a MATLAB script to draw sample run distributions where all hitters have .000 AVG and some fixed OBP. Here's what I get when I take the average over 100,000 sample innings:

OBP .300 Runs/inning: .32
OBP .400 Runs/inning: 1.1
OBP .500 Runs/inning: 3.2
OBP .600 Runs/inning: 8.2
OBP .700 Runs/inning: 21
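For anyone who wants to experiment with this kind of identical-hitter simulation, here is a minimal Python sketch. It is not the MATLAB script mentioned above: it assumes every plate appearance is independently a walk, a home run, or an out, and that a walk only advances runners who are forced. Those advancement rules are an assumption, so its averages should not be expected to match the figures listed above. The function names are made up for illustration.

```python
import random

def simulate_inning(p_walk, p_homer, rng):
    """Play one inning; every plate appearance is a walk, a homer, or an out."""
    outs = runners = runs = 0
    while outs < 3:
        roll = rng.random()
        if roll < p_homer:
            runs += runners + 1        # batter and everyone on base score
            runners = 0
        elif roll < p_homer + p_walk:
            if runners == 3:           # bases loaded: the walk forces in a run
                runs += 1
            else:
                runners += 1
        else:
            outs += 1
    return runs

def mean_runs(p_walk, p_homer, innings=100_000, seed=0):
    """Average runs per inning over many simulated innings."""
    rng = random.Random(seed)
    total = sum(simulate_inning(p_walk, p_homer, rng) for _ in range(innings))
    return total / innings

# All-homer lineup from the comment (.250 AVG, .250 OBP, 1.000 SLG):
# the average should land close to one run per inning.
print(mean_runs(p_walk=0.0, p_homer=0.25))

# Walk-only lineups (.000 AVG) at a few OBP levels.
for obp in (0.3, 0.4, 0.5):
    print(obp, round(mean_runs(p_walk=obp, p_homer=0.0), 2))
```

Because only walks and homers occur, tracking the number of runners is enough; a fuller model with singles and doubles would need to track which bases are occupied.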

Posted by: Jason at April 25, 2005 10:11 AM

I have two articles on my site titled "OPS Begone". Please look for them there:
http://www.tangotiger.net

I also have a 3-part series on How Runs Are Created. I look at 1974-1990 data, game-by-game in article 3.

The best relationship with OBP and SLG to the Linear Weights values is 1.8 OBP + SLG. However, as your run environment changes, so does that relationship. There's another article on my site about Custom Linear Weights.

Posted by: tangotiger at April 25, 2005 10:15 AM

Mike: Here's what I get when I run the regression over all three years (02-04):

walks: 0.11 singles: 0.24 2b: 0.67 3b: 0.48 hr: 1.48

If you combine doubles and triples (which as some have noted are quite similar in their ability to create runs):

walks: 0.12 singles: 0.24 2b3b: 0.65 hr: 1.49

Stolen bases have been inversely correlated with runs of late:

sb: -0.07 walks: 0.12 singles: 0.25 2b3b: 0.64 hr: 1.47

Of course, this doesn't mean that a successfully stolen base is bad. The model is probably using SB as a proxy for CS :-)

Posted by: Jason at April 25, 2005 10:25 AM

Tango: I'm reading your OPS begone articles... it sounds like you're finding OBP/SLG weights that best fit "BaseRuns," which is a formula you've devised. But, there may be little or no relation between your "BaseRuns" and the actual runs produced in real major league baseball games. Btw, I just ran the regression on OBP and SLG for the 02-04 data. Again, OBP gets a negative weight:

OBP: -0.99 SLG: 2.59

Posted by: Jason at April 25, 2005 10:38 AM

I tried the regression for the top 10 OBP teams over 02-04 (counting different years as different teams); singles and walks get higher weight---the value of walks and singles goes up when you have more men on base (as you'd expect :-)

sb: 0.05 walks: 0.15 singles: 0.29 2b3b: 0.63 hr: 1.29

And SBs help. Maybe high OBP teams have higher SB success rates...

Posted by: Jason at April 25, 2005 11:00 AM

Jason, you said:

" But, there may be little or no relation between your "BaseRuns" and the actual runs produced in real major league baseball games"

Actually, BaseRuns was not developed by me. Please read the "How Runs Are Really Created" articles. You will also find, in Article 3, that BaseRuns models baseball very well.

Just as a little snippet, and not to be used to prove what I'm saying, here's a little table in article 3, where I take each game, and group them by number of HR hit:
Runs Scored, breakdown by HR hit

HRclass       n      R    BsR   LWTS     RC
0        33,068   3.08   3.06   3.79   3.03
1        23,117   4.62   4.62   4.44   4.66
2         9,218   6.12   6.12   5.00   6.41
3         2,838   7.65   7.65   5.62   8.37
4           687   9.03   9.00   6.07  10.29
5           146  10.55  10.49   6.73  12.45
6            40  12.33  12.32   7.52  15.35
7             9  16.22  14.32   8.34  18.27
8             2  14.00  15.87   8.58  22.52
10            1  18.00  18.30   9.51  27.03


Posted by: tangotiger at April 25, 2005 12:40 PM

Tango: You appear to be assuming independence of plate appearance events. That's a dangerous assumption to make, as it is often violated in practice. Btw, I think your "Run Driving value of the walk" table (Article 1) is wrong. A walk only drives in a runner from 3B if the bases are loaded, but you appear to be assuming that a walk will always drive in a runner from third base.

Only way I know to determine how well something models data is to have the model try to predict future or held-out events. It appears that you're evaluating on the same data that you used to build the model (Article 3). Maybe I'm not reading you correctly? Have you tried predicting runs for a recent season?
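The held-out check described here can be sketched as follows: fit on one set of team-seasons, then measure the error on seasons the fit never saw. To keep it short, the sketch uses a single input (home runs) and synthetic data; the 1.4 runs-per-homer slope, the noise level, and all function names are invented for illustration.

```python
import random
from statistics import fmean

def fit_line(xs, ys):
    """Least-squares intercept and slope for a single input."""
    mx, my = fmean(xs), fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return my - slope * mx, slope

def mean_abs_error(intercept, slope, xs, ys):
    """Average absolute gap between predicted and actual values."""
    return fmean(abs(intercept + slope * x - y) for x, y in zip(xs, ys))

def make_season(rng, teams):
    """Synthetic (home runs, runs) pairs; the generating numbers are made up."""
    hrs = [rng.randint(120, 240) for _ in range(teams)]
    runs = [500 + 1.4 * h + rng.gauss(0, 40) for h in hrs]
    return hrs, runs

rng = random.Random(1)
train_x, train_y = make_season(rng, teams=60)   # stands in for two past seasons
test_x, test_y = make_season(rng, teams=30)     # stands in for the season to predict

b0, b1 = fit_line(train_x, train_y)
print("in-sample error:", round(mean_abs_error(b0, b1, train_x, train_y), 1))
print("held-out error: ", round(mean_abs_error(b0, b1, test_x, test_y), 1))
```

The held-out number is the fairer one to quote, since the in-sample error is measured on the same rows the weights were tuned to.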

I ran the regression on team statistics to avoid having to impose possibly incorrect assumptions.

I think your articles are interesting. I think you have a lot of good insight (e.g. that run value of any non-out goes to 1 as OBP goes to 1). I hope you don't take my comments the wrong way---my goal isn't to be negative, rather I want to understand the inner workings of baseball a bit better :)

Posted by: Jason at April 25, 2005 02:25 PM

No, I don't assume independence. In BaseRuns, the events are completely interdependent. This is why, when you look at my charts of the run values of each event, they are dynamic. Linear Weights assumes independence, as does your regression model.

The "run driving" value is not runs that crossed the plate at that moment in time. I suppose I should have called it "run moving". It's how much run value the event adds through its effect on the runners already on base. A walk with a runner on 1st moves the runner to 2B, and thereby adds runs that way.

I didn't take your comments in the wrong way.

Posted by: tangotiger at April 25, 2005 03:04 PM

Yes, you capture some of the dependence; it looks to me that you're assuming independence of PA events given # of outs and base runner configuration. There's no dependence on the pitcher, the stadium, the time of day, the weather conditions, what happened in previous innings, etc.

Re: "run driving". Ah, I see what you're doing: the next table only counts cases where the runner is actually moved.

If you want to build a model for the game, I think your "run expectancy" is a very good approach. But, if you simply want to estimate the value of events in terms of runs, I think regression is a better approach since it's more direct.

A model that assumes independence is necessarily linear (at the level of independence), but a linear model is not necessarily implicitly assuming independence of events. Why do you think that linear regression assumes some sort of independence?

Posted by: Jason at April 25, 2005 05:03 PM

Jason, at the end of article #3, I said:

"The holy grail would be Win Expectancy that includes the hitting team, the fielders, the pitchers, the park, the inning, the score differential, the base-out states, [the count], and the batting order."

So, BaseRuns assumes that independence, but *I* am fully aware that a dependence exists.

***

You are wrong that regression is a better approach, since you've got a serious sample size issue to consider. If you post the error estimate in your coefficients, you will see this to be the case. Just looking at your year-to-year coefficients shows it.

Run expectancy (RE) is the best way to handle this situation. In addition, in my RE model, I include WP, PB, reached on error, HBP... in fact, every single event in baseball. A regression model would have a serious error estimate for these low frequency events. An RE model has no such limitation. Once you know exactly how runs are created, then you know exactly how much every event is worth, within its context.

***

Your linear regression model says that every HR will add 1.4 or so runs each time it happens. But, that's true *only* for the average run environment of your sample. If this was a run environment where the OBA was .100 or .700, this would not be the case. I'm not trying to say whether it's reasonable that you can have a .100 or .700 environment, but only to say that your 1.4 run value for the HR is either assuming complete independence from the frequency of all other events, or it works specifically only in that particular run environment (which means it assumes complete dependence on the frequency of all other events).

Posted by: tangotiger at April 25, 2005 05:37 PM

What do you mean by "error estimate in your coefficients"? The regression simply finds coefficients to minimize the squared difference between predicted runs and true runs. Also, there's no reason why more events (WP, PB, etc.) couldn't be included in the regression. I think there would be much less variability in the coefficients if I were to use individual game data rather than data summed over the season.
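For reference, an "error estimate" on a regression coefficient usually means its standard error: a measure of how far the fitted weight could move if the data were redrawn. A minimal sketch, assuming ordinary least squares with an intercept; the function names and the four-point data set are made up for illustration.

```python
from math import sqrt

def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * p for x, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def ols_with_standard_errors(rows, targets):
    """Return (weights, standard_errors), intercept first."""
    xs = [[1.0] + list(r) for r in rows]
    n, k = len(xs), len(xs[0])
    xtx = [[sum(x[i] * x[j] for x in xs) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * t for x, t in zip(xs, targets)) for i in range(k)]
    weights = solve(xtx, xty)
    residuals = [t - sum(w * v for w, v in zip(weights, x))
                 for x, t in zip(xs, targets)]
    sigma2 = sum(r * r for r in residuals) / (n - k)   # residual variance
    # Diagonal of inverse(X'X), one unit vector at a time.
    inv_diag = [solve(xtx, [float(i == j) for i in range(k)])[j]
                for j in range(k)]
    return weights, [sqrt(sigma2 * d) for d in inv_diag]

# Tiny made-up data set: one input, four observations.
weights, errors = ols_with_standard_errors([[0], [1], [2], [3]], [0, 1, 1, 2])
for name, w, se in zip(["intercept", "slope"], weights, errors):
    print(f"{name}: {w:.3f} +/- {se:.3f}  (t = {w / se:.2f})")
```

A weight whose standard error is about as large as the weight itself is not pinned down by the data, and the ratio of weight to standard error (the t value) is what a p-value for a regression coefficient is computed from.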

Yes, the linear regression model is completely dependent on the data. If .700 OBP were the norm, the coefficients linear regression would find would be very different.

We could compare the methods by trying to predict total runs produced for teams in 2005. Interested?

Posted by: Jason at April 25, 2005 07:30 PM

No, I'm not interested, but I've published my equations so that anyone can use them.

Oh, your coefficients were based on team/seasonal data? That explains why they are rather poor coefficients.

Jason, I've gone through this hundreds of times, which is why I have zero interest. I highly recommend you read my site, and feel free to make use of what you can. Also, don't forget the outs, though on a game level, since most have 27, they won't have an impact. However, don't force the intercept to zero.

You can get complete game data at Retrosheet for the last 30 years.

Posted by: tangotiger at April 25, 2005 09:16 PM