It's been interesting to read the magazines, watch TV, and chat with basketball fans -- everyone making predictions and telling me why so-and-so will win (or can't win) it all. Is there a reason Roy Williams hasn't won a title yet? Can Illinois defend inside -- either against a good post player or against penetration? Will May step it up on his birthday? The running theme in all of these comments is that one particular "reason" will determine the game. In other words, the outcome is predestined, and the only question is whether Roy Williams' history is worse than Illinois' interior defense.
In stark contrast are my computer rankings. They don't know what Williams' postseason record is. Nor do they know how many points in the paint Illinois gave up or how many transition baskets UNC got -- or, for that matter, anything except the scores of past games. Instead, it's just straightforward math based on the assumption that the outcome of a game is determined by the difference between the team strengths, plus or minus a bit of luck. As noted here, this approach seems to work, as details like matchups (the sorts of things the talking heads spout about endlessly) simply don't have any measurable effect on game outcomes. Rather, a good team is a good team, and a bad team is a bad team.
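For the curious, here is a minimal sketch of that assumption in code. The ratings and the 11-point scatter below are illustrative stand-ins rather than my actual parameters; the point is just that a strength difference plus normally distributed luck already pins down a win probability.

```python
import math

def win_probability(rating_a, rating_b, sigma=11.0):
    """P(team A beats team B) if the final margin is (rating_a - rating_b)
    plus normally distributed luck with standard deviation sigma."""
    z = (rating_a - rating_b) / (sigma * math.sqrt(2.0))
    return 0.5 * (1.0 + math.erf(z))   # standard normal CDF of (margin / sigma)

# Illustrative numbers: a 3-point edge with 11 points of game-to-game
# scatter already works out to roughly a 60% win probability.
print(round(win_probability(82.0, 79.0), 3))   # ~0.61
```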
So what did the numbers say about the title game? From the equations, I gave a 0.604 probability of a North Carolina win, with an expected score of 79-76. Based on the comparison with my ever-growing college basketball score library (3345 games like this one), the odds were 0.606 with a median outcome of 79-76. The total score would be 155 +/- 19 (the +/- indicating the 84% confidence limits), and the difference 3 +/- 11. In other words, pretty much dead on -- the actual score was 75-70.
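As a rough sketch of what the score-library comparison does (the "library" below is simulated, since I obviously can't reproduce the real one here), you take the margins from past games between similarly matched teams and read the win fraction, the median, and the 16th/84th percentiles straight off the data:

```python
import numpy as np

# Stand-in for the real library: margins (favorite minus underdog) from
# games between similarly matched teams, here just drawn from a normal
# distribution for illustration.
rng = np.random.default_rng(0)
margins = rng.normal(loc=3.0, scale=11.0, size=3345)

win_fraction = (margins > 0).mean()          # empirical win probability
median_margin = np.median(margins)           # median outcome
lo, hi = np.percentile(margins, [16, 84])    # the "+/-" band quoted above

print(round(win_fraction, 3), round(median_margin, 1), round(lo, 1), round(hi, 1))
```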
This marks the end of a fairly remarkable run, in which I picked every game correctly in the final three rounds. I'd love to gloat, but unfortunately my professionalism forbids it. You see, as noted above, my picks are merely probabilities. That UNC had a 60 or 61% chance of winning would have been true regardless of whether or not they actually won. Here's a simple illustration: the probability of rolling a 4 or less on a 6-sided die is 2/3; rolling the die and getting a 5 doesn't mean that you were wrong. So certainly the odds of those seven games coming out as I "predicted" were higher than for any other possible combination, but it still required a lot of luck to go 7-0. Indeed, I would have expected 2.3 upsets in those 7 games; in that sense my prediction was wrong.
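To make that concrete, here is the arithmetic with a set of illustrative favorite win probabilities (not my actual seven numbers, though they are chosen so the expected upset count lands near 2.3): the chance of the whole set going chalk is just the product, while the expected number of upsets is the sum of the underdogs' chances.

```python
# Illustrative favorite win probabilities for seven games; chosen so the
# expected number of upsets comes out near the 2.3 quoted above.
favorite_probs = [0.80, 0.80, 0.70, 0.66, 0.60, 0.56, 0.55]

p_all_favorites = 1.0
for p in favorite_probs:
    p_all_favorites *= p               # every favorite wins ("7-0")

expected_upsets = sum(1.0 - p for p in favorite_probs)

print(round(p_all_favorites, 3))       # ~0.05 -- a 7-0 run takes real luck
print(round(expected_upsets, 2))       # ~2.3 upsets expected
```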
At any rate, I hope that this look at the mathematical side of March Madness has been interesting and informative. The two things I hope you take away: first, that game outcomes contain a great deal of genuine randomness, so a single result neither proves nor disproves a prediction; and second, that while nobody can tell you who will win, a good statistical model can tell you the probabilities, and those probabilities hold up.
It's certainly nice to have a round fall entirely into line with my predictions, though that really doesn't say much. This is the problem with posting probabilities instead of pretending that there is such a thing as a 'sure-fire' pick. In other words, the fact that I "picked" Michigan State is offset by the fact that I gave them only a 56% chance of winning. In fact, if you multiply the probabilities of my four picks, you'll see that I gave myself only a 24% chance of getting all four games "right". At any rate, I'll refrain from making too much of this; certainly having a better statistical model makes you more likely to go 4-for-4, but it still takes luck.
A quick look at the teams that made the final four. Illinois and North Carolina are the two teams that were supposed to make it. Not only were they given the top two #1 seeds by the committee, but they were also ranked #1 and #2 in my predictive rankings. Recall from my initial statistical post that I gave UNC better than even odds of getting this far, and that Illinois was second at 42%.
Next up is Louisville, the #4 seed in the region that had Washington, Wake Forest, and Gonzaga as the 1-3 seeds. What jumped out at me upon first seeing the bracket was that this was the one region in which the #1 seed was not a clear favorite. While Washington was ranked 6th in my "standard" ranking (based on wins and losses), the predictive system put the Huskies down at #13. Likewise, Gonzaga was overrated -- #13 standard, #29 predictive. So in breaking down the odds, I had Wake with a 33% chance of winning the region, Louisville with 22%, and Washington at 17%.
Fourth is Michigan State, the #5 seed in their region. Once again, a comparison of standard and predictive rankings made it clear from the start that Michigan State had better odds than your typical 5-seed; they were #16 in the standard system and #9 in the predictive system. That said, #1 seed Duke was a bona fide 1-seed, in that they stood third in the predictive ranking behind UNC and Illinois, and I gave the Blue Devils a 41% chance of winning the bracket. Behind them were Oklahoma (20%), Michigan State (13%), and Kentucky (12%). On the one hand, Michigan State was by a wide margin the least likely of the four region winners. On the other hand, the expected number of teams with between 10 and 19% probability in the final four was 1.1, so it's hardly a surprise that one of those teams made it.
If anything, the first two weeks went more according to plan than would have been expected. You may recall the 65% probability (misprinted in my earlier post) that a team not on my first list would advance to the final four; allowing for the possibility of more than one such team advancing, the expected number of these teams was 1.0, with a standard deviation of 1.0. So having zero is not all that surprising (nor would two have been), as it is well within the random uncertainties.
Finally, updated odds of making the title game and of winning the title game:
Team             Title Game   Champion
North Carolina   70%          44%
Illinois         55%          25%
Louisville       45%          18%
Michigan State   30%          13%
With only good teams remaining, the games get closer and the upsets keep coming. This round there were four (Louisville, West Virginia, Arizona, and Michigan State). The predicted number was 3.2, so once again the upset count is right in line with expectations.
Here are the odds of the eight teams still alive making the final four, the title game, and of winning it all:
Team             Final Four   Title Game   Champion
North Carolina   80%          58%          38%
Illinois         66%          40%          20%
Louisville       80%          41%          19%
Michigan State   56%          20%          9%
Kentucky         44%          14%          5%
Arizona          34%          15%          5%
Wisconsin        20%          8%           3%
West Virginia    20%          4%           1%
What I find interesting is that two of the four games have a clear favorite (UNC and Louisville are each roughly 4-to-1 favorites), while Illinois is about a 2-to-1 favorite. The only game with close to even odds is Michigan State vs. Kentucky.
Looking ahead, UNC is still the favorite to go the distance, but its numbers are largely unchanged from last time (as are Illinois'). The teams that really moved up are Louisville and Michigan State -- Michigan State from beating Duke, and Louisville by beating Washington and facing West Virginia rather than Texas Tech.
OK, one week of play is done, and the Cinderella teams are all the rage. The talking heads are going on and on about how UW-Milwaukee was underrated, played with "heart", has players who can come through in the clutch, played like a team, etc. In other words, all of that wishy-washy nonsense you usually hear made up to explain something unusual.
At the same time, the folks who are always whining about computer rankings are crowing about how computer predictions "failed", in that certain upsets (Bucknell, Vermont, UW-Milwaukee) were not predicted.
Both of these arise from the same mistake -- ignoring that sports are very random. And I truly mean random, not merely unpredictable. Think about it: if sports had only a small degree of randomness, most NBA and NHL playoff series would be sweeps. However, the fact that series are even needed indicates that we all understand that the better team won't always win.
What a lot of us seem to forget is that because the better team doesn't always win, there isn't always a rational reason for a game's outcome. Sometimes a team wins because it's more talented, better coached, has better chemistry, or whatever. And sometimes a team wins because the ball bounced its way.
So back to the NCAA tournament. Based on my predictions, there should have been 13.5 upsets, with a standard deviation of 2.9, in the first two rounds (48 games). In other words, anything from 11 to 16 would have been totally normal. The actual number of upsets was 14. Here's a second stat of interest. There were 14 teams that were long shots in the first round, each with under a 1-in-5 chance of winning. The expectation was that 0.9 such teams would advance, and indeed one (Bucknell) did advance. Does this mean that Bucknell had heart, had clutch players, matched up well, or had something else that made their upset likely? No, they just rolled the dice and came up lucky.
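If you want to check the arithmetic, treat each game as an independent coin flip weighted by the favorite's win probability; the expected number of upsets is the sum of the underdogs' chances, and the standard deviation follows from the binomial variance. The flat 72% used below is an illustrative assumption (my actual probabilities ranged from near-locks to toss-ups, which pulls the standard deviation down a bit):

```python
import math

# Stand-in favorite win probabilities for the 48 first- and second-round
# games; a uniform 72% is just an illustrative assumption.
favorite_probs = [0.72] * 48

expected_upsets = sum(1.0 - p for p in favorite_probs)
std_dev = math.sqrt(sum(p * (1.0 - p) for p in favorite_probs))

print(round(expected_upsets, 1), round(std_dev, 1))   # roughly 13.4 and 3.1
```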
Put differently, the statistical models indeed work, despite ill-informed people claiming to have proven otherwise. The problem is that sports simply aren't that deterministic -- nobody has a crystal ball, but with a good statistical model you can get the probabilities of various outcomes.
Now on to the remaining analysis. Speaking of predictions, after the first round I said that 2.1 +/- 1.2 of the #9-16 seeds would survive the second round, with the most likely two being NC State and UW-Milwaukee. And before any games were played, I said to expect 2-3 #9-16 seeds to make the sweet 16. Sure enough, exactly those two (NC State and UW-Milwaukee) made it. Score one for the statistical model.
Since we're down to eight games, here are the favorites' odds:
Finally for the updated odds of making the final four and winning it all:
The big winners in the second round were Louisville, Kentucky, Villanova, and Washington. Louisville and Washington both profit from Georgia Tech's loss (Louisville more, of course), Kentucky from Utah's upset of Oklahoma, and Villanova because they upset Florida.
We had eight first-round upsets, which is about what was expected from my win probabilities. This is fairly significant, as it means that NCAA tournament games are just as predictable as regular games.
The biggest upset of the first round was Bucknell, which had an 11% chance of beating Kansas. Of course, by virtue of knocking off the best team in their part of the bracket, they actually have a higher chance (20%) of winning in the second round, as their opponent (Wisconsin) is good, but not as good as Kansas.
Of the eight teams that notched upsets, most should be knocked out this round. Iowa State has a 12% chance of surviving, Nevada 13%, and Mississippi State 19%. This is the joy of being a #9 seed -- you've got a reasonable shot in the opener, but then face the #1 seed in the next game. But we'll likely have one or two of these teams in the sweet 16, as NC State has a 48% chance of making it, UW-Milwaukee 39%, UAB 30%, and Vermont 24%. The combined expectation for these eight teams is 2.1, with an uncertainty of 1.2. In other words, expect between 1 and 3 of the #9 and lower seeds to advance.
Now for the updated odds of making the final four and winning it all:
The big "winners" in the first round were Illinois and Michigan State. For Illinos, the improvement in their outlook comes from not having to face Texas in the second round, and to a lesser extent, not having to face Alabama in the third round. Michigan State, of course, profited from the upset of Syracuse. Georgia Tech moved up on the list after beating a first-round opponent (George Washington) that had a 32% chance of beating them. Ditto for Arizona, which moved onto the list after beating Utah State. Duke's outlook actually went down by a noteworthy margin for the other reason -- Mississippi State is the tougher second-round opponent, so the possibility that Stanford could have upset them improved Duke's prospects.
Odds are still fair (56%) that a team not in this group will make the final four, with the odds of an outside team winning it all being down to 10%. UConn, Wisconsin, NC State, and Villanova are the other teams with more than a 1% chance of winning.
The "opening round" has come and gone. I'll spare the usual rant about the screwiness of a 65-team tournament. Instead, it's interesting to look at last night's game, which featured 18-13 Alabama A&M against 12-18 Oakland. According to virtually everyone, Oakland was a fluke team that had stunk it up all season (finishing 9-18 before their conference tournament) but managed to win the three games when it counted. In other words, this game would be a gimme for Alabama A&M, which really was the best team in its conference (the SWAC).
Enter my computer rankings, which listed Oakland at #194 in the nation by wins and losses (actually ahead of two MCC teams with better records), and #172 in the predictive system, which makes them the third-best team in their conference, behind Oral Roberts and IUPUI. When a team's predictive ranking is better than its win-loss ranking, it is generally because the team has lost some close games. While the average fan seems to think a team with a bunch of narrow wins has some special ability to "step it up when it counts", in fact teams with closer-than-expected wins generally perform less well in future games. Thus, perhaps counterintuitively, a team that has lost a disproportionate number of close games is probably better than you would expect from its record.
In comparison, Alabama A&M's rankings were #280 based on wins and losses and #277 in the predictive system. To be fair, it's not just the average scores that were different between the two teams; their schedules were very different as well. To pick two comparison teams that themselves had similar schedules, Alabama A&M effectively played 5-22 Cal Poly 31 times, while Oakland effectively played 20-8 Texas A&M-Corpus Christi 30 times. Thus, Oakland's 12-18 record against that quality of competition was at least as impressive as Alabama A&M's 18-13 record against its weak opponents.
So to summarize, the common wisdom was that Alabama A&M would cruise past Oakland and head to its "real" first-round game against UNC. However, Oakland had two very important things going for it. First, it was a little better than its record indicated, and the way we know this is that it tended to "choke" in close games (7-13 in games decided by 11 points or fewer). Second, it played a much tougher schedule, and the way we know this is from my schedule strength calculation -- a rather involved calculation that I am often criticized for.
At any rate, plugging the two teams into my predictor (using the ratings from before today's update, of course, since today's update includes last night's game) gave a 75% chance of an Oakland win. In addition, the predicted score was 77-69 by the equations, and 77-69 from the score database. The actual score, of course, was 79-69 Oakland.
Now obviously it's only one game, and not a particularly important game at that (as neither has a realistic shot of advancing any further). However, given all the hate mail I receive whenever a game result is contrary to my predictions (upsets do happen, after all), it's always nice to crow a little bit when a seemingly screwball prediction by my system actually turns out to be correct.
Since selection Sunday, I've started running through the calculations to figure out the odds of teams winning the various potential matchups. This isn't as simple as you might imagine, as the odds of a team winning a third-round matchup aren't just the odds of that team beating the best team it could face. Instead, they equal the sum, over all possible opponents, of the probability of facing that opponent times the probability of beating that opponent. Technical details aside, what does this mean?
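Here is a minimal sketch of that calculation for a single 1/8/9 pod. The ratings and first-round survival odds are made-up numbers, and the win-probability model is the same normal-margin assumption sketched earlier; the point is only the structure of the sum.

```python
import math

def p_beat(rating_a, rating_b, sigma=11.0):
    """Win probability from a rating difference (normal-margin assumption)."""
    return 0.5 * (1.0 + math.erf((rating_a - rating_b) / (sigma * math.sqrt(2.0))))

def p_advance(team, ratings, reach):
    """P(team survives this round) = P(team is still alive) times the sum over
    possible opponents of P(that opponent is the one it faces) * P(team wins)."""
    total = sum(p * p_beat(ratings[team], ratings[opp])
                for opp, p in reach.items() if opp != team)
    return reach[team] * total

# Hypothetical ratings and first-round survival odds for a 1-seed that will
# face the winner of the 8/9 game in the second round.
ratings = {"Seed1": 88.0, "Seed8": 80.0, "Seed9": 79.0}
reach   = {"Seed1": 0.97, "Seed8": 0.55, "Seed9": 0.45}

print(round(p_advance("Seed1", ratings, reach), 3))   # roughly 0.75
```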
First off, the possible Cinderella teams. Utah State has a 30% chance of making the sweet 16 and a 13% chance of making the elite eight. Not bad for a 14-seed. NC State is the most likely bottom-half seed to make the sweet 16, at 34%, while Pitt's chance is 26%. Of course, a lot of teams have non-negligible (5-10%) chances of making the sweet 16, and the expected number of bottom-half seeds in the sweet 16 is 2 or 3. There is also a reasonable chance that one team seeded 9th or worse will make the elite eight.
Now for the odds of making the final four and winning it all:
Odds are fair (64%) that a team not in this group will make the final four, but there is only a 12% chance that another team will win the tournament altogether. UConn, Alabama, Georgia Tech, and Villanova are the other teams with more than a 1% chance of winning.
As noted in my RPI analysis, this year's version of the RPI formula exaggerates the home court advantage and has produced some very screwball rankings. Thankfully the selection committee realized this and broke from its traditions. Going by precedent, Miami of Ohio and Buffalo would have been slam-dunk selections, with RPI rankings of #28 and #31, respectively. The fact that their "improved RPI" rankings were #46 and #36 indicates that the errors in the NCAA's RPI formula are what made them look like locks, but the fact that these teams don't belong is most clearly seen in my computer rankings (#59 and #53 based on wins and losses; #69 and #72 based on scores). The selection committee clearly saw that the RPI was not an accurate indicator of these teams' quality, and wisely chose to leave them out.
On the flip side, I had left Iowa State and NC State off my predicted bracket because of their poor RPIs (#79 and #69, respectively). As with Miami and Buffalo, switching to my improved RPI (#56 and #52) shows that their poor RPIs are in error. And as with those schools, the issue is made even more clear when looking at my computer rankings: Iowa State ranks #43 based on win-loss and #40 based on scores; NC State ranks #36 and #18, respectively.
So kudos to the selection committee for applying common sense to the situation, and removing the two "precedent locks" that least deserved to play while including the two "precedent snubs" that most deserved to play.
This isn't to say that the selection committee went totally by objective ranking systems. Had they downloaded my rankings and picked the top-ranked at-large teams available, teams ranked #48 or better would have been selected, and those ranked #49 or worse would have been left out. There were three exceptions in each direction.
On one hand, the committee selected UCLA (ranked #51 win-loss/#67 scores), Northern Iowa (#52/58), and UAB (#57/46). My theory is that UCLA's selection was largely based on a fairly good RPI (#43), but more importantly that they tied with Stanford for third in the Pac-10. (Indeed, UCLA was on my predicted bracket.) Northern Iowa was a bit of a surprise, as they finished a game behind Wichita State in the Missouri Valley. The errant RPI had Northern Iowa ranked higher (which probably explains the selection), but my corrected RPI and my win-loss ranking system gave the edge to Wichita State. (Frankly, I wouldn't have taken either.) The third surprise selection was UAB, which like UCLA appears to have gotten in with the notion that a major conference deserves four bids. At #57 in my win-loss ranking and #60 in the real RPI, there really isn't a compelling reason to include them.
The three teams ranked #48 or better who were snubbed were Maryland (#40/35), Texas A&M (#41/32), and Ohio State (#44/31). What is noteworthy is that all three ranked very well in the margin-of-victory ranking system, which usually goes hand in hand with selection -- but not this time. Most likely, the committee simply felt that it had picked enough teams from these conferences: Maryland and Ohio State would have been the 6th from their leagues, and A&M the 7th. The poor RPI rankings (#65, #86, #61) supplied the justification for leaving them out. (To be fair, this isn't a result of the way the NCAA calculates the RPI -- my improved RPI also had all three teams ranked poorly. This is just the problem with the RPI in general.)
Now looking at the seedings, I see three bottom-half seeds (#9 or lower) that have fairly decent shots at turning some heads. The biggest surprise could be Utah State, which was given a #14 seed and a first-round matchup with Arizona. Utah State can probably thank the errant RPI calculation for this, as there is little else to explain why Penn got a #13 seed while Utah State fell to a #14. Regardless, Utah State is #21 in my predictive ranking, while Arizona is #17. Doing the math, I get a 47% chance of the first-round upset. Utah State would be favored against either second-round opponent (LSU or UAB). All in all, I'm giving Utah State a 29% chance of making the sweet 16.
The other two schools would hardly be considered "Cinderellas", as both come from major conferences, but Pitt (#9 seed, #23 predictive) and Iowa (#10 seed, #25 predictive) are both good enough to have gotten top-half seeds if the process had been based on predictive team quality. Pitt managed to draw a first-round game against Pacific (#49 predictive), and has a 65% chance of advancing. Should they advance, they would most likely play #1 seed Washington (by far the worst of the 1-seeds) and have about a 38% chance of winning that game. So that's a 1-in-4 shot of Pitt making the sweet 16.
Finally, there's Iowa, which plays #7 seed Cincinnati (#22 predictive) in the opening round, with a 48% chance to win. They would also be underdogs in the second round against Kentucky (#14 predictive), for a total of a 1-in-6 chance of making the sweet 16.
Cinderella stories are always compelling, but let's finish this off with what we all want to know: who's going to win? Division I has three teams that stand well above everyone else right now: North Carolina, Illinois, and Duke. The fourth-best team, Florida, is basically indistinguishable from a slew of other teams: Wake, Louisville, Oklahoma State, etc. So in the final analysis, the question is how many quality opponents our top three will have to face.
Starting with Illinois, the best team they will have to face before the sweet 16 is Texas, which they have a 78% chance of beating. In the next round, their toughest opponent would be Alabama (68%). Most likely, Illinois' first tough test will come against Oklahoma State (which shouldn't have too much trouble with any of its bracket opponents), and I'll set the odds of an Illinois win there at 60%. Altogether, this works out to about a 35% chance that Illinois makes the final four.
The second seed in the tournament is UNC, which is the best team in the country right now. The Tarheels have no significant opponents until the third round, when they will get either Florida or Villanova. Unfortunately, both of those teams are way better than their seeds indicate, making for about a 72% chance that UNC advances. The next game should be equally tough, with Kansas and UConn both being very good opponents. Wisconsin could also advance out of that half of the region with some luck, and would make things easier for UNC. Taken together, I'm giving UNC a 50% chance of making the final four.
Finally, there's Duke, which has had some high-profile meltdowns but is still among the nation's elite. Duke should breeze to the sweet 16 (about a 15% chance of losing to Mississippi State) but will be challenged there, as both likely opponents (Michigan State and Syracuse) are good. I'll give them a 65% chance of winning in the third round; in the regional final they would most likely face Oklahoma, for a total of about a 35% chance of advancing to the final four.
As for the fourth region, there are plenty of good teams, none of which is a serious contender. Washington, Pitt, Georgia Tech, Louisville, and Wake are all quality teams, so I'll have to go with Wake, as the selection committee managed to put the other four all in the same half of the bracket.
Update: March 18, 2005
The RPI numbers listed above turn out to be off. The reason is that the home/road corrections apply only to a team's own record, not to its opponents' strengths. I confess that I don't quite understand this -- if we're finally acknowledging that a team that played a lot of road games is better than its record indicates, isn't the same true of its opponents? At any rate, my error overrated the midmajors (who tend to play more road non-conference games), so some corrections are in order.
First off, the selection committee broke no historical precedents. All 29 teams I would have considered "locks" (RPI in the top 32, or RPI in the top 50 and a major conference team with 20+ wins) were selected. All teams I considered "out" (RPI of 69 or worse) were left out. Thus Miami-Ohio and Buffalo were on the bubble instead of locks, while Iowa State and NC State were on the bubble instead of being left out. And instead of being a questionable selection, UAB was moved to "lock" status by virtue of their being in the top 50 of the RPI.
Applying my usual criteria (picking the top teams off the bubble according to my standard ranking) worked decently, grabbing #49 Stanford, #43 Iowa State, and #36 NC State. For the final two teams, though, the committee picked UCLA and Northern Iowa, while I picked Maryland and Ohio State. Granted, I didn't expect Maryland to be picked (16 wins isn't enough), while I did expect UCLA to be picked (for the reasons I gave in my original writeup). The other flip was Northern Iowa for Ohio State, which is probably a combination of Northern Iowa's RPI (#37 -- they would have been the highest-ranked team left out) and Ohio State's conference rank (6th).
For those who haven't been paying attention, the RPI formula was changed for this season. There were two changes of significance. The one that people have focused the most on is that game locations are finally accounted for. A game now effectively counts as 0.6 games if the home team wins, or 1.4 games if the road team wins. I like the concept of applying the multipliers, but let's take a closer look to see if this was done correctly.
From both of the home court advantage calculations on my site, I estimate that if two identical teams play, the home team will win about 64% of the time. Suppose these two teams played 50 games on one team's home floor; the home team would then have an expected record of 32-18. Applying the NCAA's correction, this becomes 19.2 wins and 25.2 losses, or a revised winning percentage of 43%. In other words, the NCAA is badly overcorrecting for game locations, to the point that a team is now best served by playing away from home as much as possible in order to "game" the RPI system.
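A few lines of code reproduce that arithmetic; the function below is just a sketch of the location-weighted winning percentage described above, not the full RPI.

```python
def weighted_win_pct(home_wins, home_losses, road_wins=0, road_losses=0):
    """Winning percentage under the 2005 RPI location weights: home wins and
    road losses count 0.6, while road wins and home losses count 1.4."""
    wins = 0.6 * home_wins + 1.4 * road_wins
    losses = 1.4 * home_losses + 0.6 * road_losses
    return wins / (wins + losses)

# Two identical teams, 50 games on one team's floor: the host goes 32-18
# straight up (a .640 winning percentage), but the weighted formula scores
# that as 19.2 "wins" and 25.2 "losses".
print(round(weighted_win_pct(32, 18), 3))   # ~0.432 -- a losing record
```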
The problem seems to be that the folks who handle the RPI don't realize that the average home team is better than the average road team. That is, non-conference games usually feature a worse team visiting a better team. As a result, the fact that home teams win 69% of games cannot be taken at face value as the size of the home court advantage. (Indeed, were 69% the magic number -- that is, if two equal teams really produced a 69% home winning rate -- the 1.4 and 0.6 weights would be about right, since those weights balance two equal teams out exactly at a 70% home winning rate.)
In short, because the NCAA refuses to hire competent statisticians, they have created a problem that is nearly as bad as the one that they were trying to fix. The net result is that teams with an excess of road games (mostly midmajor teams) have been systematically overrated.
The second change of note to the RPI is that quality wins and bad losses are no longer factored into the RPI formula. This is a flat-out bad change. While I am happy that there is no longer a "secret" component to the RPI that nobody knows, the fact remains that any statistically sound win-loss ranking system counts the big wins and bad losses the most. So even though this was merely an ad hoc correction to a bad ranking system, it was certainly something that made it resemble a better ranking system.
Note: if you use any of the facts, equations, or mathematical principles introduced here, you must give me credit.