First, you are judging teams based on criteria that are different from what they are trying to accomplish. Some teams may have a style that encourages close wins, but nevertheless win most of their games; it would be unfair to penalize them for not winning by big enough margins. Second, if your ranking is actually being used in a significant way, coaches who understand it will be at an advantage, since they can manipulate game scores to maximize their team's ranking.

In a nutshell, use a predictive ranking to predict future games and a win-loss ranking to evaluate past seasons.

The overall form of the probability equation is identical to that used to build the strength rating:

` -2 ln P = sum(i=games) -2 ln P(result_i|ra_i,rb_i,h) + sum(i=teams) -2 ln Pr(r_i) - 2 ln Pr(h)`

I have defined Pr(x) to represent the prior. The game result probability function is CP(winner-loser), adjusted for game location as always; in the case of a tie, the function is instead NP(winner-loser).

This raises an interesting question, however: do we ignore scores for all teams, or just for the team(s) being ranked in this particular calculation? Ignoring all scores may seem the easiest option, but it throws out a lot of data that need not be thrown out. A common result of doing so is the overranking or underranking of entire conferences, especially in college football, where most of a conference's relative ranking is based on a small number of games. To make this ranking as accurate as possible, therefore, it is imperative to use game scores to rate games among teams other than those being ranked in the current calculation.
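As a minimal sketch of the per-game term, here is what one game's contribution to -2 ln P might look like in code. The actual CP and NP functions are defined earlier in this article; below, a standard normal CDF stands in for CP and a hypothetical bell-shaped tie probability stands in for NP, purely for illustration:

```python
from math import erf, exp, log, sqrt

def cp(x):
    # Stand-in for CP: standard normal CDF (win probability given rating margin).
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def tie_prob(x):
    # Stand-in for NP: a hypothetical tie probability, peaking at 0.25
    # when the location-adjusted ratings are equal.
    return 0.25 * exp(-0.5 * x * x)

def game_term(r_winner, r_loser, home_edge=0.0, tie=False):
    # One game's contribution to -2 ln P; home_edge adjusts for game location.
    margin = r_winner - r_loser + home_edge
    p = tie_prob(margin) if tie else cp(margin)
    return -2.0 * log(p)
```

As expected, an upset (a win by the lower-rated team) contributes a larger penalty to -2 ln P than a win by the favorite, and a tie is penalized more heavily the farther apart the two ratings are.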

A complication is that the win-loss game probability function is not a simple Gaussian, and thus we cannot directly compute the 117-fold (in the case of I-A football) integral without using an inordinate amount of computer time. One way to handle this is to use a process much like what was done for the predictive ratings. Setting all teams other than the one being ranked equal to their strength ratings, the probability that a team has a rating r equals:

```
P(r) = prod(i=wins) CP((r-oi)/(1+doi^2)) * prod(i=losses) CP((oi-r)/(1+doi^2))
* prod(i=ties) NP((r-oi)/(1+doi^2)) * Pr(r)
```

All definitions are retained from previous sections. This function is straightforward to evaluate, since oi and doi were already determined in the strength ratings.
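Evaluating this function takes only a few lines of code. The sketch below works in log space for numerical safety, assumes a standard normal CDF as a stand-in for CP and a Gaussian prior, and omits ties for brevity; each opponent is represented by its strength rating o_i and uncertainty do_i from the strength fit:

```python
from math import erf, log, sqrt

def cp(x):
    # Stand-in for CP: standard normal CDF.
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def log_p_of_r(r, wins, losses, prior_mean=0.0, prior_sigma=1.0):
    """ln P(r), up to an additive constant. wins/losses are lists of
    (o_i, do_i) pairs: opponents' strength ratings and uncertainties."""
    lp = 0.0
    for o, do in wins:
        lp += log(cp((r - o) / (1.0 + do ** 2)))
    for o, do in losses:
        lp += log(cp((o - r) / (1.0 + do ** 2)))
    # Gaussian prior Pr(r), log form up to a constant
    lp += -0.5 * ((r - prior_mean) / prior_sigma) ** 2
    return lp
```

A quick sanity check: for a team with a single win over an average opponent, the likelihood favors positive r; a split (one win, one loss against the same opponent) peaks near the prior mean.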
The question is what to do with this probability distribution. There are three quick options: mean, median, or mode. The most common choice is the mode, which is the value of "r" where P is maximized. Another option in use is the mean, which is defined by:

` mean = [ integral(r=-inf,inf) r P(r) dr ] / [ integral(r=-inf,inf) P(r) dr ]`

Both of these are reasonable options, but they share a common failing: they are scale-dependent. If I chose to rate teams on a zero-to-one scale using CP(r), the mean and mode would fall at different places in the probability distribution. In other words, CP(mean(P)) does not equal mean(CP(P)), nor does CP(mode(P)) equal mode(CP(P)).

This leaves the median, which is the value of r such that 50% of the probability distribution lies at higher r and 50% at lower r. This value turns out to be scale-invariant: CP(median(P)) equals median(CP(P)). Since the definition of the scale is arbitrary, it is preferable to rate the teams in such a way that the ratings do not depend on that choice. In this case, the median is the optimal choice.
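Numerically, the median is easy to pull out of a tabulated P(r), and the scale invariance is easy to verify. In the sketch below, the skewed test density and grid bounds are arbitrary; the monotonic rescaling s = 1 - e^(-r) plays the role of switching to a different rating scale:

```python
import numpy as np

def median_of(axis, density):
    # Median of a distribution tabulated as a density on a monotone grid.
    c = np.cumsum(density)
    c = c / c[-1]
    return axis[np.searchsorted(c, 0.5)]

r = np.linspace(0.0, 20.0, 200001)
p = np.exp(-r)                      # skewed test density (exponential, median = ln 2)

m = median_of(r, p)                 # ~0.693

# Rescale the axis through a monotonic map; the probability weights are unchanged.
s = 1.0 - np.exp(-r)
m_s = median_of(s, p)               # equals the rescaled median, 1 - e^(-m)

# The mean is NOT invariant: mean on the r scale is ~1.0, but the mean on the
# s scale is ~0.5, not 1 - e^(-1) ~ 0.632.
mean_r = np.sum(r * p) / np.sum(p)
mean_s = np.sum(s * p) / np.sum(p)
```

The cumulative mass up to a point is the same no matter how the axis is relabeled, which is exactly why the median commutes with any monotonic rescaling while the mean and mode do not.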

Computer rankings are somewhat notorious for spitting out a set of numbers for each team, ordering the teams by their ratings, and calling that a ranking list. For my standard ranking, I prefer an entirely different approach that is conceptually more what a ranking list ought to be. The basic premise is to make a set of team-by-team comparisons, and rank teams so that (if possible) every team is deemed superior to all teams ranked behind it.

The key is therefore to determine the odds that one team is better than another, not using scores of either team's games. Defining P(a,b) as the probability that team A has a rating of a and that team B has a rating of b, the odds that A is the superior team equals:

```
P(a>b) = [ integral(a=-inf,inf) integral(b=-inf,a) P(a,b) db da ]
       / [ integral(a=-inf,inf) integral(b=-inf,inf) P(a,b) db da ]
```

As before when measuring the strength rating, P(a,b) includes the probabilities of all games occurring and of all team ratings matching their priors. The difference is that the simple Gaussian game probability function has been replaced by CP(winner-loser) for the games played by teams A and B.

As noted, this cannot be marginalized and integrated directly, so there are three practical ways of obtaining a solution.

- As with the median likelihood ranking, use the strength ratings of all teams other than A and B to compute P(a,b). The main drawback is that the covariances between a and the other ratings, and between b and the other ratings, are ignored. A second drawback is that a team's scores indirectly affect its rating, since its opponents' strength ratings depend somewhat on the team's scores. (To manipulate this feature, however, a team would have to ease up on its best opponents, which generally isn't a bright idea.)
- A second option is to use a Gaussian approximation of CP(x) for each game played by teams A and B. The Gaussian approximation is just a second-order Taylor series expansion of ln CP(x), centered on the median likelihood ratings of A and B and the strength ratings of all other teams. The imprecise representation of CP(x) is the obvious drawback, and it causes the most trouble in the wings of the distribution.
- Finally, one can forget about the N-fold integral and instead run a maximum likelihood solution. This means that, for each two-team comparison, one finds the set of team ratings that maximizes the total probability. The drawback of this method is that the shape of the probability distribution P(a,b) is ignored altogether, giving no information about the wings of the distribution. In addition, one is forced to assume something about the uncertainties of a and b, which tend to be about 1.4 times the uncertainties of the teams' strength ratings.
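The first option can be sketched in a few lines: with all other teams frozen at their strength ratings, and assuming A and B did not play each other, P(a,b) factors into P(a)P(b), and the two-fold integral reduces to a sum over a grid. The Gaussian marginals below are placeholders for the actual single-team distributions computed as in the median likelihood ranking:

```python
import numpy as np

a = np.linspace(-5.0, 5.0, 401)
# Hypothetical marginal distributions for teams A and B (placeholders for the
# single-team P(r) curves, which need not be Gaussian in practice).
p_a = np.exp(-0.5 * (a - 0.3) ** 2)
p_b = np.exp(-0.5 * (a + 0.3) ** 2)

# P(a,b) = P(a) P(b) on the grid; sum over the region b < a, normalize by the total.
joint = np.outer(p_a, p_b)
region = a[:, None] > a[None, :]
p_a_better = joint[region].sum() / joint.sum()
```

For these two placeholder marginals (unit-width Gaussians centered at +0.3 and -0.3), the exact answer is CP(0.6/sqrt(2)), about 0.66, and the grid sum reproduces it to grid precision.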

In short, there are several ways of evaluating the problem. None is perfect, but the flaws in each are different. I have tested all three routines and find that, by and large, the results are comparable. I therefore choose the first option, which runs much faster than the others.

As an example of this at work, consider Michigan in the 2003 pre-bowl football rankings. My system lists the Wolverines 10th in the nation, while they are 4th in the consensus computer ranking. Can I be that far wrong? I worried that the approximations made in my computer rankings might have skewed something. However, running through options 2 and 3, I found the same result: a ranking using only score data agrees with a ranking using only win-loss data, but my ranking (using a mix) disagrees with both.

This highlights an interesting case that is worth elaborating on. According to a score-based ranking like my predictive system, Michigan ranks 5th or 6th. The difference between that ranking and mine is that Michigan tended to win its games by larger margins than expected (or, conversely, lost a few games it should have won based on its other results). One cannot say which is right and which is wrong, of course, since predictive and win-loss rankings have fundamentally different goals. On the flip side, systems that use only win-loss results also placed Michigan around #6, but they tend to overrate some of Michigan's top opponents, most notably Ohio State (#11 in my standard ranking, #22 in my predictive ranking), causing Michigan to be overranked as well. Michigan's ranking is more accurate in my system because its opponents' strengths are measured more accurately.

**Note**: if you use any of the facts, equations, or mathematical principles on this page, you must give me credit.