Roller Derby International Rankings for 2014: An Analysis

On the back of our attempts at a more neutral ranking scheme for the World Cup national teams, and some subsequent conversation on the Rollerderby subreddit and Twitter, we’ve made a few tweaks to our ranking algorithm and fed it the games from the European Roller Derby Tournament and Road To Dallas, as well as the World Cup games it had before.
[Technical notes: we’ve been persuaded that the topological sorting constraint is overly aggressive in asserting groupings from the initial ground truth, which adds some inflexible structure to the sort. Instead, we now iterate the inference rankings to self-consistency and sort on the relative strengths in the self-consistent inference matrix, using the final inferences only. Code for this has been added to the repository here.]

We hear that there was some controversy when our previous rankings were published, although none of the teams has actually contacted us to comment. (The only direct comments we’ve had have been from people expressing interest in the statistical methods used, which were chosen to be neutral. In fact, this article was written partly at the request of a number of people who wanted to see what the inclusion of additional datasets did to our ranking calculations.) We, of course, welcome any criticism of our methods and discussion of the results of our analysis.

In order to allay any possible sense of bias being present in what was designed as a neutral computational statistical ranking, we have also calculated the least squares rankings for the National Teams from the same data set (plus the Italy v Switzerland result from the same timeframe), with both score difference and log(score ratio)1 rankings. In this, we follow Massey, considering that least squares rankings are generally held to be amongst the most accurate ranking methods for predictive sports ranking.2

[Technical notes: The approach used for least squares ranking was to form the usual matrix of games from Massey, but rescaling scores in the 40 minute bouts by 60/40 to estimate the score difference in a full game. For score ratios this was not necessary, as ratios are scale-invariant measures (which is the reason we decided to use them for our metric), but we do apply the same cap on blowouts as in our own measure, treating a zero-scoring team as if it had really scored half a point. In the National Teams dataset, this only affects the Sweden-Japan game: altering the game to give a single point to Japan instead does not affect the final placement of Sweden in the ranking, although it does adjust their predicted power. We prepared the data using Python scripts for processing in GNU Octave for the least squares fit; the scripts used are also available in the same repository as our ranking code.]
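As a rough illustration of the least squares setup described in the note above, here is a minimal pure-Python sketch. It is not the Python/Octave code from the repository: the function names, the tuple layout for a game, the choice to pin one reference team at zero, and the three example results are all assumptions made for illustration.

```python
import math


def score_difference(score_a, score_b, minutes):
    # Rescale shorter bouts to a 60 minute equivalent (60/40 for a 40 minute bout).
    return (score_a - score_b) * 60.0 / minutes


def log_score_ratio(score_a, score_b, minutes):
    # Ratios are scale-invariant, so the bout length is ignored here.
    # A zero score is treated as half a point, which caps blowouts.
    return math.log(max(score_a, 0.5) / max(score_b, 0.5))


def solve(matrix, rhs):
    # Gaussian elimination with partial pivoting on an augmented copy.
    n = len(rhs)
    m = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: are all teams linked by games?")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x


def massey_ratings(games, measure, reference):
    # Each game contributes one equation: r[team_a] - r[team_b] = measure(...).
    # The reference team is pinned to 0 so that the system has a unique solution.
    teams = sorted({team for game in games for team in game[:2]})
    free = [team for team in teams if team != reference]
    index = {team: i for i, team in enumerate(free)}
    n = len(free)
    ata = [[0.0] * n for _ in range(n)]  # normal equations: (A^T A) r = A^T y
    aty = [0.0] * n
    for team_a, team_b, score_a, score_b, minutes in games:
        y = measure(score_a, score_b, minutes)
        row = {}
        if team_a in index:
            row[index[team_a]] = 1.0
        if team_b in index:
            row[index[team_b]] = -1.0
        for i, vi in row.items():
            aty[i] += vi * y
            for j, vj in row.items():
                ata[i][j] += vi * vj
    solution = solve(ata, aty)
    ratings = {reference: 0.0}
    ratings.update({team: solution[index[team]] for team in free})
    return ratings


# Hypothetical results, NOT real bouts: (team_a, team_b, score_a, score_b, minutes)
games = [
    ("A", "B", 200, 100, 60),
    ("B", "C", 150, 50, 60),
    ("A", "C", 300, 50, 40),
]

ratings = massey_ratings(games, log_score_ratio, reference="A")
powers = {team: math.exp(r) for team, r in ratings.items()}
for team in sorted(powers, key=powers.get, reverse=True):
    print(team, round(powers[team], 4))
```

With these made-up scores the three ratios are mutually consistent, so the fit puts B at half of A's strength and C at a sixth of it. Passing `score_difference` instead of `log_score_ratio` gives the linear variant, where the power is an expected score difference relative to the reference team.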

Our modified iterative ranking predicts the following ranking of teams, shown with the old algorithm alongside for comparison, plus the Least Squares ranks4. We’ve colour coded the teams by their divergence from the official B&T World Cup rankings for interest (Black is the same as B&T, Green is higher ranked than B&T, Red is lower ranked than B&T).

Rank | Topological Sort (old model) | Iterative consistency (new model) | Least Squares Rank with respect to Score Difference (Power is expected score diff) | Least Squares Rank with respect to Score Ratio (Power is expected score ratio)
-----|------------------------------|-----------------------------------|-----------------------------------|-----------------------------------
1 | USA | USA | USA 0.0 | USA 1.0
2 | England | England | England -191.00 | England 0.3009
3 | Australia | Australia | Australia -213.273 | Australia 0.2700
4 | Canada | Sweden | Canada -294.666 | Sweden 0.1516
5 | Finland | Canada | Sweden -380.901 | Canada 0.1317
6 | Sweden | NewZealand | Finland -460.143 | NewZealand 0.06539
7 | Scotland | Finland | NewZealand -465.461 | Finland 0.05634
8 | Argentina | France | France -539.895 | France 0.04562
9 | Ireland | Ireland | Germany -554.373 | Ireland 0.04243
10 | France | Germany | Scotland -567.636 | Germany 0.04091
11 | Germany | Scotland | Argentina -570.828 | Scotland 0.03381
12 | NewZealand | Argentina | Ireland -572.967 | Argentina 0.03077
13 | Belgium | Netherlands | Belgium -721.009 | Belgium 0.01542
14 | Norway | Belgium | Norway -729.422 | Netherlands 0.01384
15 | Netherlands | Spain | Spain -743.184 | Norway 0.01355
16 | WestIndies | Norway | Chile -744.361 | Chile 0.01341
17 | Wales | Wales | Wales -748.199 | Wales 0.01295
18 | Denmark | Chile | Denmark -752.251 | Denmark 0.01169
19 | Colombia | Denmark | Netherlands -759.382 | Spain 0.01157
20 | Spain | WestIndies | Colombia -768.326 | WestIndies 0.01128
21 | Greece | Colombia | WestIndies -788.968 | Colombia 0.009998
22 | Brazil | Greece | Mexico -870.077 | Greece 0.006950
23 | Chile | Brazil | Greece -886.9503 | Brazil 0.005783
24 | Italy | SouthAfrica | Portugal -887.2898 | Portugal 0.005648
25 | Mexico | Portugal | SouthAfrica -893.530 | SouthAfrica 0.005354
26 | Portugal | PuertoRico | Brazil -895.490 | Mexico 0.005206
27 | SouthAfrica | Mexico | PuertoRico -944.083 | PuertoRico 0.005150
28 | Japan | Switzerland | Switzerland -955.1116 | Switzerland 0.004456
29 | PuertoRico | Italy | Italy -975.691 | Italy 0.003946
30 | Switzerland | Japan | Japan -1062.202 | Japan 0.001298

All rankings show groupings of teams with extremely close strength, for example (Canada, Sweden), (Scotland, Ireland, Argentina) and (England, Australia), but there are elements of disagreement in the precise ranking.

We also see that it is still unambiguous that Germany was unfairly relegated from the Top 16 due to poor group selection in the World Cup (another three big upward movers that we missed before are Spain, Greece and Chile, with Spain looking like another possible Top 16 team). We’re particularly impressed by the performance of Greece, as they had very little practice time before the World Cup itself.

The “amount of divergence” from the B&T tournament ranking increases as we move away from the top ranks, which is precisely as expected for a single-elimination tournament ranking. (The case of Wales, which is incongruous in being ranked exactly where B&T places it by all of the rankings, is probably due to it being a pivot point, on the edge of the Top 16 ranking. As our rankings are all tournament-neutral, there’s a tendency for teams to shuffle ranking relative to the tournament, pivoting around the tournament boundaries. We also see this around the Top 8 boundary, with more fuzz due to the higher concordance with the B&T rankings in general. That is: this apparent structure is a reflection of the tournament structure itself, rather than the ranking methods here, which are structureless except for the Topological Sort.)

In general, the pure ratio based models (our new model and log(ratio) least squares) agree with high correlation for the majority of the table, with the ranks around the 15th to 19th positions showing the worst concordance. As both methods are global optimisation schemes, we’d expect them to agree substantially on the rankings, given the same metric; the least squares method has the advantage of executing substantially more quickly! The topological sort has the highest rank disagreement with the other models, although it agrees with some general properties of the ordering (it is the only sort to agree generally with some of the B&T Cup ranking properties, as it tends to lower the rank of teams who played fewer games, a property enforced on some teams by the tournament structure itself). The linear score difference least squares is also surprisingly congruent with the ratio-based metrics, outside of a few anomalies like the lower placement of the Netherlands, and it does tend to uprank and downrank the same teams as the ratio methods, relative to the B&T tournament ranks.
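One way to put a number on this agreement is a rank correlation. The sketch below uses Spearman's coefficient, which is a choice made for this illustration and not necessarily the measure behind the statement above. The three orderings are copied from the women's table, and the function name is hypothetical.

```python
# Rank orders copied from the table above (list position 0 is rank 1).
topological = (
    "USA England Australia Canada Finland Sweden Scotland Argentina Ireland "
    "France Germany NewZealand Belgium Norway Netherlands WestIndies Wales "
    "Denmark Colombia Spain Greece Brazil Chile Italy Mexico Portugal "
    "SouthAfrica Japan PuertoRico Switzerland"
).split()

iterative = (
    "USA England Australia Sweden Canada NewZealand Finland France Ireland "
    "Germany Scotland Argentina Netherlands Belgium Spain Norway Wales Chile "
    "Denmark WestIndies Colombia Greece Brazil SouthAfrica Portugal "
    "PuertoRico Mexico Switzerland Italy Japan"
).split()

ls_ratio = (
    "USA England Australia Sweden Canada NewZealand Finland France Ireland "
    "Germany Scotland Argentina Belgium Netherlands Norway Chile Wales "
    "Denmark Spain WestIndies Colombia Greece Brazil Portugal SouthAfrica "
    "Mexico PuertoRico Switzerland Italy Japan"
).split()


def spearman_rho(order_a, order_b):
    """Spearman rank correlation for two orderings of the same teams (no ties)."""
    if set(order_a) != set(order_b) or len(set(order_a)) != len(order_a):
        raise ValueError("orderings must contain the same teams, each once")
    position_b = {team: i for i, team in enumerate(order_b)}
    n = len(order_a)
    # Sum of squared rank differences between the two orderings.
    d_squared = sum((i - position_b[team]) ** 2 for i, team in enumerate(order_a))
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))


print("iterative vs log(ratio) least squares:", spearman_rho(iterative, ls_ratio))
print("topological vs log(ratio) least squares:", spearman_rho(topological, ls_ratio))
```

A value of 1 means identical orderings. The iterative and log(ratio) columns differ only in a handful of positions, so their coefficient sits very close to 1, while the topological sort scores lower against the same column.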

The pattern of up and down ranked teams in the 8-16 rank positions, with substantial agreement across all of the three latter rankings, is largely a consequence of the “score difference from last bout” ranking chosen by B&T. As we mentioned in other comments, there are issues with such a ranking mechanism, as score difference is only a measure of the relative skill difference between two teams, not an absolute measure. As the difference in skill in the top 8 is unambiguously large, the score difference for 8-16 rank teams can be dominated by which of the Top 8 teams they played, rather than the actual difference in skill within the 8-16 rank.

On the basis of this comparison, and for additional interest, we also calculated Least Squares rankings for the Men’s National Teams who attended the Men’s Roller Derby World Cup 2014. Again, we’ve colour coded for alterations relative to the official tournament rankings; as the MRDWC2014 allowed draws for 7th and 11th places, we’ve half-coloured teams which are ranked in the “7,8”th places or “11,12”th places when MRDWC2014 assigned them to the drawn 7th and 11th positions.

Rank | Rank with respect to Score Difference (Power is expected score difference) | Rank with respect to Score Ratio (Power is expected score ratio)
-----|-----------------------------------|-----------------------------------
1 | USA 0.0 | USA 1.0
2 | England -160.067 | England 0.273
3 | Canada -196.457 | Canada 0.267
4 | France -288.076 | France 0.132
5 | Australia -414.936 | Australia 0.0664
6 | Wales -425.478 | Wales 0.0531
7 | Argentina -490.412 | Argentina 0.0394
8 | Finland -566.641 | Scotland 0.0271
9 | Scotland -573.354 | Ireland 0.0236
10 | Ireland -576.750 | Finland 0.0147
11 | Belgium -676.660 | Belgium 0.0136
12 | Germany -723.834 | Netherlands 0.00898
13 | Netherlands -738.020 | Germany 0.00791
14 | Sweden -788.960 | Japan 0.00630
15 | Japan -905.176 | Sweden 0.00491

As can be seen, despite being a single-elimination tournament, with relatively unknown team rankings, MRDWC2014 did remarkably well at ranking its teams compared with post-hoc statistical methods – even the differences in ranking are almost all one-position shifts! This is partly because the tournament was half as big, of course, which makes ranking geometrically easier. It was also, however, because the tournament design was explicitly constructed with proper ranking in mind (this can be seen in the refusal to separate the drawn 7th and 11th positions, on the principle that the paired teams never played each other, and so there is no ground truth to separate them), which helped to ameliorate the deficiencies of the single-elimination format. (Splitting the difference between the two least squares methods would seem to suggest that Ireland/Scotland/Finland deserve a three way tie for 8th place, and Germany/Netherlands should be tied for 12th rather than 11th, but these are small divergences from the tournament ranking.) This ranking, of course, does not include any other games outside the MRDWC2014, and so should only be considered representative of the state of the teams at that time.

Returning to the Women’s National Teams, our main conclusion is that we would really like to see Canada play Sweden at some point in the near future (and Finland take on Sweden in a rematch). In general, we’d like to promote the use of fairer ranking schemes, and more thought in planning large tournaments in order to encourage the fairer ranking of those competing. The example of MRDWC2014 shows that it is quite possible to manage a tournament, with care, to maximise the neutrality of the contest, whilst still admitting other constraints (such as getting teams from different geographical locations to play each other).


1We have to use log(score ratio) for a least squares regression to make the measure linear. This issue with the linearity requirement was one reason we didn’t adopt a least squares method for our initial inference model.
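
To spell out the linearity point in this footnote, here is a short worked form. The notation is chosen for this sketch only: s_a and s_b are the two scores in a bout, p_a and p_b are positive team strengths, r_i = log p_i is the rating that gets fitted, and ε is the residual.

```latex
\frac{s_a}{s_b} \approx \frac{p_a}{p_b}
\quad\Longrightarrow\quad
\log\frac{s_a}{s_b} = r_a - r_b + \varepsilon,
\qquad r_i = \log p_i .
```

The right-hand side is linear in the unknown ratings, which is the form a least squares fit needs; the raw ratio on the left is not. Exponentiating a difference of fitted ratings recovers an expected score ratio.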

2For example, Sports Rankings REU Final Report 2012 notes that least squares minimisation provides the most accurate predictive rankings for Basketball and Football out of all of the (non-simulational) methods they compare, and the Bracketology review of College Basketball rankings and this comprehensive analysis of ranking predictive systems across many sports also show that “Massey”/least squares methods have good predictive power. Even in a comparison of football team prediction, where home-team advantage is not modelled by simple Massey predictions, it is still one of the best “simple” models tested. This is unsurprising3, as least squares regression is one of the most tested means of statistical modelling of (linear) functions in modern science.

3Of course, least squares methods, like our inference scheme, assume that “superiority at a game” is a transitive condition, which is not necessarily true in sports (you can imagine a team whose tactics are simply ill-suited to an opponent of similar ability). However, the real world performance tests of the method suggest that transitivity does hold strongly enough in many sports for least squares methods to provide good metrics.

4 FlatTrackStats uses Elo ranking methods instead, which do not assume transitivity, and have similar performance properties to Massey least squares rankings (the FTS algorithm uses a normalised score difference method, slightly different to our pure ratio, to determine team strength, for the same “scale-invariance” property that we value5, and also applies a non-Gaussian error estimator). Elo rankings tend to perform better with lots of contests between players, as the estimator works by “transferring” points from a losing team to a winning one. This also means that it scales better with huge numbers of contests – it’s a good choice for FTS to use, given the size of their bout database. Global estimator methods, like linear least squares, are better suited to tournament style prediction, however, where the number of contest pairs is small compared to the total space, and the games are all played in a relatively short timeframe (and there’s no home-field advantage). The supplied Python scripts also generate a ranking based on the FTS normalised score difference using least squares optimisation, so the interested reader can generate the pseudo FTS ranking themselves. We don’t publish it here to avoid filling the table with too many very similar results (the rankings produced are generally half-way between the score difference and pure ratio least squares rankings, with the only significant deviation being a particularly low estimated ranking for Portugal, which we don’t really understand).

5Direct evidence in favour of scale-invariant ratio measures like log(ratio) and FTS style “normalised score difference” comes from the Men’s Roller Derby World Cup. Belgium and Japan faced each other twice during the tournament, once in the group stage and once in a full length bout, as did Germany and Ireland. Computing the ratio of scores and the normalised difference of scores for both bouts produces estimated relative strengths for the two teams which match very well (almost perfectly for Germany/Ireland, and within 20% for Belgium/Japan, where we would expect a higher disparity due to Japan’s own rapid skill development). Computing the score difference does much more poorly (off by 100%+ in both cases)!
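
The scale-invariance argument can be illustrated with a toy calculation. The scores below are hypothetical, not the MRDWC2014 results, and the sketch assumes that “normalised score difference” means difference over sum, (a - b) / (a + b).

```python
def ratio(a, b):
    return a / b


def normalised_difference(a, b):
    # Assumed definition: score difference divided by total points scored.
    return (a - b) / (a + b)


def difference(a, b):
    return a - b


# Hypothetical scores: the same two teams scoring at the same rates
# in a 40 minute bout and in a full 60 minute bout.
short_bout = (80, 40)
full_bout = (120, 60)

for measure in (ratio, normalised_difference, difference):
    print(measure.__name__, measure(*short_bout), measure(*full_bout))
```

The ratio and the normalised difference come out the same for both bouts, while the raw difference grows with the length of the game, which is why it needs the 60/40 rescaling before bouts of different lengths can be compared.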
