Rankings
Ranking performance estimates while valuing popular representation.
Shrinkage
The previous section was a crash course on how to produce a Bayesian Average through an Empirical Bayes process with wargaming statistics.
The impact of this process is commonly called Shrinkage. For us, shrinkage drags underrepresented factions towards the expected value (50% for win rate). Take a look at the difference between win rates and Bayesian win rates in this table:
| Faction | Games | Win Rate | Bayes Win Rate |
|---|---|---|---|
| A | 35 | 69.7% | 58.9% ↓↓↓ |
| B | 68 | 58.1% | 55.0% ↓↓ |
| C | 403 | 57.4% | 56.7% ↓ |
| D | 207 | 41.5% | 42.9% ↑ |
| E | 20 | 32.3% | 45.0% ↑↑↑ |
Notice how factions A, B and E are adjusted much more than factions C and D. We’re effectively saying:
With little sample data, we need a lot of evidence to conclude whether this faction is good or bad.
Meanwhile, well-represented factions barely move; their Bayesian win rates are close to their raw win rates. The size of a faction's adjustment is inversely related to its sample size: the fewer games we have, the harder the pull towards 50%. The assumption behind all of this is:
Extraordinary outliers require extraordinary evidence.
If we were only concerned with finding outliers in our data, this effect would be a perfect fit. But what if we wanted to rank our factions from best to worst?
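As a rough sketch of the arithmetic behind the table, assume the prior is a Beta distribution and each faction's record is treated as wins out of games. The Bayesian win rate is then the prior's pseudo-wins plus the observed wins, divided by the prior's pseudo-games plus the observed games. The `Beta(21, 21)` prior and the `bayes_win_rate` name are placeholders picked for illustration (a mean of 50%, worth about 42 games of evidence), not the site's fitted prior, so the output only approximates the table.

```python
# Hypothetical prior: Beta(21, 21) has a mean of 50% and carries the weight
# of roughly 42 games. These values are an assumption, not the fitted prior.
PRIOR_A, PRIOR_B = 21.0, 21.0


def bayes_win_rate(wins, games, prior_a=PRIOR_A, prior_b=PRIOR_B):
    """Posterior mean of a Beta prior updated with a win/loss record."""
    return (prior_a + wins) / (prior_a + prior_b + games)


# (faction, games, raw win rate) taken from the table above.
factions = [
    ("A", 35, 0.697),
    ("B", 68, 0.581),
    ("C", 403, 0.574),
    ("D", 207, 0.415),
    ("E", 20, 0.323),
]

for name, games, raw in factions:
    wins = raw * games  # fractional wins, reconstructed from the rate
    shrunk = bayes_win_rate(wins, games)
    moved = abs(shrunk - raw)
    print(f"{name}: raw {raw:.1%} -> shrunk {shrunk:.1%} (moved {moved:.1%})")
```

The fewer games a faction has, the more the prior's pseudo-games dominate the result, which is why A and E move the most.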
Judgment
Notice how faction E jumped quite a bit with its Bayesian average in the above table. It went from a terrible 32.3% to a mediocre 45.0%. Weak, but still within James Workshop's sacred Goldilocks zone of balance (45%-55% for win rates).
Empirical Bayes is telling us faction E might not actually be terrible. A sample size of 20 is much too small to jump to conclusions; our "best guess" is that it's at least on the weak side of balance.
Although this is technically correct, when we go to rank our factions, we might still want to rank faction E below the others.
Why? Because faction sample sizes in wargames are not arbitrary or random. They’re meaningful; they reflect something real about a faction’s usage and place in the meta.
A popular faction may not overperform or underperform, but by virtue of its popularity, it’s important for us to recognize it. Certainly a popular faction with a 48% win rate is more important than an unpopular faction with a clean 50% win rate. The former is much more significant and influential on the meta than the latter.
This desire to include popularity in rankings is not about establishing an estimate; it’s about issuing a Value Judgment. Because we value popular factions, we want popularity to play a role in the rankings. Fortunately, Bayes can help us with this too.
Pessimistic Rankings
Suppose we began our inference assuming every faction was weak, instead of average. The reasoning would go like this:
I’m going to assume you’re a bad faction; if you think you’re good, how about you prove it?
A certain meme comes to mind…
Like any good inquisitor, this view is skeptical of our estimates and keeps a cautious, critical eye on them.
We can adopt such an approach directly with our Bayesian estimates.
How? Well, remember in the previous piece when I said Bayesian estimates are also probability distributions called Posteriors? Instead of taking the average of every faction’s Bayesian estimate, we can take a lower percentile instead; then, rank each faction using that lower percentile.
Factions with lots of data produce narrow, confident distributions; factions with little data produce wide, uncertain distributions. When we take a pessimistic, lower percentile, that value will drop much further for the uncertain distributions than it will for the confident distributions. This naturally rewards factions with more games, but in a way that is driven by the Bayesian estimates themselves.

Notice how narrow the estimate for Wolf Scouts is compared to Gellerpox. Its lower 10th percentile is much higher due to the high confidence we have in the estimate. But an uncertain Gellerpox still does better than Yaegirs!
The Bayesian shrinkage we performed earlier doesn’t just apply to the average; it also applies to the percentiles (the entire distribution)! This is why these two steps work well together.
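To make the width effect concrete, here is a minimal sketch that draws samples from two posteriors and reads off their 10th percentiles. Everything specific in it is an assumption for illustration: the `Beta(21, 21)` prior, the `pessimistic_percentile` name, and the two made-up factions, which share a 55% raw win rate but differ tenfold in games played.

```python
import random
import statistics

# Hypothetical prior (an assumption, not the site's fitted prior).
PRIOR_A, PRIOR_B = 21.0, 21.0


def pessimistic_percentile(wins, games, pct=10, draws=20000, seed=0):
    """Estimate a lower percentile of the Beta posterior by sampling."""
    rng = random.Random(seed)
    post_a = PRIOR_A + wins
    post_b = PRIOR_B + (games - wins)
    samples = [rng.betavariate(post_a, post_b) for _ in range(draws)]
    # quantiles(n=100) returns 99 cut points; index pct - 1 is the pct-th.
    return statistics.quantiles(samples, n=100)[pct - 1]


# Two made-up factions with the same 55% raw win rate.
small = pessimistic_percentile(wins=22, games=40)
large = pessimistic_percentile(wins=220, games=400)

print(f"40 games:  10th percentile = {small:.3f}")
print(f"400 games: 10th percentile = {large:.3f}")
```

The faction with 400 games keeps a noticeably higher 10th percentile than the one with 40 games, even though their raw win rates are identical; the smaller sample is pulled further towards 50% and also has a wider posterior.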
Okay, but what percentile should we pick? We could realistically pick a bottom percentile anywhere between 25% and 2.5%.
Here’s where it becomes a bit of a judgment call. If you couldn't already tell from the graph, I went with the 10th percentile. This value has two qualities:
- The bottom 10th percentile is a common, conventional choice for a pessimistic ranking (right after 5%).
- The bottom 10th percentile of our prior is about a 40% win rate; a clean benchmark for a bad faction.
Seeing the relationship between the 10th percentile and a 40% win rate should give you an idea of just how pessimistic we’re being. It's like we're assuming every faction is that bad; we demand that each of them prove otherwise.
I also like that this value is below Games Workshop’s own standard for balance (45% at the low end), but not as pessimistic as the very worst factions we’d expect to see.
Adding It Up
Here are the steps we take for building our estimates within our ranking algorithm, as applied to each faction:
- We start with our empirical prior: what we believe an unknown faction looks like.
- We update it with real faction data: an estimate of a faction's true win rate.
- We grab the 10th percentile of that estimate (posterior): what we believe a bad version of this faction looks like.
This process is at the heart of what powers this site’s Rankings page; that's how it values factions by both their performance and popularity.
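The three steps can be sketched end to end using only the standard library. This is an illustration under stated assumptions, not the site's actual code: the `Beta(21, 21)` prior is a stand-in chosen because its mean is 50% and its 10th percentile lands near 40%, the helper names are hypothetical, and the faction records are reconstructed from the table in the Shrinkage section as games times raw win rate.

```python
import math

# Step 1: a hypothetical empirical prior (an assumption for illustration).
PRIOR_A, PRIOR_B = 21.0, 21.0


def beta_pdf(x, a, b):
    """Beta density; the zero endpoints assume a > 1 and b > 1."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log1p(-x))


def beta_cdf(x, a, b, steps=1000):
    """Beta CDF via composite Simpson's rule on [0, x] (steps must be even)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    h = x / steps
    total = beta_pdf(0.0, a, b) + beta_pdf(x, a, b)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * beta_pdf(i * h, a, b)
    return total * h / 3


def beta_quantile(p, a, b):
    """Invert the CDF by bisection."""
    lo, hi = 0.0, 1.0
    for _ in range(40):
        mid = (lo + hi) / 2
        if beta_cdf(mid, a, b) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def pessimistic_estimate(wins, games, pct=0.10):
    # Step 2: update the prior with the faction's record (the posterior).
    post_a = PRIOR_A + wins
    post_b = PRIOR_B + (games - wins)
    # Step 3: take a lower percentile of the posterior.
    return beta_quantile(pct, post_a, post_b)


# (faction, games, raw win rate) from the table in the Shrinkage section.
factions = [("A", 35, 0.697), ("B", 68, 0.581), ("C", 403, 0.574),
            ("D", 207, 0.415), ("E", 20, 0.323)]

scores = {name: pessimistic_estimate(raw * games, games)
          for name, games, raw in factions}
ranking = sorted(scores, key=scores.get, reverse=True)

for name in ranking:
    print(f"{name}: 10th percentile = {scores[name]:.3f}")
```

Under these assumed values the order comes out C, A, B, D, E. C overtakes A despite a lower Bayesian average, and E drops below D, because A and E have far fewer games behind them.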
Placing Rate
The ranking algorithm factors in both win and placing rate pessimistic point estimates (say that ten times fast). However, once it calculates those, there are a couple of hoops it needs to jump through before it can join them into a single value for sorting.
First, since win rate and placing rate have different means and variances, we need to Standardize before combining them. Passing each pessimistic value through its respective prior's CDF converts both into a common scale. It's like asking:
How good is this pessimistic percentile compared to our original expectations (i.e. the priors)?
Next, we weight win rate twice as much as placing rate. Similar to our choice of the 10th percentile, this choice is also based on a value judgment, not some universal principle. We value win rate twice as much due to its status as our primary metric.
Win rate has lower variance than placing rate; it’s more reliable early in the quarter when sample sizes are small.
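One way to sketch this combination step is shown below. It assumes both priors are Beta distributions and that the two standardized values are joined with a 2:1 weighted average; the text fixes the weights but not the exact formula, so the averaging is an assumption. The prior parameters and the `ranking_score` name are hypothetical as well.

```python
import math

# Hypothetical priors (assumptions for illustration only).
WIN_PRIOR = (21.0, 21.0)    # mean 50% win rate
PLACE_PRIOR = (6.0, 24.0)   # mean 20% placing rate


def beta_pdf(x, a, b):
    """Beta density; the zero endpoints assume a > 1 and b > 1."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log1p(-x))


def beta_cdf(x, a, b, steps=1000):
    """Beta CDF via composite Simpson's rule on [0, x] (steps must be even)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    h = x / steps
    total = beta_pdf(0.0, a, b) + beta_pdf(x, a, b)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * beta_pdf(i * h, a, b)
    return total * h / 3


def ranking_score(win_p10, place_p10, win_weight=2.0, place_weight=1.0):
    """Standardize each pessimistic value through its prior's CDF, then
    combine them with a weighted average (win rate counts double)."""
    win_std = beta_cdf(win_p10, *WIN_PRIOR)
    place_std = beta_cdf(place_p10, *PLACE_PRIOR)
    total_weight = win_weight + place_weight
    return (win_weight * win_std + place_weight * place_std) / total_weight


# Made-up pessimistic estimates for a single faction.
print(f"score = {ranking_score(win_p10=0.48, place_p10=0.15):.3f}")
```

Because each CDF maps its input onto a 0 to 1 scale relative to the prior, the two metrics become comparable before the weights are applied, and the final score also stays between 0 and 1.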
Next: Confounders
With this section wrapped up, I wanted to touch on what's next in the pipeline.
At the start of this series, we highlighted the difference between Signal and Noise. Signal includes both player skill and faction strength; neither is noise.
However, that presents an issue: faction performance becomes a composite of faction strength and the skill of that faction's player base.
If there's a Selection Bias between player skill and faction choice, then player skill acts as a Confounding Variable. Player skill would be a third variable, influencing both the independent (faction) and dependent (observed performance) variables.
I have some docs in the works that will dive deeper into this, including what we can do about it.
However, just because a confounder exists, doesn’t mean it’s particularly strong. Player faction choice is complex and involves many factors unrelated to player skill; it's not unidirectional, which keeps its influence modest.
Another nice advantage from the shrunken, pessimistic point estimates: they stabilize our estimates, which indirectly helps manage the extra variation (overdispersion) caused by player skill.
Still, it's worth addressing; I just didn’t want to delay releasing coverage of what I’m currently doing for something that is a bit further down the road.
For now, you made it to the end; congrats!
Hope you enjoyed the read! 🧐🥃