Rankings

Ranking performance estimates while valuing popular representation.


Shrinkage

The previous section was a crash course on how to produce a Bayesian Average through an Empirical Bayes process with wargaming statistics.

The impact of this process is commonly called Shrinkage. For us, shrinkage causes underrepresented factions to get dragged towards the expected value (50% for win rate). Take a look at the difference between win rates and Bayesian win rates in this table:

| Faction | Games | Win Rate | Bayes Win Rate |
|---------|-------|----------|----------------|
| A       | 35    | 69.7%    | 58.9% ↓↓↓      |
| B       | 68    | 58.1%    | 55.0% ↓↓       |
| C       | 403   | 57.4%    | 56.7%          |
| D       | 207   | 41.5%    | 42.9%          |
| E       | 20    | 32.3%    | 45.0% ↑↑↑      |

Factions A, B and E are adjusted much more than factions C and D. We’re effectively saying:

With little sample data, we need a lot of evidence to conclude whether this faction is good or bad.

Meanwhile, heavily represented factions overpower their prior and end up with Bayesian win rates close to their raw win rates. Whether their raw win rate is good or bad, their Bayesian win rate will say much the same. The assumption behind all of this is:

Extraordinary outliers require extraordinary evidence.

If we were only concerned with finding outliers in our data, this effect would be a perfect fit. But what if we wanted to rank our factions from best to worst?
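As a rough sketch of that process, shrinkage is just the posterior mean of a Beta-Binomial update. The Beta(20, 20) prior and the win/game counts here are illustrative assumptions, not the site's actual fitted prior or data:

```python
# Empirical Bayes shrinkage for a binomial win rate with a Beta prior.
# Prior Beta(a0, b0): a0 = b0 = 20 is an assumed strength that centers
# the prior on a 50% win rate (the site's real prior is fitted from data).
a0, b0 = 20.0, 20.0

def shrunk_win_rate(wins: int, games: int) -> float:
    """Posterior mean of the win rate: (a0 + wins) / (a0 + b0 + games)."""
    return (a0 + wins) / (a0 + b0 + games)

# A small sample gets pulled hard toward 50%; a large one barely moves.
print(shrunk_win_rate(6, 20))     # raw 30.0% -> ~43.3%
print(shrunk_win_rate(231, 403))  # raw 57.3% -> ~56.7%
```

The prior acts like 40 phantom games at a 50% win rate, which is why 403 real games drown it out while 20 real games do not.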

Judgment

Notice how faction E jumped quite a bit with its Bayesian average in the above table. It went from a terrible 32.3% to a mediocre 45.0%. Weak, but still within Games Workshop's sacred Goldilocks zone of balance (45%-55% win rate).

Empirical Bayes is telling us faction E might not actually be extraordinarily bad. A sample size of 20 is much too small to jump to conclusions; our "best guess" is that it's probably a weak faction.

Although this is technically correct, when we go to rank our factions, we might still want to rank faction E below the others.

Why? Because faction sample sizes in wargames are not arbitrary or random. They’re meaningful; they reflect something real about a faction’s usage and place in the meta.

A popular faction may not overperform or underperform, but by virtue of its popularity, it’s important for us to respect it. Certainly a popular faction with a 48% win rate is more important than an unpopular faction with a clean 50% win rate. The former is much more significant and influential on the meta than the latter.

This desire to include popularity in rankings is not about establishing an estimate, it’s more about issuing a Value Judgment. We value popular factions; we want popularity to play a role in the rankings. Fortunately, Bayes can help us with this too.

Pessimistic Rankings

Suppose we began our inference assuming every faction was weak, instead of average. The reasoning would go like this:

I’m going to assume you’re a bad faction; if you think you’re good, how about you prove it?

A certain meme comes to mind…

Like any good inquisitor, we treat our estimates with skepticism, holding a more cautious, critical posture.

Fortunately, we can pull this value directly from our Bayesian estimate.

How? Well, remember when I said Bayesian estimates are also probability distributions called Posteriors? Instead of taking the average of every faction’s Bayesian estimate, we can take a lower percentile instead. Then, rank each faction with that lower percentile.

Factions with lots of data produce narrow, confident distributions; factions with little data produce wide, uncertain distributions. When we take a pessimistic, lower percentile, that value will drop much further for the uncertain distributions than it will for the confident distributions. This naturally rewards factions with more games, but in a way that is driven by the Bayesian estimates themselves.
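As a sketch of this step (again assuming an illustrative Beta(20, 20) prior, not the site's actual fitted prior), the lower percentile falls straight out of the Beta posterior:

```python
from scipy.stats import beta

# Same assumed prior as before: Beta(20, 20), centered on a 50% win rate.
a0, b0 = 20.0, 20.0

def pessimistic_estimate(wins: int, games: int, q: float = 0.10) -> float:
    """Lower q-th percentile of the Beta posterior over the win rate."""
    return beta.ppf(q, a0 + wins, b0 + games - wins)

# The small-sample faction's wide posterior drops much further below its
# posterior mean than the large-sample faction's narrow posterior does.
print(pessimistic_estimate(6, 20))     # wide posterior -> big drop
print(pessimistic_estimate(231, 403))  # narrow posterior -> small drop
```

Note that the same posterior used for shrinkage supplies the percentile; no extra model is needed for the pessimism.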

Beta Prior

Notice how narrow the estimate for Wolf Scouts is compared to Gellerpox. Its lower 10th percentile is much higher due to the high confidence we have in the estimate. But an uncertain Gellerpox still does better than Yaegirs!

🧐 Aperçu

The Bayesian shrinkage we performed earlier doesn’t just apply to the average; it also applies to the percentiles (the entire distribution)! This is why these two steps work well together.

Okay, but what percentile should we pick? We could realistically pick a bottom percentile anywhere from 25% down to 2.5%.

Here’s where it becomes a bit of a judgment call. If you couldn't already tell from the graph, I went with the 10th percentile. This value has two qualities:

  1. The bottom 10th percentile is a common, conventional choice for a pessimistic ranking (right after 5%).
  2. The bottom 10th percentile of our prior is about a 40% win rate, a clean benchmark for a bad faction.

Seeing the relationship between the 10th percentile and a 40% win rate should give you an idea of just how pessimistic we’re being. It's like we're assuming every faction is that bad; we expect them to prove otherwise.

I also like that it’s below Games Workshop’s own standard for balance (45% at the low end), but not as pessimistic as the very worst factions we’d expect to see.
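As a quick sanity check on that 40% benchmark (still assuming the illustrative Beta(20, 20) prior), the prior's 10th percentile can be read off directly:

```python
from scipy.stats import beta

# Assumed prior Beta(20, 20), centered on 50% (illustrative only):
# its 10th percentile lands at roughly a 40% win rate.
p10 = beta.ppf(0.10, 20, 20)
print(f"prior 10th percentile: {p10:.1%}")
```

In other words, before seeing any games, we treat each faction as if it were a roughly 40% win rate faction and let its record argue it upward.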

Adding It Up

Here are the steps we take for building our estimates within our ranking algorithm, as applied to each faction:

  1. We start with our empirical prior; what we believe an unknown faction looks like.
  2. We update it with real faction data; an estimate of a faction's true win rate.
  3. We grab the 10th percentile of that estimate (posterior); what we believe a bad version of this faction looks like.
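The three steps above can be sketched in a few lines; the Beta(20, 20) prior and the faction records here are illustrative assumptions, not the site's fitted prior or real data:

```python
from scipy.stats import beta

# Assumed empirical prior (the site's actual prior is fitted from data).
A0, B0 = 20.0, 20.0

def pessimistic_score(wins: int, games: int, q: float = 0.10) -> float:
    # 1. Start from the prior Beta(A0, B0).
    # 2. Update with the record -> posterior Beta(A0 + wins, B0 + losses).
    # 3. Take the posterior's 10th percentile as the ranking score.
    return beta.ppf(q, A0 + wins, B0 + games - wins)

# Hypothetical records as (wins, games).
records = {"A": (24, 35), "E": (6, 20)}
ranking = sorted(records, key=lambda f: pessimistic_score(*records[f]),
                 reverse=True)
print(ranking)  # -> ['A', 'E']
```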

This process is at the heart of what powers this site’s Rankings page; that's how it values factions by their performance and popularity.

Placing Rate

The ranking algorithm considers pessimistic point estimates for both win rate and placing rate. However, once it calculates those, there are a couple of hoops it needs to jump through before it can join them into a single value for sorting.

First, since win rate and placing have different means and variances, we need to Standardize before combining them. Passing each pessimistic value through its respective prior's CDF converts both into a common scale: how good is this result compared to our original expectations (i.e. the priors)?
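A minimal sketch of that standardization step, with assumed illustrative priors (Beta(20, 20) for win rate and Beta(8, 12) for placing rate; neither is the site's actual fitted prior):

```python
from scipy.stats import beta

# Hypothetical priors: win rate centered on 50%, placing rate on 40%.
win_prior = beta(20, 20)
place_prior = beta(8, 12)

def standardize(pessimistic_value: float, prior) -> float:
    """Map a value onto [0, 1]: the fraction of the prior it beats."""
    return prior.cdf(pessimistic_value)

# Each metric is compared against its own prior, so the two scores
# land on a common scale despite different means and variances.
print(standardize(0.53, win_prior))
print(standardize(0.41, place_prior))
```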

Next, we weight win rate twice as much as placing rate. Like the selection of the pessimistic percentile of 10%, this choice is based on a value judgment, not some objective principle. We value win rate twice as much due to its status as our primary metric.

Additionally, win rate has lower variance than placing rate; it’s more reliable early in the quarter when sample sizes are small.
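Joining the two standardized scores is then just a weighted average; a minimal sketch with the 2:1 weighting described above (the input scores are made-up examples):

```python
# 2:1 weighting of win rate over placing rate -- a value judgment,
# not a fitted quantity.
W_WIN, W_PLACE = 2.0, 1.0

def combined_score(win_score: float, place_score: float) -> float:
    """Weighted average of the two standardized [0, 1] scores."""
    return (W_WIN * win_score + W_PLACE * place_score) / (W_WIN + W_PLACE)

print(combined_score(0.65, 0.40))  # ~0.567
```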

Next: Confounders

With this section wrapped up, I wanted to touch on what's next in the pipeline.

At the start of this series, we highlighted the difference between Signal and Noise. Signal, it was mentioned, includes both player skill and faction strength; neither is noise.

However, that presents an issue: faction performance becomes a composite of faction strength and the skill of that faction's player base.

If there's a Selection Bias between player skill and faction choice, then player skill acts as a Confounding Variable. Player skill becomes a third variable, influencing both the independent (faction) and dependent (observed performance) variables.

I have some docs in the works that will dive deeper into this, including what we can do about it.

However, just because a confounder exists, doesn’t mean it’s particularly strong. Fortunately, player faction choice is complex; it involves many factors unrelated to player skill. It's not unidirectional, which keeps its influence modest.

Still, it's worth pursuing; I just didn’t want to delay releasing coverage of what I’m currently doing for something that is a bit further down the road.

For now, you made it to the end; congrats!

Hope you enjoyed the read! 🧐🥃