Placing Rate

A secondary performance metric for wargaming statistics.


Space Marines

2024 was a year of gratuitous Astartes violence.

Right after the release of the critically acclaimed Space Marine II, Kill Team developers kept space marine supremacy alive with a new edition of rules, missions, and buffed stat lines that had the top tables of competitive play packed with the Emperor’s (or rather Horus’) finest.

But not all Space Marines were equal. Despite the hype, Angels of Death were delivering a mediocre win rate: 44%.

Yet, to admit defeat is to blaspheme. If you looked at the number of events Angels of Death won, you might be tempted to keep the faith. They had 14 event wins among all events, and at least 2 event wins among the larger, 4-round events. Not bad!

It should be apparent why winning events matters when assessing faction strength and balance. After all, players don’t just want to win games, they aim to take down the whole event! Event performance, in addition to game performance, seems like a valid avenue for judging faction strength, especially for faction performance at the top levels of play.

Still, how do we make sense of Angels of Death’s low win rate? Maybe some noobs just need to get gud? As they say, there are lies, damned lies, and statistics.

Unfortunately, the hype had it wrong. Angels of Death were a weak faction at release; even reddit came to accept this. It wasn’t the win rate that lied, it was the event wins metric.

There were three causes behind Angels of Death's misleading number of event wins:

  1. Popularity – Angels of Death were the most popular faction in the game by far. They had more opportunities to win events than any other faction.
  2. Noise – Event wins are rare observations; there can only be one per event! It'll take a huge dataset of events to try and overcome the dice and matchup gods.
  3. Event Size – Angels of Death were winning small, 3- and 4-round events, but getting nowhere in the bigger, 6-round events. They had no event wins with more than 21 players! A simple event wins metric completely ignores the difference in size between events.

Clearly, a faction that can win small events due to sheer volume has no business being labeled as overpowered, or even good. This bias is particularly concerning when you realize small events are far more common than large events in our dataset.

Slapping a filter on the data doesn’t fix the issue. Whatever minimum event size we set, the smallest qualifying events will still outnumber the larger ones. Besides, good statisticians don’t throw away good data.

Making a Metric

Let’s try to keep that intuition: factions that prove capable of winning events are telling us something meaningful, and that should be measured. We can build upon the event wins metric by solving those three issues.

Accounting for Popularity is easy: we simply divide each faction's number of event wins by the number of players who picked that faction. This converts it into a rate, much like win rate, a binomial process:

Event\;Win\;Rate = \frac{Event\;Wins}{Picks}
🧐 Aperçu

Every time a player attends an event with a faction, we call it a "pick". We distinguish picks from players because the same player can attend multiple events in our dataset.
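As a sketch, the conversion is just a per-faction division. The tallies below are made up for illustration only, not real event data:

```python
# Hypothetical pick and event-win tallies -- not real data.
picks = {"Angels of Death": 900, "Hierotek Circle": 250}
event_wins = {"Angels of Death": 14, "Hierotek Circle": 9}

def event_win_rate(faction: str) -> float:
    """Event wins divided by picks: each pick is one trial in a binomial process."""
    return event_wins[faction] / picks[faction]

# 14 wins looks impressive until it's divided by a mountain of picks.
rate = event_win_rate("Angels of Death")
```

The popularity correction is exactly this: a faction with triple the picks needs triple the event wins just to break even on the rate.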

To help turn down the Noise, let’s dig a bit deeper than 1st place and include other top placings as successes as well. Plenty of players produce quality event outcomes; we don’t need to narrowly focus on golds. Our formula could look something like this:

Placing\;Rate = \frac{Top\;Placings}{Picks}

Great, but what qualifies as a meaningful top placing at an event? To solve our third and final issue, let’s make top placings a Top Percentile. Instead of merely counting 1st-3rd place for each event, each event will have a dynamic number of top placings proportional to that event's size. This makes the metric sensitive to event size, flexing accordingly; percentiles are a reliable way to normalize rankings across different-sized groups (which is exactly what events are: a set of different-sized groups).
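A minimal sketch of the idea, with the percentile left as a free parameter (the results are hypothetical, one `(finishing_place, event_size)` pair per pick, and the 25% cutoff is arbitrary, used only for illustration):

```python
def placing_rate(results: list[tuple[int, int]], percentile: float) -> float:
    """Fraction of picks that finished inside their event's top percentile.
    `results` holds one (finishing_place, event_size) pair per pick."""
    top_placings = sum(1 for place, size in results if place <= size * percentile)
    return top_placings / len(results)

# Four hypothetical picks: 1st of 8 and 2nd of 16 make the cut; 5th of 8
# and 10th of 16 do not.
rate = placing_rate([(1, 8), (5, 8), (2, 16), (10, 16)], percentile=0.25)
```

Note how the success threshold moves with event size: 25% of an 8-player event is 2 placings, while 25% of a 16-player event is 4.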

🧐 Aperçu

You can imagine placing rate as a snapshot of the "top-tables" for each event. As the size of the event increases, so does the number of players who made it to a "top-table".

Placing Rate

We have our metric outlined conceptually, but we still need to determine a specific value for the percentile of top placings. Do we care about the top 25% of placings? Top 10%? Is there an option out there that isn't arbitrary?

I landed on the top 12.5% of placings; as in, 1 top placing for every 8 players at an event.

Swiss events have an interesting pattern: any 3-round event is expected to produce a number of undefeated players equal to 12.5% of that tournament’s size (so long as it meets the minimum requirement of 8 players). We can use this effect as an informed benchmark for what qualifies as a meaningful top percentile.

🧐 Aperçu

The probability of going undefeated at a Swiss event is 0.5 raised to the power of the number of rounds (0.5^rounds): 12.5% for 3 rounds, 6.25% for 4 rounds, 3.125% for 5 rounds, etc… You still need to meet the minimum size, which doubles for every round you add (8, 16, 32, etc…)
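Both quantities fall out of the same coin-flip model: a quick sketch, assuming every game is a 50/50 and winners always pair off against winners:

```python
def undefeated_share(rounds: int) -> float:
    """Expected fraction of players who finish a Swiss event undefeated,
    assuming each game is a coin flip and winners pair against winners."""
    return 0.5 ** rounds

def minimum_size(rounds: int) -> int:
    """Smallest event that can keep pairing undefeated players every round."""
    return 2 ** rounds

# 3 rounds -> 12.5% undefeated, minimum 8 players;
# 4 rounds -> 6.25% undefeated, minimum 16 players.
```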

Although events with more than 3 rounds won’t produce the same percentage of undefeated players, they will produce the same percentage of comparable performances; surely a player who finishes in the top 8 of a 64-player event is not at all inferior to a player who takes home the gold in an 8-player event. Both sit inside the top 12.5% of their respective events.

🧐 Aperçu

For what it’s worth, I also tried other generic percentiles, such as top 10% and 15%. They produced broadly similar results, which is a good thing. It shows our chosen percentile isn't brittle.

This metric is basically saying:

For every faction, I want to count every event performance comparable to a 3-round undefeated run (or better) as a success. All other performances count as failures.

Since 8-person 3-round events are the minimum size in our dataset, this guarantees every event will produce at least 1 available top placing, with larger events producing more:

| # Players | # Top Placings | Successful Placings |
| --- | --- | --- |
| 8 | 1 | 1st |
| 16 | 2 | 1st, 2nd |
| 23 | 2.875 | 1st, 2nd, 3rd (partial) |
| 64 | 8 | 1st, 2nd, 3rd, 4th, 5th, 6th, 7th, 8th |

Notice we’re allowing partial successes, equal to the fractional remainder (the 3rd-place player in the 23-person event is awarded 0.875 instead of a full 1). This eliminates the need to round and maintains an expected value of 12.5% regardless of the many odd-sized events that show up in our dataset.
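A sketch of the credit rule: whole top placings get full credit and the boundary place gets the remainder. (Since 0.125 is an exact binary fraction, the arithmetic below is exact in floating point.)

```python
def success_credit(event_size: int, percentile: float = 0.125) -> dict[int, float]:
    """Map each successful place to its credit: 1.0 for every whole top
    placing, the fractional remainder for the boundary place."""
    remaining = event_size * percentile
    credit: dict[int, float] = {}
    place = 1
    while remaining > 0:
        credit[place] = min(1.0, remaining)
        remaining -= credit[place]
        place += 1
    return credit

# A 23-player event yields 23 * 0.125 = 2.875 top placings:
# 1st and 2nd earn 1.0 each, 3rd earns the 0.875 remainder.
```

Summing the credits always recovers `event_size * 0.125` exactly, which is what keeps the expected value pinned at 12.5% across odd-sized events.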

🧐 Aperçu

Because event placings are naturally ranked, finding the "top percentile" is just simple multiplication (# of Top Placings = Event Size * 0.125). Later, we'll handle sampling error in placing rate itself by modeling it with a beta distribution.

X-1’s

A more common approach to measuring event performances, at least in the broader wargaming community, is to count all players who only lost 1 or 0 games at an event. This metric is called X-1; it is often valued as a secondary metric to event wins.

It’s a very approachable metric, understandably so.

Unfortunately, it shares flaws with event wins. Its most pressing issue is visible in the name itself: X-1 is defined by a fixed property.

It works great when every event has the same size and structure. However, a fixed definition of success introduces bias as soon as you include events of different sizes and structures: a one-loss run at a 3-round event is a categorically different outcome than a one-loss run at an 8-round event.

X-1 isn't “useless”; however, it's a classic case of measuring the wrong thing. An event performance is considered good relative to all other performances at that event; good performances are not reducible to an arbitrary number of losses robbed of their context. Event size is crucial to the context of an event performance.

The percentile strategy works because it defines success relationally. It generalizes across our different sized events; the difficulty of reaching the top 12.5% scales naturally with each Swiss event regardless of its structure.

Other Percentiles

Finally, we could always use a different top placing percentile, such as 6.25%. This percentage would be informed by 4-round undefeated runs and their comparable performances. But I don’t think Kill Team is popular enough for such a low value. We’ll easily fall back into having too much Noise (it’s a huge jump in Relative Variance). Since we still rely a lot on 3-round events, it’s best to keep 12.5%.
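The Noise concern can be made concrete. For a binomial rate p estimated from n picks, the relative variance (the squared coefficient of variation) of the estimate is (1 - p) / (p * n), so halving the target percentile from 12.5% to 6.25% roughly doubles it at the same sample size. A quick sketch:

```python
def relative_variance(p: float, n: int) -> float:
    """Squared coefficient of variation of a binomial proportion estimate:
    Var(p_hat) / p^2 = p(1 - p)/n / p^2 = (1 - p) / (p * n)."""
    return (1 - p) / (p * n)

# Same number of picks, rarer success: relative noise more than doubles.
ratio = relative_variance(0.0625, 1000) / relative_variance(0.125, 1000)  # 15/7
```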

Next up, we'll be jumping into the most exciting piece I’ve ever written: the joys of collecting and cleaning Kill Team event data!