1) “Y% of the top 8(s) is made of deck X therefore..”
A few days ago, Wizards of the Coast banned Felidar Guardian. “Saheeli-Felidar’s win-loss ratio and metagame share has actually increased since the release of Amonkhet. In Magic Online Standard Leagues since Monday, Saheeli combo has made up approximately 40% of 5-0 and 4-1 decklists—up from prior to Amonkhet’s release.” is one of the few justifications offered by WotC (which is regrettable, since they had much saner arguments available). It’s the same kind of justification given by some old-school Swedes for not doing more about The Deck, or for pleading that the metagame is healthy: things like “it’s only 20% of the top 8”. There may also be a misunderstanding at work. Some consider that if a deck isn’t played all that much, then the format is healthy. Others see it differently: if they find the deck beatable only by accident, then the format isn’t healthy, because they aren’t encouraged to try things, only to play the best deck or to play for non-competitive fun, which isn’t always what people want, especially in tournaments of a competitive game! One side regards excessive redundancy as the enemy; the other, the excessive power of the best deck.
Anyways, this is one of the easiest myths to debunk. What share of the players in that tournament played deck X? What are we to make of Y if we don’t know how many people played X? It makes a world of difference whether the deck was played by 5, 20 or 80% of the players.
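To put numbers on it, here’s a minimal sketch (with made-up tournament figures, nothing measured) of how the very same top 8 share reads completely differently depending on how much of the field played the deck:

```python
# Hypothetical numbers: deck X takes 2 of the 8 top-8 slots (25%).
# How impressive that is depends entirely on its share of the field.
def conversion_rate(top8_slots, field_share, top8_size=8):
    """Ratio of the deck's top-8 share to its share of the field.
    1.0 means it did exactly as well as an average deck would."""
    return (top8_slots / top8_size) / field_share

# Same top-8 result, three very different fields:
for share in (0.05, 0.20, 0.80):
    print(f"field share {share:.0%} -> conversion {conversion_rate(2, share):.2f}")
```

With 5% of the field the deck overperformed fivefold; with 80% of the field the same two top 8 slots mean it underperformed badly. The lone “Y% of the top 8” figure cannot distinguish between those cases.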
As a young(er) William Spaniel said: “
If Jund is good, then it will have a lot of top eight finishes.
Jund has a lot of top eight finishes.
Therefore, Jund is good.
A lot of people believe that having a lot of top eight finishes automatically makes a deck good. In the light of the affirming the consequent fallacy, you can see the issue here. There is undoubtedly a correlation between having a lot of top eight finishes and being a good deck. But there is no guarantee. Here are some other things that may be pushing that deck to the top of the standings:
1) Luck.
2) A small number of tournaments in the observation (similar to luck).
3) A substantial number of players piloting the deck (that is, a lot of top eight finishes and a lot of bottom eight finishes).
4) A biased sample.
5) Good players pooling on the same deck.
Put differently, Jund may have a lot of top eights in these tournaments merely because the players piloting the decks got really, really lucky—and if we looked at a large number of tournaments, the Jund ratio would diminish significantly. Or there are so many players running Jund that Jund populates a substantial portion of the top eight, the next eight, the eight after that, all the way down to the bottom eight. Or there might be a spurious element about all of the tournaments that we are looking at that biases Jund toward winning that we do not expect to encounter in the future. Finally, Jund might just be winning because all of the good players are using it, and so Jund’s victories are more of a result of play skill than optimal deck choice.“
Some people might argue, or I would, playing devil’s advocate, that if a deck is the best it will get played the most, and conversely that if a deck is played a lot it therefore has to be the best. Since so many people overrate the significance of the early, possibly accidental, winners, the vicious circle is bound to start regularly: people see what won the first tournament, many of them start playing it, the deck is actually quite good (though not the best), so through the increased adoption it gets more memorable, spectacular results. Next tournament, more people adopt the deck, and so on. This is what William Spaniel observed in a long series of studies of Standard’s deck rankings in 2010: “The myth of Jund’s brokenness is a self-perpetuating prophecy. People assume Jund is the best deck, so they play it en masse, which results in it always having a large number of top 8 appearances in spite of being an inferior deck. Last week in the comments I made an analogy to the siege of Leningrad. Team Jund is the Red Army. Collectively they are numerous and can just send an infinite number of poorly equipped men to be mowed down by superior weaponry, which means at the end of the day Team Jund will always be putting up some winners in the top eight (the ones that survived the butchery). But if you are playing Jund, the odds of YOU in particular being in the top eight are about the same as those of one of the Red Army’s common grunts making it alive through the siege of Leningrad. Pretty poor.“
Numbers from a tournament or a short period cannot tell us whether a deck had a target on its back; they could tell us whether it will have one. They won’t tell us what the deck to beat was beforehand. We can tell that ourselves when we realize that the first thing we do when trying to brew something competitive is think about how to beat, or at least resist, that deck. Chances are most people do the same. So even if the adoption of the deck were smaller than its share of the top 8, there would still be that parameter to take into consideration.
The mirrors. The more a deck is played, the more mirror matches take place. Imagine a tournament infested by the top deck du jour. The deck doesn’t get an incredible share of the top 8, despite its apparent domination, but we cannot conclude from that data that the deck isn’t great. The top 8 share says little about the potential advantage deck X has over other decks, since in so many of the matches played the deck has no advantage at all: the more a deck gets played, the less its superiority (if any) matters. A deck could still be too oppressive and yet have an unimpressive “conversion rate” (in quotes, since at Pro Tours the term usually means the share of a deck’s pilots that reached day 2, not the share that reached top 8). In the most extreme case, everybody plays the reputed best deck, and the best deck disappoints with perfectly fair conversion and win rates. The conversion rate, the ratio between a deck’s top 8 presence and its presence in the related tournament(s), cannot tell you those things either, and yet it is already more informative than the lone top 8 share.
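A little back-of-the-envelope model makes the dilution concrete. Assuming, purely for illustration, that mirrors are coin flips and that every non-mirror match is won at some fixed rate, the deck’s overall winrate shrinks toward 50% as its share of the field grows:

```python
def overall_winrate(field_share, edge_vs_field):
    """Expected winrate of a deck when mirrors are 50/50 and every
    non-mirror match is won with probability edge_vs_field."""
    return field_share * 0.5 + (1 - field_share) * edge_vs_field

# A deck winning 60% of its non-mirror matches, at three adoption levels:
for share in (0.10, 0.40, 0.80):
    print(f"{share:.0%} of the field -> {overall_winrate(share, 0.60):.1%} overall")
```

At 10% of the field the deck posts around 59% overall; at 80% it looks like a fair 52%, even though it is exactly as good against the rest of the field in both cases.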
Then there’s also the matter of draws. The Deck, or any slow deck, tends to get a relatively large number of draws, and this compounds the effect just described, since draws happen even more often in the mirror. On the one hand, if a deck gets that many draws it is, so to speak, its own fault. On the other hand, if a considerable number of its pilots still make it to the top 8, that penalty clearly didn’t apply; and while draws don’t change who is in the top 8, they do affect the oppression the deck exerts on the format, which further lessens the significance of the top 8 share (a bit overkill, but one is tempted by exhaustiveness when working along such lines).
2) “I’m 60-40 vs. deck X”
Said by: me, a frightening number of pros, and other fools who play 10 matches and decide that this pretty much gives them their winrate in the matchup.
Most serious matchups we want to test are, in all likelihood, not very lopsided. They’re probably not much beyond 60/40, if that: matchups that are skill- and variance-dependent most of the time. But the spike still wants to know, so we playtest the matchup. 10 games, or maybe 20, probably not 30, that’s an awfully long time. That’s convenient: we can calculate the winrate percentage easily afterwards. Unfortunately, that happens to be an unreliable number of matches from which to get any serious stats about such a matchup. The smaller the sample of matches (or of whatever we intend to get stats from), the bigger the chance those stats are accidental and unrealistic.
Take a coin toss. A coin gives about as much chance of heads as of tails (most coins use more metal to engrave the head than the numbers on the tail side, so they generally favor a “heads” result), but it will rarely deliver those outcomes in a balanced, regular way, even if it’s a fair coin, as we’ve seen in this article already. So the size of the sample needed to assess the unfairness of a normal, that is, still very fair coin is probably bigger than you’d imagine. On the other hand, take the extreme case of a wildly unfair coin that almost always gives the same result: we won’t need nearly as many tosses to reasonably conclude that those results are most likely not due to chance. So the required sample size actually depends on the underlying reality of the coin, matchup, and so on. But if we have no idea about it beforehand, we have to assume all possible outcomes are equally likely, and we need the biggest sample the “theory” demands. The “industry standard” is to find a p-value below 0.05; more on that in the next section.
But good luck getting that from a sample of 10, 20 or even 30 matches, unless the results indicate that one deck trounces the other, which isn’t the kind of matchup we’re considering here, so we won’t. And by trounced I mean it: even a deck going 7-3 vs. X is still way off the 0.05 target; 14-6 is borderline (an exact binomial test leaves it just above 0.05, though a normal-approximation calculator puts it under), and 20-10 only barely squeaks in. My buddy Myrdin put in a lot of work and went all the way to 50 matches! (and for an impressive array of matchups, too). With such a sample size we can get a p-value under 0.05 for a 60% winrate, which seems like what we’re looking for (although it won’t give us statistical significance for anything under that 60% bar). But although Myrdin proved it humanly feasible, it isn’t realistic to expect people to use such a method on a regular basis (please don’t jump in and use it before reading the next section).
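For the curious, the exact one-sided binomial p-values behind those 7-3 / 14-6 / 20-10 records can be checked in a few lines; this is a sketch against the coin-flip null, and an online calculator using the normal approximation will give slightly different numbers:

```python
from math import comb

def one_sided_p(wins, n):
    """Exact one-sided binomial p-value: the chance of at least `wins`
    wins in `n` matches if the matchup were truly a 50/50 coin flip."""
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n

print(f"7-3  : p = {one_sided_p(7, 10):.3f}")   # ~0.172, nowhere near 0.05
print(f"14-6 : p = {one_sided_p(14, 20):.3f}")  # ~0.058, just above 0.05
print(f"20-10: p = {one_sided_p(20, 30):.3f}")  # ~0.049, barely under
```

Note how doubling from 10 to 20 matches at the same 70% observed winrate only brings the p-value from 0.17 to 0.06: the sample sizes we can realistically play just don’t separate a 60/40 matchup from a coin flip.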
This doesn’t mean that testing a matchup and taking note of the results is a terrible strategy. It’s still quite good: to get good at Magic in general, and at a specific matchup in particular, you have to play. It’s actually even better if you make the data unusable, because that means you adapted your decklists as your understanding of the matchup improved, which is the actual point. So no, pro teams aren’t necessarily doing it wrong when they test a matchup for a “large” number of games: they get great at understanding the matchup, adjusting their decks to it, and improving their strategy and tactics in playing it. They’re fooling themselves if they go by “we’re 60/40 against X”, but they can still be successful, since through that process they did what you really have to do to get better: play.
3) “The deck’s winrate is above any other deck’s winrate, and it’s statistically significant therefore it’s the best deck” / “blah statistically significant blah” / “95% confidence interval” / “p-value under 0.05” / “95% sure”
Said by: Frank Karsten, William Spaniel, Myrdin, most psychology, sociology and medical researchers, just about any researcher in any field that uses statistics (physicists and other serious people excepted), and ever more people despite all efforts to stop the plague.
Before we look into the science of it: there’s a good chance that not all of the decks being compared to the “statistically significant” ones are themselves statistically significant, with the famed p-value under 0.05 in a “big enough sample”. What can be said about those? If we say that a deck’s winrate is reliable because the formula/spreadsheet/online calculator says statistical significance is obtained, then by that same logic the lesser-played decks aren’t reliably rated. At best we have a reliable estimate of our own deck’s winrate, but we can’t rank it, unless all the decks played had statistically “significant” stats.
The problem with watching tournaments in order to evaluate which decks are best (and this holds not just for statistical examination but also for cursory observation, or that of the spectator) is that some excellent but “rogue” decks will most likely be ignored or badly rated (despite the insufficiency of the available data on them). It is therefore good that not everybody uses such a method, as it would do nothing to promote innovation, quite the contrary; which, as it happens, is exactly one of the main problems faced by Standard. The criticism of netdecking is often a bit crude, but it gets something right: netdecking often leads to the over-representation of some successful and already highly represented decks and the underrating of others, when those are rated at all.
“In the theory of probability and statistics, a Bernoulli trial (or binomial trial) is a random experiment with exactly two possible outcomes, “success” and “failure”, in which the probability of success is the same every time the experiment is conducted.”, as Wikipedia says. With such a neat failure/success basis, it’s no surprise that this is the kind of statistics typically used for Magic stats. Actually, and in accordance with previous arguments, a Magic match isn’t binomial: it is only so if untimed. If draws were counted, and they do generally happen, we’d want the points-per-match rate, not the winrate. But, and that’s probably just because I’m not a statistician (oh yeah, I forgot the disclaimer: I’m not a statistician) and not willing to do what seems to be a lot of work for very little benefit, I don’t see how to use sample Magic data to get such a probability. Multinomial trials, from what I’ve gathered, require you to know the probability of each possible outcome (loss, draw, win) beforehand (in the binomial test, the expected probability of each result is 50%; we test against the hypothesis that a match is like a coin toss). “For example, suppose that two chess players had played numerous games and it was determined that the probability that Player A would win is 0.40, the probability that Player B would win is 0.35, and the probability that the game would end in a draw is 0.25. The multinomial distribution can be used to answer questions such as: “If these two chess players played 12 games, what is the probability that Player A would win 7 games, Player B would win 2 games, and the remaining 3 games would be drawn?”” So it seems we need to know the probabilities beforehand to ascertain how likely the sample is to arise.
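To make the quoted chess computation concrete, here’s a small sketch (the probabilities are the ones from the quote, not anything measured):

```python
from math import factorial

def multinomial_pmf(counts, probs):
    """Probability of seeing exactly these outcome counts, given
    fixed per-outcome probabilities (they must sum to 1)."""
    coef = factorial(sum(counts))        # multinomial coefficient:
    for c in counts:                     # n! / (c1! * c2! * ... * ck!)
        coef //= factorial(c)
    p = float(coef)
    for c, q in zip(counts, probs):
        p *= q ** c
    return p

# The quoted example: P(A wins)=0.40, P(B wins)=0.35, P(draw)=0.25.
# Chance that 12 games split exactly 7 / 2 / 3:
print(multinomial_pmf([7, 2, 3], [0.40, 0.35, 0.25]))  # ~0.0248
```

The answer, about 2.5%, illustrates the point in the text: the formula consumes the per-outcome probabilities, it doesn’t produce them; you have to know them before you can ask anything.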
But Magic isn’t chess. I’m not sure it makes sense to take the overall rate of drawn games (over the entire history of computable official timed matches, I suppose) to assess the points-earning rate of a specific deck, let alone a particularly fast or slow one. Unlike the binomial case, where it makes sense to test against the assumption that a match is a coin toss, it doesn’t make sense to obtain beforehand the probability of a draw for a deck when reliable probabilities for that deck are precisely what we’re trying to obtain. How would we get the expected probability of drawn matches? If we have a good way to get it, then we’re already claiming to have a good sample of results for that deck. But of course a sample cannot justify its own quality. So while I imagine versed people have ways to deal with such a problem (and they’re welcome to correct and enlighten me), I can’t myself affirm that it makes sense to calculate the points-per-match probability, which is actually the one we’d want if we were serious about it.
Of course, all that ignores intentional draws. IDs will skew the results in general, but even more those of the most successful decks, and to their detriment, since successful decks have a better expected points-per-match rate than the 1 point they gain from a drawn match.
But since almost everybody who does Magic stats ignores draws, acts as if winning or losing were the only results of a Magic match, and uses binomial tests, let’s go with that.
As we’ve seen, binomials deal with experiments “in which the probability of success is the same every time the experiment is conducted“. But as it tends to go with people, the more you test them, the more that assumption fails. For instance, if people were queried as to whether they’re alive or dead, with success being “alive”, the initial assumption would be that the probability of answering “dead” is the same every time.
In practice, the bigger the sample, the bigger the chance that, over time, the queried subjects have heard about the experiment beforehand, and probably the more “dead” answers we’d get, because people are funny that way. Even if you don’t ask people anything directly, the human factor intervenes. In mtg, that’s the opposite of a surprise effect: the more people play against a deck, and assuming they are, directly or not, influenced by its reputedly solid stats, the better prepared they are to beat it.
Also, the more a deck is adopted, the lower the average skill of its pilots is likely to be. Spaniel had to realize, over and over, that the more a deck became known as the best and got adopted, the worse it would fare from that point on, typically with a visible moment where it acquires that reputation followed by a clear decrease in its winrate. Of course, assuming a metagame were to stay the same, this would have to stabilize at some point; otherwise any deck would ultimately lose all of its matches, which makes no sense: “In any case, I am going to revise my expectations: now, I believe UW Control will further decline. The deck seems to be bipolar depending on who is playing it. UW mages at the beginning of the season were likely all really good players and smoked the competition with it. Then other people got on the bandwagon, and the deck started failing them. The interesting question is whether we would see a similar effect with other decks. For example, if only 1800+ players ran Jund, would its win percentage eclipse 70%? Or is UW an especially delicate deck to play? Unfortunately, I do not have a clever way of deducing that with the data I currently have.”
Meanwhile, while Spaniel is finding deck X on the decline, the player is finding that his ability, and that of others like him, in playing against X is improving steadily. He’s not really impressed by the statistician’s approach: the player understands intuitively, and out of experience, what’s going on, while the data looked at in isolation seems confounding.
And even if we limited the sample to a single tournament, we’d still have people learning a bit: as new decks become known and scouted, players improve their sideboarding and strategy against them over the course of the tournament. So it’s disputable that any sample of Magic results for a widely played deck satisfies the experience-independent success-rate condition required to apply the formulas. We already knew that: not only is Magic played by people, and people are changed by their experience, but Magic is a skill game, first and foremost. But let’s get past that and assume the data can be of sufficient quality for our goals.
Still, even ignoring all that and having a sample size “big enough” to get “statistical significance”, the test itself is quite unreliable:
“Never, ever, use the word “significant” in a paper.“
“Observation of a P value close to 0.05 means nothing more than ‘worth another look’.”
In an influential paper, David Colquhoun didn’t just criticize (and dismember) the misuse of p-values and statistical significance; he actually calculated the range of detection error rates of studies that limit themselves to p<0.05: “If you observe a P value close to 0.05, your false discovery rate will not be 5%. It will be at least 30% and it could easily be 80% for small studies.“ If you ever read about a study that produces strange results but is still taken seriously because it is done by “researchers”, “scientists”, and may even be “peer reviewed” (that is, just about any study you hear about on a daily basis), look no further: a lot of what we know is based on such pathetically weakly founded data. Most of this pseudo-research produces pseudo-results, pseudo-science and pseudo-scientific publications. That is, modern-day charlatanism. Oh, and when they produce non-spectacular results, you won’t hear about them; they’re probably still not solid enough, and the funds won’t be granted in case someone wanted to go the extra mile.
So what sample size would we need to actually get a low error rate? It depends on the level of accuracy we’re aiming for. Are we so deep that even a 55-45 advantage would be relevant to us? Let’s take a 60-40 winrate as our first target. The capacity of a sample to detect an effect is calculated as statistical power. The numbers shown there clearly show that even a 100-match sample has a good chance of failing to detect a 60% winrate, and that, as we’ve already hinted, a sample of 10 can only detect extreme probabilities. But with a sample of 1000 matches we’re good! Well, that’s a relief. But what if we wanted the minimal sample size for a desired accuracy? As can be computed there, even if we were ready to accept a 20% rate of detection error we’d still need 188 matches, and 776 for a 55% winrate. If we were serious, we’d want to consider 600, or 2 to 3 thousand for 55%. In other words, we’ll probably never get an extremely good sample of data for mtg: how could we get hundreds or thousands of relevant match results? We can’t get an army to play them; even if we had such an army, it wouldn’t be good enough at Magic; and if we took the results from Magic Online, we’d have to look at a widely played deck, that is, one suffering from poor average pilot skill and extremely focused adversity.
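For reference, the usual normal-approximation power formula can be sketched in a few lines. It lands a few matches away from the exact binomial calculators behind the figures above (roughly 194 vs. 188, and 783 vs. 776), which is close enough for the argument:

```python
from statistics import NormalDist

def sample_size(p_alt, p_null=0.5, alpha=0.05, power=0.80):
    """Normal-approximation sample size for a two-sided test of a
    true winrate p_alt against a 50/50 null, at the given power
    (power = 1 - the 20% detection-error rate discussed above)."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)   # significance threshold
    z_b = NormalDist().inv_cdf(power)           # power threshold
    num = z_a * (p_null * (1 - p_null)) ** 0.5 + z_b * (p_alt * (1 - p_alt)) ** 0.5
    return (num / (p_alt - p_null)) ** 2

print(round(sample_size(0.60)))  # matches needed to detect a 60% winrate
print(round(sample_size(0.55)))  # and for a 55% winrate
```

Note how halving the edge (60% down to 55%) roughly quadruples the required sample, since the distance to 50% enters the formula squared.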
Which means that every time someone has said that deck X beats deck Y, every single occurrence in the history of Magic (and in fact in the history of almost every game) has been an occurrence of someone talking out of their ass, from a statistical standpoint.
But can we really say that all of them had no idea what they were talking about, and that when they were right, it was solely by happenstance? No, and that includes many top players, probably some of the best players and deck designers ever. We Magic players know more than just the winrate stats of our decks, assuming we know those, which is generally not true. We’re often wrong, and yet we know more than just the stats. Worse: in fact we often don’t even take note of those during our test sessions, and we’re not necessarily all that wrong in that, since we basically never play enough personally to reach a satisfying number of tests in our sample. I’d still take those notes, and I generally do in a testing session, but that’s only a minor element in my assessment of a matchup.
4) “What I did right was that I chose to play the right deck.”
Said by : a lot of competitors, including pro players, right after they’ve just won a tournament.
Well, Bob, you won, so your deck must be awesome, right?
And you didn’t win for a while before that, so it must be that the decks you played back then were subpar…
This is a stats myth in disguise: the modest sample size of the number of games played in said tournament is forgotten, and it’s hard to resist the lazy explanation, the magical thinking our brain is so eager to deliver in order to put a clear, simple narrative on things. Note that, as we’ve seen, many people will think that the winner’s deck has to be (one of) the best. But in this case the statement is uttered time and again by famous, top-rated players, which shows not just how deficient we are at evaluating non-visualized proportions, but also how the winners themselves further reinforce that evaluation bias of ours.
5) “The deck is great, but too many people play it and the mirror is like a coin-toss.”
Said by : plebs.
So what’s the point? After all, it seems we could just pick the supposedly best decks, and while the “stats” can’t really justify the choice as such, it would seem that it doesn’t make much of a difference. But some would consider the mirror match to be a problem. I don’t think they should. Almost nobody would get a 50% chance of winning against Jon Finkel playing the same deck. In fact, even non-Finkel me clearly remembers a time, a long time even, when he had to face a sea of mirrors. That was the Black Summer (though that era really extended in both directions), and there never was a time when I won more, or with more consistency. Playing Necro against a sea of Necro, all ’96 long: great! And that makes sense to me not because of fuzzy data and stats, but because I almost never saw anybody play Necro to my satisfaction; those results were absolutely consistent with my understanding of what was going on in the matches. Necro, I guess, isn’t that easy a deck to pilot, and the differences in tuning matter. And it’s the same with every deck and era; it’s always been like that. When we face people of equal skill, the mirror is a coin toss, sure, but since we’re of equal skill it would be a coin toss in principle anyway. If we face people of better Magic skill, same thing: it doesn’t matter what we’ve chosen as our deck. “I’m an average Magic player at best” is what that statement really means.