Silph Study: #004

Published:
11.15.2016

Does Anything Influence Your Pokemon's Evolution Moveset?

"Is there ANYTHING I can do to influence a Pokemon's moveset?"

When a Pokemon evolves, its quick and charge moves are re-rolled according to unknown odds. Conventional wisdom holds that the moves are chosen randomly from those available to that Pokemon species. Unfortunately, no study has seriously examined the factors behind evolution movesets, and in the absence of data, myths and rumors have persisted.

Fortunately, the Silph Research group has taken a look!

Silph researchers recorded various attributes of their Pokemon before evolution, including their moves, appraisals, and Stardust power-up costs, along with the movesets they received after evolving. Over 10,000 evolutions were captured over the course of the study.

Findings

After thorough examination, the Silph Research group has come to two major conclusions:

Finding #1:

We have found no evidence of correlation between the following factors and post-evolution movesets:

Factor Examined | Correlation Found?
Pre-evolution Quick Move | No
Pre-evolution Charge Move | No
Trainer's Team | No
The Pokemon's Level (approximated using Stardust power-up cost) | No
Pokemon Nicknamed at time of evolution | No
Lucky Egg activated during evolution | No
Overall Appraisal rating | No
Highest IV category(ies) (Attack, Defense, or Stamina) | No
Pokemon has perfect IVs in Attack, Defense, or Stamina (Separately) | No
Whether the Pokemon has any perfect IV | No
Whether the Pokemon has all perfect IVs | No*

Our starting hypothesis (the null hypothesis) is that, after evolution, a Pokemon's Quick move and Charge move are selected in a uniformly random fashion from the moves available, that is, each move is equally likely and is selected at random from the available moves.

The data suggest that none of the factors we examined caused a significant deviation from uniform random selection. In other words, we found no sign that any of the factors above has a meaningful impact on the post-evolution moveset.

Finding #2:

We have found movesets to be evenly, randomly distributed post-evolution, meaning if a species has 3 possible post-evolution charge moves, each move has a 33.3% chance of being the Pokemon's new move.

So, travelers, if you see any evolution moveset myths floating around, share the knowledge: it's truly random, and in general has an equal chance every time. And that's no longer just a hunch!

* Data was scarce in the all-perfect-IVs sample group, so while our data did not contradict uniform random move assignment, it is possible that more data would reveal an effect we could not detect.


The Null Hypothesis and Methodology

Each potentially correlated pre-evolution attribute was examined one at a time. For each possible correlation, the null hypothesis applies to every combination of species and value of the variable being considered.

For example, consider the hypothetical "Trainer Team" correlation. Starting with Bulbasaur evolved by Team Mystic trainers, the null hypothesis would say that about one-third of the resulting Ivysaur would have Sludge Bomb as their charge move. This would be a single test of our hypothesis, comparing our observed count with the distribution expected under the null hypothesis.
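A single test of this kind can be sketched as a binomial calculation. The counts below are hypothetical (borrowed from the Ivysaur illustration in Note 1, not taken from the study data), and the helper names are ours:

```python
from math import comb

def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def binom_cdf(k: int, n: int, p: float) -> float:
    """Probability of k or fewer successes (the inclusive CDF)."""
    return sum(binom_pmf(i, n, p) for i in range(k + 1))

# Hypothetical counts: 45 Team Mystic Bulbasaur evolutions, 15 with Sludge Bomb.
n_evolutions = 45
n_sludge_bomb = 15
p_null = 1 / 3  # three possible charge moves, equally likely under the null

expected = n_evolutions * p_null
cum_prob = binom_cdf(n_sludge_bomb, n_evolutions, p_null)

print(f"expected Sludge Bomb count under the null: {expected:.1f}")
print(f"Pr(X <= {n_sludge_bomb}) = {cum_prob:.4f}")
```

A cumulative probability very close to 0 or 1 would mean the observed count sits far out in a tail of the expected distribution.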

Since there is a similar test for every permutation of pre-evolution variables, there is a very large number of tests. Each test consists of a starting species, the independent variable being tested, and a post-evolution move. (See Note 1) Analysis was then performed on the entire system of tests as a whole, to see if any of the hypothesized factors were impacting multiple species.

Given the null hypothesis, one can calculate a metric of extremeness for each test. If the null hypothesis is true, extremeness should be randomly and uniformly distributed between 0 and 1. (See Note 2) However, if there is a systematic association between any independent variable and post-evolution moves, the distribution of extremeness will deviate from a uniform distribution.

Thus we can test our null hypothesis by testing whether our values of extremeness appear to be drawn from a uniform distribution. This was conducted using the Anderson-Darling Test.
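As a rough illustration of that check, the sketch below computes the Anderson-Darling statistic against a Uniform(0, 1) distribution and estimates a p-value by simulation. The Monte Carlo p-value is our assumption for the sake of a self-contained example; the study does not say how its p-values were computed. The sample data and function names are hypothetical.

```python
import math
import random

def anderson_darling_uniform(values):
    """A^2 statistic for the hypothesis that values come from Uniform(0, 1)."""
    eps = 1e-12  # keep values strictly inside (0, 1) so the logs are finite
    u = sorted(min(max(v, eps), 1.0 - eps) for v in values)
    n = len(u)
    total = 0.0
    for i, ui in enumerate(u, start=1):
        # Pair the i-th smallest value with the i-th largest value.
        total += (2 * i - 1) * (math.log(ui) + math.log(1.0 - u[n - i]))
    return -n - total / n

def monte_carlo_p_value(values, n_sim=1000, seed=0):
    """Share of simulated uniform samples with A^2 at least as large as observed."""
    rng = random.Random(seed)
    observed = anderson_darling_uniform(values)
    n = len(values)
    hits = 0
    for _ in range(n_sim):
        simulated = [rng.random() for _ in range(n)]
        if anderson_darling_uniform(simulated) >= observed:
            hits += 1
    return hits / n_sim

# Hypothetical data: one uniform-looking sample and one that piles up near 0.
rng = random.Random(1)
uniform_like = [rng.random() for _ in range(200)]
skewed = [x**2 for x in uniform_like]

a2_uniform = anderson_darling_uniform(uniform_like)
a2_skewed = anderson_darling_uniform(skewed)
p_skewed = monte_carlo_p_value(skewed)

print(f"A^2 (uniform-like): {a2_uniform:.3f}")
print(f"A^2 (skewed):       {a2_skewed:.3f}, simulated p-value: {p_skewed:.4f}")
```

The statistic weights the tails heavily, which suits this use: a real effect would push many tests toward extremeness values near 0 or 1.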

To illustrate how this test was performed, we will show the results from the variable 'Trainer Team' at each step:

Step 1: Calculate Cumulative Probabilities

Here is the cumulative probability ("extremeness") of all the Trainer Team tests. We have visualized it for reference, below:

Cumulative probabilities like those for Trainer Team are calculated in the same way for every pairing of independent variable and post-evolution move.

Step 2: Calculate the p-value of the Distribution

We then measure the extremeness of the distribution above compared to a uniform distribution using the Anderson-Darling test.

Hypothesized Variable Anderson-Darling p-value
Trainer Team 0.4686

The p-value of the distribution in this case is not below 0.005 (the threshold required for significance at the 95% level after correcting for multiple comparisons), so the distribution does not differ from the random (uniform) distribution in any meaningful way.

Similarly, every other variable examined failed to differ from a uniform distribution in this way. Consequently, our tests have failed to detect any meaningful correlation between the examined factors and post-evolution movesets.


Notes

Note 1:

Tests were formulated to ensure that each is independent of the others. For example, consider 45 Ivysaur, of which 15 learned Sludge Bomb, 18 learned Seed Bomb, and 12 learned Power Whip. First, we would test whether 15 out of 45 (probability 1/3) is significant for Sludge Bomb, and then exclude the Sludge Bomb Ivysaur from further tests. Then we would test 18 out of 30 (probability 1/2) for Seed Bomb. In principle, there would also be a test for Power Whip, but since this would be uninformative (all 12 out of the remaining 12), it is ignored entirely and not counted as a test. (The multinomial distribution was not used because it is not obvious how to apply the discretization correction described in Note 2 to it.)
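The decomposition in this note can be sketched as follows. The function name and the tuple layout are illustrative choices of ours, and the counts are the ones from the Ivysaur example:

```python
def sequential_binomial_tests(counts):
    """Split move counts into independent (move, successes, trials, null_p) tests.

    Each move is tested against the evolutions not already claimed by an
    earlier move. The last move is skipped because it is fully determined.
    """
    tests = []
    remaining_trials = sum(counts.values())
    remaining_moves = len(counts)
    for move, observed in counts.items():
        if remaining_moves == 1:
            break  # e.g. 12 out of the remaining 12: uninformative
        tests.append((move, observed, remaining_trials, 1 / remaining_moves))
        remaining_trials -= observed
        remaining_moves -= 1
    return tests

ivysaur = {"Sludge Bomb": 15, "Seed Bomb": 18, "Power Whip": 12}
tests = sequential_binomial_tests(ivysaur)

for move, successes, trials, null_p in tests:
    print(f"{move}: {successes} out of {trials} (null probability {null_p:.3f})")
```

Each resulting tuple can then be treated as its own binomial test.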

Note 2:

Our measure of "extremeness" is constructed from the cumulative probability of each test, through a calculation related to that of the observation's p-value. We built this measure from the p-value because, for continuous distributions when the null hypothesis is true, we know that randomly sampled p-values are uniformly distributed. (The p-value is the probability of observing a sample at least as extreme as the one that was seen.) However, because the binomial distribution is discrete, the binomial CDF (and any p-values) are not uniformly distributed, and the specific values produced depend on the parameters of each individual test's binomial (i.e. the number of evolutions and the probability of each move). This discretization introduces spurious deviations, so a correction must be applied to create a statistic that is still distributed uniformly when the null hypothesis is true.

As a concrete example, consider five trials, each producing either 'A' or 'B' with equal probability, in which we observe two 'A' out of five. Each permutation is equally likely. When calculating the binomial CDF, clearly Pr(0 out of 5) and Pr(1 out of 5) are "fewer" successes and thus included in the accumulation. But there are also a number of equally likely permutations that result in the same two-out-of-five observation (e.g. ABBAB is the same as BAABB). In effect, these equivalent permutations are "tied" with each other.

The textbook definition of the CDF is inclusive: it is the probability of getting a number of successes less than or equal to a certain value. If we were to use the CDF naively, our values would always include all of the "tied" observations, equivalent to always re-ordering the permutations so that our observed permutation was "largest" among them. The calculation therefore becomes biased high. This would be fine for a significance test (especially with large sample sizes) but does not fit our current purpose. Similarly, excluding the ties leads to a low bias. Selecting the average of the inclusive and exclusive versions creates discretization artifacts at certain common values (e.g. 0.5). None of these is appropriate for our use case, where we know the resulting distribution should be uniformly random.

One solution is to select a uniformly random value within the range between the exclusive and inclusive values. This is roughly equivalent to random tie-breaking among the cluster of "tied" permutations, and produces a statistic with the appropriate properties. (Note that the random change is always less than the change that would be introduced by subtracting a single 'successful' observation from the binomial test.) This solution was used to produce the histograms and perform the Anderson-Darling Tests. Since this procedure is random, the analysis has been restricted to a particular default PRNG state. We also manually verified that, in general, our results did not differ when using different random numbers.
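A minimal sketch of this randomized tie-breaking, assuming a plain binomial model for each test, is shown below. The function names and the seed are hypothetical and are not the values used in the study.

```python
import random
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def randomized_extremeness(k, n, p, rng):
    """Uniform draw between the exclusive and inclusive binomial CDF at k."""
    exclusive = sum(binom_pmf(i, n, p) for i in range(k))  # Pr(X < k)
    inclusive = exclusive + binom_pmf(k, n, p)             # Pr(X <= k)
    return rng.uniform(exclusive, inclusive)

def simulate_null(n, p, n_tests, rng):
    """Extremeness values for tests whose data really follow the null."""
    values = []
    for _ in range(n_tests):
        k = sum(rng.random() < p for _ in range(n))  # one simulated test
        values.append(randomized_extremeness(k, n, p, rng))
    return values

rng = random.Random(2016)  # fixed seed so the run is repeatable (hypothetical)

# Two 'A' out of five at p = 0.5: the draw lies between 6/32 and 16/32.
value = randomized_extremeness(2, 5, 0.5, rng)
print(f"extremeness for 2 of 5: {value:.4f} (range 0.1875 to 0.5)")

# Under the null, the values should look uniform even for a tiny n.
null_values = simulate_null(5, 0.5, 5000, rng)
mean_value = sum(null_values) / len(null_values)
share_below_quarter = sum(v < 0.25 for v in null_values) / len(null_values)
print(f"mean: {mean_value:.3f}, share below 0.25: {share_below_quarter:.3f}")
```

If the correction works as described, the mean should sit near 0.5 and about a quarter of the values should fall below 0.25, despite each test having only six possible outcomes.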

Note 3:

Though our analysis indicates that movesets are assigned uniformly at random with respect to the variables we accounted for, the possibility remains that other variables, or factors specific to individual species or moves, might have an effect (for example, the well-known Eevee Easter egg triggered by the nicknames Rainer, Sparky, and Pyro).

Note 4:

The test is designed to check whether our data fit the null hypothesis. This means that it is possible to reject the null hypothesis when the data seem to come from a non-uniform distribution, but it is impossible to prove that the null hypothesis is true. However, based on simulated data, we believe our test with this sample size is powerful enough to detect differences of about 5 percentage points or more in the probability of being assigned each move.


Publication

This finding was shared on our subreddit on Nov. 15, 2016.