Why we completely changed the A/B testing methodology

Jun 5, 2018 IN Data Science

In this article, I want to explain why we decided to completely change the way we perform A/B tests in Pixel Federation. I will be comparing the method we used before to the new method in the context of A/B testing new features in games.

We usually test two versions of the game. Version A is the current version of the game. Version B differs from it by a new feature, a rebalance or some other change. Until about a year ago, we used null hypothesis significance testing (NHST), also known as frequentist testing. Specifically, we used the Chi-squared test for conversion metrics and the t-test for revenue metrics like ARPU (average revenue per user) or ARPPU (average revenue per paying user). Now, we are using Bayesian statistics.

I will mention these three reasons why we switched from NHST to Bayes:

  • interpretability of results
  • sample size and setup of the test
  • testing revenue metrics

But these reasons are valid only in a business context. They can be irrelevant in other applications like scientific research.


How to interpret the results?

Frequentist and Bayesian statistics take very different approaches to probability and hypothesis testing. So, the results we get from them with the same data are also quite different.

Frequentist tests

The frequentist tests give us p-values and confidence intervals as results. By comparing the p-value to a predetermined alpha level (usually 5%) we conclude whether the difference is significant or not. This approach has been used for over 100 years and it works fine. But there is one big problem with p-values. A lot of people think they understand them, but only a few actually do. And this can lead to a lot of problems like unintentional p-hacking.

The reason is that p-values are very unintuitive. If we want to test that B is better than A (or A better than B), we have to form a null hypothesis assuming the opposite, which is that B is equal to A. Then we collect the data and observe how surprising they are under the assumption of B being equal to A. The p-value then tells us the probability of observing data at least as surprising as ours. By surprising under the null hypothesis, I mean that there is a big difference between A and B.

In other words, the p-value tells us what the probability would be of getting an outcome as extreme as (or more extreme than) the actual observed outcome if we ran an A/A test. This is the simplest description of p-value I know but it still seems too confusing.

I am not saying that complicated things are not useful. But in our business, we also need managers and game designers to understand our results.

If we get a p-value lower than alpha, everything is fine. We conclude that B is significantly better, which has a clear business value. Everyone will be happy that we ran a successful A/B test and no one cares that they do not understand what a p-value of 3.4% actually means.

But what if the p-value is higher than alpha? It is hard to draw a conclusion of any value from this outcome. Not only can we not say that one version is better, we also cannot say that they are the same. Now, our only result is a p-value equal to 7%. We also have a confidence interval, but that is useless since it contains both negative and positive numbers. Chances are that people from the production team will not understand what a p-value of 7% means, and if you somehow manage to explain it to them, it can be even worse. Because who cares what the probability is of getting an outcome more extreme than the observed outcome under an assumption which is the opposite of what we hope for?

In this situation, it can be tempting to say “Well, 7% is quite close to 5% and it looks like there is an effect. Let’s say that B is better”. But you cannot do that, because p-values are uniformly distributed between 0 and 1 if there is no real effect. That means that if there really is no difference between A and B, a p-value near 7% is just as likely as one near 80% or any other value. So, the only thing you can declare while keeping your integrity as a statistician is that we do not have enough data to make any decision in the current setup of the test.
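The uniform distribution of p-values under the null hypothesis can be checked with a small A/A simulation. The following is a minimal sketch in plain Python, not the code we use: the function names, the 20% conversion rate and the group size are assumptions made for illustration, and a pooled two-proportion z-test stands in for the Chi-squared test (for a 2x2 table the two are equivalent when no continuity correction is applied).

```python
import math
import random


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value of a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # nobody (or everybody) converted: no evidence of a difference
    z = (conv_b / n_b - conv_a / n_a) / se
    return math.erfc(abs(z) / math.sqrt(2))


def simulate_aa_p_values(n_tests=2000, n_players=1000, rate=0.2, seed=1):
    """Run A/A tests: both groups share the same true conversion rate."""
    rng = random.Random(seed)
    p_values = []
    for _ in range(n_tests):
        conv_a = sum(rng.random() < rate for _ in range(n_players))
        conv_b = sum(rng.random() < rate for _ in range(n_players))
        p_values.append(two_proportion_p_value(conv_a, n_players, conv_b, n_players))
    return p_values


p_values = simulate_aa_p_values()

# Share of p-values falling into each tenth of the interval [0, 1].
decile_shares = [0.0] * 10
for p in p_values:
    decile_shares[min(int(p * 10), 9)] += 1 / len(p_values)

# Share of A/A tests that would be called "significant" at alpha = 5%.
false_positive_rate = sum(p < 0.05 for p in p_values) / len(p_values)

print([round(share, 3) for share in decile_shares])
print(round(false_positive_rate, 3))
```

Each tenth of the interval should hold roughly 10% of the p-values, with some unevenness because conversion counts are discrete. A p-value of 7% is therefore no more typical of "almost an effect" than a p-value of 80%.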

This is very disappointing because A/B tests are quite expensive. From the manager’s point of view, when we run a test for a few weeks and then analysts conclude that we know basically nothing it can look like a waste of time and money. I saw how this leads to frustration and unwillingness to do A/B tests in the future. And this is not what we want.

Bayesian tests

The Bayesian approach solves this very well, because it gives us meaningful and understandable information in any situation. It does not rely on any null hypothesis, and this simplifies the interpretation a lot.

Instead of p-value, it gives us these three numbers:

  • P(B > A) – probability that version B is better than A
  • E(loss|B) – expected loss with version B if B is actually worse
  • E(uplift|B) – expected uplift with version B if B is actually better

E(loss|B) and E(uplift|B) are measured in the same units as the tested metric. For example, in the case of conversion, it is a percentage. Another useful result is the highest density interval, which is the Bayesian counterpart of the confidence interval in NHST. To better explain this, let’s consider an example. We do a test and get these results:

[Table: results of the example test. P(B > A) = 94.9%, E(loss|B) = 0.04%, E(uplift|B) = 3%]

It means that if we select version B and we make a mistake (probability 5.1%) we can expect to lose 0.04% resulting in 19.96% conversion. On the other hand, if we choose correctly (probability 94.9%), we can expect to gain 3% resulting in 23% conversion. This is a simple result that can be well understood by both game designers and managers.
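For readers who want to see where such numbers come from, here is a minimal Monte Carlo sketch in plain Python. It is not our calculator: it assumes uniform Beta(1, 1) priors, takes the expected loss to be the posterior mean of max(A − B, 0) and the expected uplift to be the posterior mean of max(B − A, 0), and uses hypothetical counts (200 of 1 000 players converting in A, 230 of 1 000 in B) that I picked because they should land close to the example above.

```python
import random


def compare_conversion(conv_a, n_a, conv_b, n_b, draws=50_000, seed=1):
    """Estimate P(B > A), expected loss and expected uplift of choosing B.

    Assumes Beta(1, 1) priors, so each posterior is
    Beta(1 + conversions, 1 + non-conversions).
    """
    rng = random.Random(seed)
    wins_b = 0
    loss_sum = 0.0
    uplift_sum = 0.0
    for _ in range(draws):
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        diff = rate_b - rate_a
        if diff > 0:
            wins_b += 1
            uplift_sum += diff   # B is better in this draw: we gain diff
        else:
            loss_sum += -diff    # B is worse in this draw: we lose -diff
    return {
        "p_b_better": wins_b / draws,
        "expected_loss_b": loss_sum / draws,
        "expected_uplift_b": uplift_sum / draws,
    }


# Hypothetical data: 20.0% conversion in A, 23.0% in B, 1 000 players each.
result = compare_conversion(conv_a=200, n_a=1000, conv_b=230, n_b=1000)
for name, value in result.items():
    print(f"{name}: {value:.4f}")
```

All three numbers come from the same set of posterior draws, so they stay consistent with each other: the expected uplift minus the expected loss equals the posterior mean difference between B and A.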

To decide whether to use A or B, we should determine a decision rule before calculating the results of the test. You can decide based on the P(B > A) but it is much smarter to use E(loss|B) and declare B as the winner only if E(loss|B) is smaller than some predefined threshold.

This decision rule is not as clear as with the p-value, but it is actually a good thing. The “p-value < 5%” has become so wired in the brain of every statistician that we often do not even think about whether it makes sense in our context. With Bayesian testing we should discuss the threshold for E(loss|B) with game designers before running the test. Then we should set it to a number so low that we do not care if we make an error smaller than this threshold. For example, when we tested levels in our match-3 game, we used the threshold of 0.1% for E(loss|B) on the conversion to next level.
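Written down as code, such a rule is only a few lines. The sketch below is illustrative: the function name is made up, and the symmetric check for keeping A is my own addition (the rule described above is stated only for B). The threshold is expressed as a fraction, so 0.1% becomes 0.001.

```python
def choose_variant(expected_loss_a, expected_loss_b, threshold):
    """Pick a winner only if its expected loss is below the agreed threshold.

    Returns "A", "B", or None when neither variant passes the rule yet.
    """
    if expected_loss_b < threshold and expected_loss_b <= expected_loss_a:
        return "B"
    if expected_loss_a < threshold:
        return "A"
    return None


# Threshold of 0.1% agreed with game designers before the test.
THRESHOLD = 0.001

# Expected loss of 0.04% for B and 3% for A: B passes the rule.
print(choose_variant(expected_loss_a=0.03, expected_loss_b=0.0004, threshold=THRESHOLD))

# Neither variant is below the threshold: keep collecting data.
print(choose_variant(expected_loss_a=0.01, expected_loss_b=0.02, threshold=THRESHOLD))
```

The important part is not the code but the order of events: the threshold is fixed before the results are calculated.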

But even if the predefined decision rule does not distinguish between the variations (because we do not have enough data or we set the threshold too low), we can always make at least an informed decision. We can compare how much we can gain with the probability P(B > A) against how much we can lose with the probability 1 – P(B > A). This is the key advantage of Bayesian testing in business applications. No matter the results, you always get meaningful information that is easy to understand.


Sample size and setup of the test

So far, I have only mentioned issues about the understanding of the test results. But there is another very practical reason to use Bayesian testing regarding the sample size. The most often mentioned drawback of NHST is the requirement to fix the sample size before running the test. The Bayesian tests do not have this requirement. Also, they usually work better with smaller sample sizes.

Frequentist tests

The reason that fixing the sample size is required in NHST is that the p-value depends on how we collect the data. We should calculate the p-value differently if we set up the test with the intention to gather observations from 10 000 players than if we intend to gather data until we observe 100 conversions. This sounds counterintuitive and it is often disregarded. One would expect the data themselves to be sufficient, but that is not the case if you want to calculate p-values correctly. I am not going to try to explain it here since it is very well analyzed in a paper (see Figure 10) by John K. Kruschke and explained in even more detail in chapter 11 of his book Doing Bayesian Data Analysis. Another reason to fix the sample size is that we want some estimate of the power of the test.

This means that we have to determine how many players will be included in the test before collecting the data and then evaluate the results only after the predefined sample size has been reached. This is very impractical because if we underestimate the effect, we will be running the test for too long. On the other hand, if we overestimate it, we will end up with insignificant results, leading to the situation I described above. Since we usually have no idea about the size of the effect, it is very hard to set up the test correctly and efficiently. And since A/B tests are expensive, we want to set them up as efficiently as possible. This topic is very well covered in an article by Evan Miller. He clearly shows how “peeking” at the results before reaching the fixed sample size can considerably increase the type I error rate. And this completely invalidates the result. He also suggests two possible approaches to fix the “peeking problem”: sequential testing or Bayesian testing.
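The effect of peeking is easy to reproduce in a simulation. The sketch below is my own illustration in plain Python, not code from the article mentioned above: it runs A/A tests (no real difference), looks at the p-value after every batch of players, and compares how often a test is significant at any look with how often it is significant at the planned end. The batch size, number of looks and conversion rate are arbitrary assumptions, and a pooled two-proportion z-test stands in for the Chi-squared test.

```python
import math
import random


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value of a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / se
    return math.erfc(abs(z) / math.sqrt(2))


def simulate_peeking(n_tests=1000, batch=200, n_looks=10, rate=0.2,
                     alpha=0.05, seed=1):
    """Return (type I error rate with peeking, type I error rate without)."""
    rng = random.Random(seed)
    significant_at_any_look = 0
    significant_at_end = 0
    for _ in range(n_tests):
        conv_a = conv_b = 0
        hit = False
        for look in range(1, n_looks + 1):
            # A new batch of players arrives in both groups (same true rate).
            conv_a += sum(rng.random() < rate for _ in range(batch))
            conv_b += sum(rng.random() < rate for _ in range(batch))
            n = look * batch
            significant = two_proportion_p_value(conv_a, n, conv_b, n) < alpha
            hit = hit or significant
        significant_at_any_look += hit
        significant_at_end += significant  # result of the final, planned look
    return significant_at_any_look / n_tests, significant_at_end / n_tests


peeking_rate, fixed_rate = simulate_peeking()
print(f"significant at any of 10 looks: {peeking_rate:.3f}")
print(f"significant at the planned end: {fixed_rate:.3f}")
```

With the fixed sample size, the false positive rate should stay close to alpha. Stopping at the first significant look should push it well above that, even though the two groups are identical.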

Bayesian tests

Bayesian tests do not require fixing the sample size to give valid results. You can even evaluate the results repeatedly (which is considered almost blasphemy with NHST) and the results will still hold. But I have to write this with a big warning! Some articles claim that the Bayesian approach is “immune to peeking”. This can be very misleading, because it suggests the type I error rate is kept even if we regularly check the results and stop the test prematurely when they look significant. This is just not true. No statistical method can do that if the stopping rule is based on observing extreme values. The fact is that Bayesian tests never kept the type I error rate bounded in the first place. They work differently. Instead, they keep the expected loss bounded. It is shown in this article by David Robinson that the expected loss in Bayesian tests is bounded even when peeking. So, the claim is partly true. Just be aware that being “immune to peeking” means one thing for NHST and something different for Bayesian tests.

I put the warning here in case you care specifically about the type I error rate. In that case, Bayesian testing may not be for you.

But I am convinced that the type I error rate is not very important for us in the context of A/B testing new features in games. Sure, it is super important in scientific research like medicine. You obviously do not want to declare a treatment effective when it actually does nothing. But the situation is different in A/B testing. In the end we always choose A or B. And if there really is no difference or the difference is very small, we do not care that we made a type I error. We only care if the error is too big. So, in the context of A/B testing it makes much more sense to bound the expected loss instead of the type I error rate.

On top of this, there is another practical advantage of Bayesian tests. They usually require a considerably smaller sample size to achieve the same power compared to NHST. This can save us a lot of money, because we can make decisions faster.

The following plots show simulations of conversion tests. We simulated 1 000 Chi-squared tests and 1 000 Bayesian tests for each sample size and calculated the percentage of tests that correctly declared B as the winner. That tells us the approximate power of the test with a specific sample size.

The dotted line at 80% shows the power usually required of a test. We can see that the sample size required to reach 80% power is considerably lower for the Bayesian tests (less than half in the first simulation and 85% smaller in the second one). On the other hand, if we plotted the situation where the conversion of A and B is the same, NHST would have better results: with alpha equal to 5%, it correctly declines to declare a winner in 95% of tests at any sample size.

For the Chi-squared test, we used alpha equal to 5%. The Bayesian test declared B as the winner when the expected loss was lower than 0.1%.

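[Figure: two plots of simulated power against sample size for the Chi-squared and Bayesian tests, with a dotted line at 80% power]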
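A scaled-down version of such a simulation can be sketched in plain Python. This is not the code behind our plots, and every setting in it is an assumption: true conversion rates of 20% and 23%, 200 simulated tests per sample size, Beta(1, 1) priors, a pooled two-proportion z-test in place of the Chi-squared test, and a normal approximation to the difference of the two Beta posteriors so that the expected loss has a closed form.

```python
import math
import random


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value of a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / se
    return math.erfc(abs(z) / math.sqrt(2))


def beta_mean_var(conv, n):
    """Mean and variance of the Beta(1 + conv, 1 + n - conv) posterior."""
    a, b = 1 + conv, 1 + n - conv
    return a / (a + b), a * b / ((a + b) ** 2 * (a + b + 1))


def expected_loss_b(conv_a, n_a, conv_b, n_b):
    """Approximate E[max(rate_A - rate_B, 0)].

    Assumption: rate_B - rate_A is treated as normally distributed.
    """
    mean_a, var_a = beta_mean_var(conv_a, n_a)
    mean_b, var_b = beta_mean_var(conv_b, n_b)
    mu = mean_b - mean_a
    sigma = math.sqrt(var_a + var_b)
    z = mu / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
    cdf_neg = 0.5 * math.erfc(z / math.sqrt(2))  # P(Z < -z)
    return sigma * pdf - mu * cdf_neg


def simulate_power(n_per_group, rng, rate_a=0.20, rate_b=0.23, n_tests=200,
                   alpha=0.05, loss_threshold=0.001):
    """Share of simulated tests in which each method declares B the winner."""
    nhst_wins = bayes_wins = 0
    for _ in range(n_tests):
        conv_a = sum(rng.random() < rate_a for _ in range(n_per_group))
        conv_b = sum(rng.random() < rate_b for _ in range(n_per_group))
        p = two_proportion_p_value(conv_a, n_per_group, conv_b, n_per_group)
        if conv_b > conv_a and p < alpha:
            nhst_wins += 1
        if expected_loss_b(conv_a, n_per_group, conv_b, n_per_group) < loss_threshold:
            bayes_wins += 1
    return nhst_wins / n_tests, bayes_wins / n_tests


rng = random.Random(1)
results = {n: simulate_power(n, rng) for n in (500, 1000, 2000, 4000)}
for n, (nhst_power, bayes_power) in results.items():
    print(f"n = {n:>4}: NHST {nhst_power:.2f}, Bayes {bayes_power:.2f}")
```

With these settings, the expected loss rule is the more permissive of the two: for groups of this size, any result that is significant in favour of B at alpha = 5% also has an expected loss far below 0.1%. That is exactly why it reaches a given power sooner, and also why it declares a winner more often when there is no real difference.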


How can we test revenue data?

Testing revenue metrics like ARPU is always difficult due to the low percentage of payers and frequent outliers (big spenders). Here, I will be comparing the t-test with the Bayesian approach.

Frequentist tests

We were using the t-test for testing ARPU before switching to the Bayesian approach. Its one issue is the assumption that the data are normally distributed. When testing ARPU, the data are the aggregated revenue from each player, and these are never normal in the free-to-play business. Usually more than 95% of observations are zero and the rest are heavily skewed. The test still works as promised, meaning it keeps the type I error rate below alpha, but the power of the t-test dwindles when the conversion is low and the variance is high.

The assumption of normality was the first impulse that motivated me to look for alternatives to the t-test. I first looked at several other NHST tests like the Mann-Whitney U test, the permutation test or zero-inflated regression models. But none of them worked better than the simple t-test in our use case. This finally led me to Bayesian statistics.

Bayesian tests

With the Bayesian approach, we have a model designed specifically for revenue data in the freemium business. We use it for testing ARPU or ARPPU. It assumes that revenue from payers is exponentially distributed, which is much closer to reality than the normal distribution. Based on our simulations, this test has higher power than the t-test in situations representative of our games. This specific model was published by Chris Stucchio from Visual Website Optimizer in this paper. It also describes in detail the whole methodology of Bayesian A/B testing.
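To give an idea of how such a model can be evaluated, here is a minimal sketch in plain Python. It is my own simplified reading of the approach, not our implementation and not code from the paper: it assumes a Beta(1, 1) prior on the conversion rate, a Gamma(shape 1, rate 1) prior on the rate of the exponential distribution of payments, and hypothetical input data. With these conjugate priors, the posterior of the conversion rate is a Beta distribution and the posterior of the exponential rate is a Gamma distribution, so ARPU can be sampled as conversion rate times mean payment.

```python
import random


def sample_arpu(n_players, n_payers, total_revenue, rng,
                prior_alpha=1.0, prior_beta=1.0,
                prior_shape=1.0, prior_rate=1.0):
    """Draw one ARPU value from the posterior of a Bernoulli-exponential model."""
    # Conversion to payer: Bernoulli likelihood with a Beta prior.
    conversion = rng.betavariate(prior_alpha + n_payers,
                                 prior_beta + n_players - n_payers)
    # Payment size: exponential likelihood with a Gamma prior on its rate.
    # random.gammavariate takes (shape, scale), and scale = 1 / rate.
    payment_rate = rng.gammavariate(prior_shape + n_payers,
                                    1.0 / (prior_rate + total_revenue))
    mean_payment = 1.0 / payment_rate
    return conversion * mean_payment


def compare_arpu(group_a, group_b, draws=20_000, seed=1):
    """Estimate P(B > A), expected loss and expected uplift of B in ARPU."""
    rng = random.Random(seed)
    wins_b = 0
    loss_sum = 0.0
    uplift_sum = 0.0
    for _ in range(draws):
        diff = sample_arpu(*group_b, rng) - sample_arpu(*group_a, rng)
        if diff > 0:
            wins_b += 1
            uplift_sum += diff
        else:
            loss_sum += -diff
    return {
        "p_b_better": wins_b / draws,
        "expected_loss_b": loss_sum / draws,
        "expected_uplift_b": uplift_sum / draws,
    }


# Hypothetical groups: (players, payers, total revenue).
group_a = (10_000, 300, 3_000.0)   # observed ARPU 0.30
group_b = (10_000, 330, 3_960.0)   # observed ARPU 0.396
arpu_result = compare_arpu(group_a, group_b)
for name, value in arpu_result.items():
    print(f"{name}: {value:.4f}")
```

The model only needs three numbers per group (players, payers and total revenue), because these are sufficient statistics for the Bernoulli and exponential likelihoods.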


How it works in practice

After talking about all the reasons for the change I should also mention how we use it in practice. For now, we are using two separate Bayesian models for two different types of tests. Without detailed explanations (which would need a separate article), I will just list what we use right now.

Instead of the Chi-squared test, we use a model that assumes that the conversion of each player follows a Bernoulli distribution. All the formulas are from the following sources:

For revenue data, we have an extension of the conversion model. It separately estimates the conversion rate with the Bernoulli distribution and then the paid amount using the Exponential distribution. I implemented it based on the already mentioned paper:

The code available with Kruschke’s book was also very useful for understanding Bayesian statistics. I used parts of it to calculate the highest density intervals:
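For readers without the book at hand, one common way to estimate a highest density interval from posterior samples is to sort them and take the narrowest window that contains the required share of samples. The sketch below is a plain Python illustration of that idea, written for this article and not taken from the book's code. The function name and the example posterior (20 conversions out of 100 players with a uniform prior) are assumptions.

```python
import math
import random


def highest_density_interval(samples, mass=0.95):
    """Narrowest interval that contains `mass` of the samples."""
    ordered = sorted(samples)
    n = len(ordered)
    window = math.ceil(mass * n)  # number of samples the interval must cover
    if window < 1 or window > n:
        raise ValueError("mass must be in (0, 1] and samples must not be empty")
    # Slide a window of fixed sample count and keep the narrowest one.
    best_start = min(range(n - window + 1),
                     key=lambda i: ordered[i + window - 1] - ordered[i])
    return ordered[best_start], ordered[best_start + window - 1]


# Example posterior: 20 conversions out of 100 players, Beta(1, 1) prior.
rng = random.Random(1)
posterior = [rng.betavariate(1 + 20, 1 + 80) for _ in range(20_000)]
low, high = highest_density_interval(posterior, mass=0.95)
print(f"95% HDI of the conversion rate: {low:.3f} to {high:.3f}")
```

For a symmetric posterior the result is close to the equal-tailed interval. For skewed posteriors, such as those of revenue metrics, the highest density interval is shifted towards the more probable values.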

Using these sources, I implemented A/B test calculators in R. You can give them a try here:


Summary

Using Bayesian A/B testing, we can now carry out tests faster with more actionable results. Let me summarize the advantages below:

Frequentist | Bayesian
p-value is difficult to understand and has no business value | results have clear value and are easy to understand
we cannot make an informative conclusion from an insignificant test | we always get valid results and can make at least an informed decision
test is not valid without fixing the sample size | fixing the sample size is not required
bigger sample size is required | smaller sample size is sufficient
t-test is not efficient for testing ARPU | we have a more efficient test designed specifically for revenue data

But what are the disadvantages and why is it not used more often? In my opinion, the main reason is that it is far more complicated to implement than traditional NHST tests, which can be calculated using one function in Excel. Another reason some are hesitant to use the Bayesian approach is the prior distribution. Prior and posterior distributions are key elements of Bayesian statistics. You have to determine the prior distribution before doing any calculations. In Pixel Federation, we are currently using non-informative priors for A/B testing purposes.

In conclusion, I want to repeat that I am not claiming that frequentist statistics is generally inferior to Bayesian. They are both useful for different purposes. In this article, I wanted to argue that the Bayesian approach is more appropriate in business applications.

If you have faced similar problems and solved them differently, or you have used Bayesian statistics and have some experience you want to share, please let me know at vgregor@pixelfederation.com or message me on LinkedIn.

Viktor Gregor
Data Scientist