Analyzing experiment results

Summary of A/B test course by Google on Udacity (Part 4)

Analyzing results is the last part of the A/B testing course by Google on Udacity. It covers what we can and cannot conclude from the results of an experiment: running sanity checks, evaluating single and multiple metrics, and some gotchas of analyzing metrics. You can refer to my notes on the previous lectures 1&2, 3 and 4.

1. Sanity check

Before interpreting the results of an experiment, we need to run sanity checks.

There are two main types of checks:

  • Population sizing metrics, based on your unit of diversion: check that the control and experiment populations are actually comparable.
  • Invariant metrics: metrics that should not change when you run the experiment.

E.g.: checking whether the control and experiment groups are the same size.

In the above example, the observed fraction is 0.5104, which is not within the interval [0.4973, 0.5027] expected for an even split, so the control and experiment groups are not similar in size. The sanity check fails. In this case, do not interpret the results yet; go back and find out why it fails.

Most common reasons: data capture, experiment setup.
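The population sizing check above can be sketched in a few lines of Python. This is only a minimal sketch: the function name is made up for illustration, the group counts are assumed values chosen so that the observed fraction comes out at about 0.5104, and it assumes an intended 50/50 split with a normal approximation at the 95% level.

```python
import math

def population_sizing_check(n_control, n_experiment, expected_fraction=0.5, z=1.96):
    """Return (lower, upper, observed, passed) for the control-group fraction."""
    total = n_control + n_experiment
    # Standard error of a binomial proportion under the expected split.
    se = math.sqrt(expected_fraction * (1 - expected_fraction) / total)
    margin = z * se
    lower = expected_fraction - margin
    upper = expected_fraction + margin
    observed = n_control / total
    return lower, upper, observed, lower <= observed <= upper

# Assumed counts (not taken from these notes), giving a fraction of about 0.5104.
lower, upper, observed, passed = population_sizing_check(64454, 61818)
print(f"interval=[{lower:.4f}, {upper:.4f}] observed={observed:.4f} passed={passed}")
```

If `passed` is False, the groups are not comparable and the experiment results should not be interpreted until the cause is found.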

2. Single metric

After the sanity checks pass, we can start analyzing metrics. If the result is statistically significant, you can interpret it based on how you characterized the metric and the intuition you built from it.

Check the variability of the metric.

Effect size

Captured from the course

Assuming the sanity checks pass, we want to know whether the change is worth it for the business metric: does the experiment group get a higher CTR than the control group?

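The numbers of samples in the control and experiment groups are not the same but are fairly close, so we can assume the standard error is proportional to sqrt(1/N_control + 1/N_experiment). The standard error of the experiment is then the empirical standard error divided by the empirical scaling factor and multiplied by the experiment's own scaling factor, as in the above image.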
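As a rough sketch of this rescaling, assuming the standard error is proportional to sqrt(1/N_control + 1/N_experiment): the function name and all numbers below are hypothetical and are not the values from the course image.

```python
import math

def rescale_standard_error(se_empirical, n_emp_control, n_emp_experiment,
                           n_control, n_experiment):
    """Rescale an empirically estimated SE to the actual group sizes."""
    # Scaling factor at the sample sizes where the empirical SE was measured.
    emp_factor = math.sqrt(1 / n_emp_control + 1 / n_emp_experiment)
    # Scaling factor at the sample sizes of the experiment being analyzed.
    new_factor = math.sqrt(1 / n_control + 1 / n_experiment)
    return se_empirical * new_factor / emp_factor

# Hypothetical inputs: SE of 0.0062 measured with 5000 units per group,
# applied to an experiment with 7370 and 7270 units.
se = rescale_standard_error(0.0062, 5000, 5000, 7370, 7270)
margin = 1.96 * se  # 95% margin of error under a normal approximation
print(f"se={se:.4f} margin={margin:.4f}")
```

Larger groups shrink the standard error, so the same empirical estimate gives a narrower confidence interval.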

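The confidence interval is [0.0065, 0.0167]. It does not contain 0, so the difference in CTR between the experiment and control groups is statistically significant at the 95% confidence level. However, d_min (0.01) lies inside the interval, so we cannot be confident that the change is practically significant, that is, at least as large as the minimum change the business cares about. On this result alone, launching the change is not recommended.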
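The way a confidence interval is read against 0 and d_min can be sketched as follows. An interval that excludes 0 indicates a statistically significant difference; for a positive change, it is only practically significant with confidence if the whole interval sits at or above d_min. The function name is hypothetical, and the sketch assumes a positive change is the desired direction.

```python
def interpret_interval(lower, upper, d_min):
    """Classify a confidence interval for the difference (experiment - control)."""
    # Statistically significant: the interval does not contain 0.
    statistically_significant = lower > 0 or upper < 0
    # Practically significant (positive direction): entire interval >= d_min.
    practically_significant = lower >= d_min
    return statistically_significant, practically_significant

# Interval and d_min quoted in the text above.
stat_sig, practical_sig = interpret_interval(0.0065, 0.0167, d_min=0.01)
print(f"statistically significant: {stat_sig}, practically significant: {practical_sig}")
```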

Let’s double check with the result of a sign test.

Captured from the course

We look at the experiment results day by day. In the 14 days the experiment ran, the experiment group had a higher CTR than the control group on 9 days. A sign test calculator returns a two-tailed p-value of 0.4240, far above alpha (0.05). The p-value is the probability of seeing a difference at least this extreme by random chance alone if there were no real effect. The lower the p-value, the greater the statistical significance of the observed difference. Here p-value > alpha, so we fail to reject the null hypothesis: the sign test result is not statistically significant. It gives no additional support for the change, so the final recommendation is still not to launch.
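The sign test p-value can be reproduced without an online calculator. Under the null hypothesis, each day is equally likely to favor either group, so the number of "experiment wins" follows a Binomial(n, 0.5) distribution. A minimal sketch using only the standard library (the function name is my own):

```python
from math import comb

def sign_test_p_value(n_positive, n_days):
    """Two-tailed sign test p-value under Binomial(n_days, 0.5)."""
    def tail(k_from, k_to):
        # Probability that the number of positive days falls in [k_from, k_to].
        return sum(comb(n_days, k) for k in range(k_from, k_to + 1)) / 2 ** n_days

    upper = tail(n_positive, n_days)   # at least this many positive days
    lower = tail(0, n_positive)        # at most this many positive days
    return min(1.0, 2 * min(upper, lower))

p = sign_test_p_value(9, 14)
print(f"p-value = {p:.4f}")  # about 0.4240 for 9 positive days out of 14
```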

What if the effect size and sign test results contradict each other? The sign test has less power than the effect size test, which is frequently the case for non-parametric tests that make no assumption about the distribution. So if the effect size result is statistically significant and the sign test is not, it is not necessarily a red flag, but it is worth digging deeper, for example by analyzing segments such as weekday versus weekend behavior. A contradiction can also be a sign of Simpson’s paradox: there are several subgroups in your data, the result is stable within each subgroup, but when you aggregate them it is the mix of subgroups that actually drives the result.
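A small made-up example may help show how Simpson’s paradox arises. All numbers below are hypothetical and chosen only to produce the effect: the experiment has a higher CTR in every segment, yet a lower CTR overall, because the two groups have very different mixes of weekday and weekend traffic.

```python
# Hypothetical (clicks, pageviews) per segment; not data from the course.
data = {
    "control":    {"weekday": (100, 1000), "weekend": (1, 100)},
    "experiment": {"weekday": (11, 100),   "weekend": (20, 1000)},
}

def segment_ctrs(group):
    """CTR of each segment within one group."""
    return {seg: clicks / views for seg, (clicks, views) in data[group].items()}

def overall_ctr(group):
    """CTR of one group with all segments pooled together."""
    clicks = sum(c for c, _ in data[group].values())
    views = sum(v for _, v in data[group].values())
    return clicks / views

for group in data:
    print(group, segment_ctrs(group), f"overall={overall_ctr(group):.4f}")
```

Within each segment the experiment wins, but the pooled comparison is driven by the segment mix, which is why a per-segment breakdown is worth checking when results disagree.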

3. Multiple metrics

Multiple comparison problems come up, for example, with automated alerting, where we watch whether any metric suddenly behaves differently, or with an automated framework for exploratory data analysis, where you want to make sure that a difference really occurs and is repeatable.

The more metrics you test, the more likely you are to see a significant difference just by chance. This is a problem, but not a fatal one, because such differences should not be repeatable. If you run the same experiment on another day, divide the data into slices, or do some bootstrap analysis, you will not see the same metric showing up as significantly different every time; it should occur randomly. Another technique is a multiple comparisons correction, which adjusts your significance level to account for the number of metrics tested.

Captured from the course

The problem with tracking multiple metrics: the overall alpha increases

For example, suppose we measure 3 metrics, each at a 95% confidence level. If the 3 metrics are independent, the probability that none of them shows a false positive is 0.95^3 = 0.857, so the probability that at least 1 metric shows a false positive is 1 - 0.857 = 0.143. The more metrics we add, the higher the chance that at least 1 of them appears statistically significant purely by chance.
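The arithmetic above generalizes to any number of metrics. A minimal sketch, assuming the metrics are independent (the function name is my own):

```python
def overall_false_positive_rate(alpha, n_metrics):
    """Probability that at least one of n independent metrics is a false positive."""
    return 1 - (1 - alpha) ** n_metrics

for n in (1, 3, 10, 20):
    print(n, round(overall_false_positive_rate(0.05, n), 3))
```

With 3 metrics at alpha = 0.05 this gives about 0.143, matching the worked example, and the value keeps growing as metrics are added.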

How to tackle the accumulated probability of false positives?

Method 1: Use a higher confidence level for each metric.

Assuming independence, alpha_overall = 1 - (1 - alpha)^n. Increasing alpha increases alpha_overall, so a higher per-metric confidence level (a smaller alpha) brings alpha_overall back down.
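One way to pick the per-metric alpha for Method 1 is to invert the formula alpha_overall = 1 - (1 - alpha)^n. This inversion is sometimes called the Šidák correction, a name not used in these notes, and it relies on the same independence assumption. A short sketch with a hypothetical function name:

```python
def per_metric_alpha(alpha_overall, n_metrics):
    """Per-metric alpha that yields the desired overall alpha for independent metrics."""
    return 1 - (1 - alpha_overall) ** (1 / n_metrics)

alpha = per_metric_alpha(0.05, 3)
# Plugging the result back into the formula recovers the overall alpha.
print(f"per-metric alpha = {alpha:.5f}, overall = {1 - (1 - alpha) ** 3:.5f}")
```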

Method 2: Bonferroni correction:

  • Pros: Simple, no assumption
  • Cons: Conservative, because you will sometimes track metrics that are correlated and all tend to move at the same time

Choose the alpha_overall you want, then set alpha_individual = alpha_overall / number of metrics.
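As a sketch of how the Bonferroni correction is applied in practice: the metric names and p-values below are hypothetical, made up only to illustrate the comparison against alpha_individual.

```python
def bonferroni(p_values, alpha_overall=0.05):
    """Return alpha_individual and, per metric, whether it stays significant."""
    alpha_individual = alpha_overall / len(p_values)
    significant = {name: p <= alpha_individual for name, p in p_values.items()}
    return alpha_individual, significant

# Hypothetical p-values for three tracked metrics.
p_values = {"ctr": 0.004, "retention": 0.03, "revenue": 0.2}
alpha_individual, significant = bonferroni(p_values)
print(f"alpha_individual = {alpha_individual:.4f}")
print(significant)
```

Note that a metric with p = 0.03 would pass at alpha = 0.05 on its own but no longer passes once the threshold is divided by the number of metrics.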

Other methods, more complicated than the Bonferroni correction:

  • Control the probability that any metric shows a false positive (family-wise error rate)
  • Control the false discovery rate (FDR)
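For the FDR option, one widely used procedure is Benjamini-Hochberg. It is not named in these notes, so treat the following as an illustrative sketch of one possible choice rather than the method the course prescribes. The p-values are hypothetical.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return a list of booleans: True where the hypothesis is rejected."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose p-value is at most (k / m) * fdr.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            max_rank = rank
    # Reject every hypothesis ranked at or below that largest rank.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[idx] = True
    return rejected

# Hypothetical p-values for four metrics.
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.20]))
```

Because it controls the expected share of false discoveries rather than the chance of any false positive, this procedure is typically less conservative than Bonferroni when many metrics are tracked.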

Multiple metrics are an advantage when they confirm each other and a drawback when they contradict each other. As mentioned in lecture 3, a composite metric (OEC, overall evaluation criterion) can help balance these metrics.

Draw conclusion

After figuring out which metrics changed significantly and which did not, you have to decide what your results do and don’t tell you. If you have statistically significant results, the change is unlikely to have zero impact on user experience, and the questions become: a. Do you understand the change? b. Do you want to launch the change?


It is better to start an experiment with a small share of the population and scale up later. The risk is that when you add more samples and run more experiments later, the observed effect may flatten out.

Factors such as changes over time, novelty effects and learning effects can affect the result after the experiment. To deal with them, you can use cohort analysis to restrict the analysis to a particular subset of users.

In conclusion, always check your experiment carefully, and weigh the costs and benefits of launching.


