I am currently following the course A/B Testing by Google on Udacity. You can find the course link here. The course provides both theoretical and practical parts, so I highly recommend taking it yourself if you can spare the time. It is very useful but might be hard to follow if you prefer to learn by reading, which is why I have summarized it according to my own understanding. I hope these notes save you time and effort.
My summary generally follows the course's structure. Sometimes I provide extra links or take definitions from other sources to clarify further.
Part 1: Overview of A/B testing
A/B testing, or split testing, is a general methodology used online when you want to test a new product or feature. You take two sets of users, show one set the existing product/feature and the other set the new version, then compare how the two sets behave under their respective variants.
When is it useful and when is it not? In the analogy of a former CEO of Mozilla: A/B testing is useful for helping you climb to the peak of the current mountain, but if you want to figure out whether you should be on this mountain or another one, A/B testing is not so useful.
A common threat to A/B testing validity is the novelty effect: the tendency for performance to initially improve when new technology is instituted, not because of any actual improvement in learning or achievement, but in response to increased interest in the new technology (Wikipedia).
Examples of A/B testing:
Example 1: Audacity is an education website. The site is considering changing the button color on its homepage from orange to pink. In this specific case:
- Hypothesis: Changing the button color from orange to pink will increase the number of students exploring Audacity courses.
- Metric choice: Click through probability (CTP) = unique visitors who click/unique visitors to page
Note: The metric should be practical (it should not take too long to measure) and suitable for answering the hypothesis.
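To make the metric in Example 1 concrete, here is a minimal Python sketch of the click-through probability. The function name and the input format (two lists of visitor IDs, one entry per page view or click) are my own assumptions for illustration, not something the course prescribes.

```python
def click_through_probability(page_views, clicks):
    """Return unique visitors who clicked / unique visitors to the page.

    page_views, clicks: iterables of visitor IDs (hypothetical input
    format); the same ID may appear several times in either one.
    """
    visitors = set(page_views)
    if not visitors:
        raise ValueError("no visitors recorded")
    # Count each visitor at most once, and only if they viewed the page.
    clickers = set(clicks) & visitors
    return len(clickers) / len(visitors)


# Toy data: 4 unique visitors, 2 of whom clicked at least once.
views = ["u1", "u2", "u2", "u3", "u4"]
clicks = ["u2", "u2", "u4"]
ctp = click_through_probability(views, clicks)
print(ctp)  # 0.5
```

Because visitors are deduplicated, a user who reloads the page or double-clicks the button is still counted only once.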
Example 2: An extreme example from a Google team. They could not decide between two blues, so they tested 41 shades between them, showing each shade to 1% of their users to see which one performed better.
2. Statistics review
Many A/B tests examine binary outcomes (like buy/don't buy) and typically use the binomial distribution for statistical inference, so it is worth reviewing some statistics first.
The binomial distribution is a probability distribution that summarizes the likelihood that a value will take one of two possible values under a given set of parameters or assumptions. Its three underlying assumptions are: two types of outcome (mutually exclusive), independent events (one outcome does not affect another) and identical distribution (the same probability of success for every event).
Properties of binomial distribution:
- Mean = p (probability of success), for the proportion of successes X/N
- Std dev = sqrt(p*(1-p)/N), for the proportion of successes X/N
- Confidence interval (CI): (95% is a common choice.) If we repeated the experiment over and over, we would expect the intervals we construct around the sample mean to cover the true population value 95% of the time.
As a rule of thumb, the normal approximation can be used if N*p̂ > 5 and N*(1-p̂) > 5.
p̂ (the estimated probability) = X/N (X: # users who clicked, N: # users); it is the center of the CI
m (the margin of error) = z*SE = z*sqrt(p̂*(1-p̂)/N)
The z-score depends on the confidence level (for 95%, z = 1.96). Check a z-score table or simply google it.
CI = [p̂-m, p̂+m]
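The confidence interval steps above can be sketched in Python as follows. This is only an illustration using the standard library; the function name and the sample numbers (100 clicks out of 1000 users) are my own assumptions.

```python
import math
from statistics import NormalDist


def proportion_confidence_interval(x, n, confidence=0.95):
    """Normal-approximation CI for a binomial proportion.

    x: number of users who clicked, n: number of users.
    """
    p_hat = x / n
    # Rule of thumb for using the normal approximation.
    if n * p_hat <= 5 or n * (1 - p_hat) <= 5:
        raise ValueError("sample too small for the normal approximation")
    # Two-sided z-score, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    m = z * se  # margin of error
    return p_hat - m, p_hat + m


low, high = proportion_confidence_interval(x=100, n=1000)
print(f"95% CI: [{low:.3f}, {high:.3f}]")  # 95% CI: [0.081, 0.119]
```

With p̂ = 0.1 the margin of error is about 0.019, so anything from roughly 8.1% to 11.9% is consistent with the data at the 95% level.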
- Statistical significance: A result is statistically significant when it is unlikely to have occurred by chance.
- Hypothesis testing: Null & alternative hypothesis
Null hypothesis (H0): p_exp - p_cont = 0
Alternative hypothesis (H1): p_exp - p_cont ≠ 0
When comparing two samples, we need a standard error that gives a good comparison of both, so we calculate the pooled standard error.
p̂_pool = (X_cont + X_exp) / (N_cont + N_exp)
SE_pool = sqrt(p̂_pool*(1-p̂_pool)*(1/N_cont + 1/N_exp))
d̂ (estimated difference) = p̂_exp - p̂_cont
Under H0, d = 0 and d̂ is approximately normally distributed: d̂ ~ N(0, SE_pool)
If d̂ > 1.96*SE_pool or d̂ < -1.96*SE_pool, we can reject the null hypothesis at the 95% confidence level: the difference between the two samples is statistically significant, and we could launch. Otherwise, we cannot reject the null hypothesis.
If d_min (the smallest difference that matters in practice, i.e. the practical significance level) falls inside the confidence interval for d̂, so that only part of the interval lies above d_min, we cannot be sure the change is practically significant and might need an additional test. In practice, we might not have enough time and resources to run another test. We then need to communicate with the decision makers and sometimes take risks, because the data is uncertain. Decision makers might weigh other factors besides the data, such as strategic business issues.
Note that an experiment result might be statistically significant but not practically significant. That is why the decision to launch also needs to be checked from a business standpoint.
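The pooled comparison and the check against d_min can be sketched in Python like this. It is a rough illustration under my own assumptions: the function name, the returned dictionary and the sample counts are made up for the example, and the interval around d̂ is built with the pooled standard error, as in the formulas above.

```python
import math
from statistics import NormalDist


def compare_proportions(x_cont, n_cont, x_exp, n_exp, d_min, confidence=0.95):
    """Two-sample comparison of click-through probabilities."""
    p_pool = (x_cont + x_exp) / (n_cont + n_exp)
    se_pool = math.sqrt(p_pool * (1 - p_pool) * (1 / n_cont + 1 / n_exp))
    d_hat = x_exp / n_exp - x_cont / n_cont
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    m = z * se_pool  # margin of error around d_hat
    lower, upper = d_hat - m, d_hat + m
    return {
        "d_hat": d_hat,
        "ci": (lower, upper),
        # Reject H0 when d_hat is further than z * SE_pool from zero.
        "statistically_significant": abs(d_hat) > m,
        # Require the whole interval to lie above d_min.
        "practically_significant": lower > d_min,
    }


# Hypothetical counts: 974 of 10072 control users clicked,
# 1242 of 9886 experiment users clicked, and d_min is 2 points.
result = compare_proportions(974, 10072, 1242, 9886, d_min=0.02)
print(result)
```

For these numbers d̂ is about 0.029 and the lower end of the interval sits just above 0.02, so the change is both statistically and practically significant. If the lower bound had fallen below d_min, that would be the "might need an additional test" situation described above.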
- Size and power trade-off:
Before running the experiment, we need to decide how many samples to collect. The larger the sample size, the higher the statistical power.
Alpha (significance level) = P(reject null | null true) = 1 - confidence level: the conditional probability of rejecting the null hypothesis when it is actually true. Often choose alpha = 5%.
Beta = P(fail to reject null | null false): the conditional probability of failing to reject the null hypothesis when it is actually false.
1-beta: sensitivity or statistical power. Often choose 1-beta = 80%.
Small sample: alpha low, beta high
Large sample: alpha unchanged, beta low
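To see the size and power trade-off in numbers, here is a hedged Python sketch that approximates the power of a two-sided test on two proportions with equal group sizes, then searches for the smallest group size that reaches a target power. The function names, the baseline rate of 10% and the 2-point minimum detectable difference are my own assumptions; the formula is the usual normal approximation, not necessarily what the course's calculator uses.

```python
import math
from statistics import NormalDist


def approximate_power(p_base, d_min, n, alpha=0.05):
    """Approximate power (1 - beta) with n users in each group.

    p_base: baseline probability, d_min: difference we want to detect.
    The negligible probability of rejecting in the wrong direction
    is ignored.
    """
    std_normal = NormalDist()
    z = std_normal.inv_cdf(1 - alpha / 2)
    p_new = p_base + d_min
    # Standard error of the difference if the null is true ...
    se_null = math.sqrt(2 * p_base * (1 - p_base) / n)
    # ... and if the true difference is d_min.
    se_alt = math.sqrt((p_base * (1 - p_base) + p_new * (1 - p_new)) / n)
    beta = std_normal.cdf((z * se_null - d_min) / se_alt)
    return 1 - beta


def required_sample_size(p_base, d_min, alpha=0.05, target_power=0.8):
    """Smallest n per group whose approximate power meets the target."""
    n = 1
    while approximate_power(p_base, d_min, n, alpha) < target_power:
        n += 1
    return n


for n in (500, 2000, 5000):
    print(n, round(approximate_power(0.10, 0.02, n), 2))
n_needed = required_sample_size(0.10, 0.02)
print("users needed per group:", n_needed)
```

Alpha stays at 5% in every row because we fix it in advance; only beta shrinks as n grows, which is the pattern in the small sample versus large sample lines above.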
Part 2: Policy & Ethics
Experiments involve real people, so it is important to protect them and follow ethical principles. When designing and running an experiment, we should consider 4 principles:
- Risk: What risk is the participant being exposed to?
- Benefit: What benefit might be the outcome of the study?
- Choice: What other choices do participants have?
- Privacy: What expectation of privacy and confidentiality do participants have?
These are also the principles that an Institutional Review Board (IRB) looks for. An IRB reviews proposed experiments and ensures that participants are adequately protected. Not every case needs to be subject to IRB review, but at the very least all proposed studies should be reviewed internally by experts, with regard to these questions:
- Are participants facing more than minimal risk?
- Do participants understand what data is being gathered?
- Is that data identifiable?
- How is the data handled?
If enough flags are raised, an external review should take place.