Dieu_HOA Nguyen

Jun 28, 2021


Summary of A/B test course by Google on Udacity (Part 1)

Currently, I am following the course A/B testing by Google on Udacity. You can find the course link here. The course provides both theoretical and practical parts, so I highly recommend taking it yourself if you can spare the time. The course is very useful but might be hard to follow if you prefer to learn by reading. That is why I summarize the course according to my understanding. I hope my notes will save you time and effort.

My summary generally follows the course’s structure. Sometimes I provide an extra link or take a definition from another source to clarify further.

Part 1: Overview of A/B testing

1. Overview

A/B testing, or split testing, is a general methodology used online when you want to test a new product or feature. You take 2 sets of users, show one set the existing product/feature and the other set the new version, then compare how the two sets behave.

When is it useful and when is it not? To borrow an analogy from a previous CEO of Mozilla: A/B testing is useful for helping you climb to the peak of the current mountain, but if you want to figure out whether you should be on this mountain or another one, A/B testing is not so useful.

A common threat to A/B test validity is the novelty effect. The novelty effect is the tendency for performance to initially improve when new technology is instituted, not because of any actual improvement in learning or achievement, but in response to increased interest in the new technology (Wikipedia).

Examples of A/B testing:

Example 1: Audacity is an education website. The site is considering changing the button color on the homepage from orange to pink. In this specific case:

  • Metric choice: Click through probability (CTP) = unique visitors who click/unique visitors to page

Note: Metrics should be practical (not taking too long to measure) and suitable for answering the hypothesis.
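To make the CTP definition concrete, here is a minimal Python sketch. The event format (pairs of visitor id and action) and the function name `click_through_probability` are assumptions made for illustration; they do not come from the course. The point it shows is that a visitor who clicks several times is still counted only once.

```python
def click_through_probability(events):
    """Return unique visitors who clicked / unique visitors to the page.

    `events` is a hypothetical log: an iterable of (visitor_id, action)
    pairs, where action is either "view" or "click".
    """
    visitors = set()
    clickers = set()
    for visitor_id, action in events:
        visitors.add(visitor_id)          # every event implies a visit
        if action == "click":
            clickers.add(visitor_id)      # repeated clicks count once
    if not visitors:
        raise ValueError("no visitors recorded")
    return len(clickers) / len(visitors)


# Made-up log: 4 unique visitors, 2 of them clicked (u1 clicked twice).
events = [
    ("u1", "view"), ("u1", "click"), ("u1", "click"),
    ("u2", "view"),
    ("u3", "view"), ("u3", "click"),
    ("u4", "view"),
]
ctp = click_through_probability(events)
print(ctp)
```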

Example 2: An extreme example from a Google team. They could not decide between two blues, so they tested 41 shades between the two, showing each shade to 1% of their users to see which one performed better.

2. Statistics reviews

Many A/B tests examine binary outcomes (like buy/don’t buy) and typically use the binomial distribution for statistical inference, so it is worth reviewing some statistics.

The binomial distribution is a probability distribution that describes the number of successes in a fixed number of trials when each trial has only two possible outcomes. Its three underlying assumptions are: 2 types of outcomes (mutually exclusive), independent events & identical distribution (same probability for every event).

Binomial distribution curve. Source: from course's video
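As a rough illustration of the curve, the sketch below computes the binomial probabilities directly with Python's standard library. The values N = 20 and p = 0.3 are made up for the example, and `binomial_pmf` is a hypothetical helper name. It also checks numerically that the fraction of successes is centered on p with a spread of sqrt(p*(1-p)/N).

```python
from math import comb, sqrt


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n, p = 20, 0.3  # illustrative values only
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

# Mean and standard deviation of the *fraction* of successes k/n.
mean_fraction = sum((k / n) * pmf[k] for k in range(n + 1))
var_fraction = sum(((k / n) - mean_fraction) ** 2 * pmf[k] for k in range(n + 1))
sd_fraction = sqrt(var_fraction)

print(round(mean_fraction, 4), round(sd_fraction, 4))
print(round(sqrt(p * (1 - p) / n), 4))  # closed-form value for comparison
```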

Properties of binomial distribution:

  • Std dev of the estimated proportion (standard error) = sqrt(p*(1-p)/N)
  • Confidence interval (CI), often chosen at 95%: if we repeated the experiment over and over again, we would expect the intervals we construct around our sample means to cover the true population value 95% of the time.

As a rule of thumb, the normal approximation can be used when N*p̂ > 5 and N*(1 - p̂) > 5.

p̂ (estimated probability) = X/N, where X is the number of users who clicked and N is the number of users. p̂ is the center of the CI.

m (margin of error) = z * SE = z * sqrt(p̂*(1 - p̂)/N)

The z-score depends on the confidence level (for 95%, z = 1.96). Check a z-score table or simply google it.

CI = [p̂ - m, p̂ + m]
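The steps above can be sketched in a few lines of Python. This is a minimal sketch, not the course's own code: the function name `proportion_confidence_interval` and the sample counts (100 clicks out of 1000 users) are assumptions chosen for illustration. The z-score is taken from the standard library's `statistics.NormalDist` instead of a lookup table.

```python
from math import sqrt
from statistics import NormalDist


def proportion_confidence_interval(x, n, confidence=0.95):
    """Normal-approximation CI for a proportion: [p_hat - m, p_hat + m]."""
    p_hat = x / n
    # Rule of thumb from the text: both expected counts should exceed 5.
    if n * p_hat <= 5 or n * (1 - p_hat) <= 5:
        raise ValueError("sample too small for the normal approximation")
    # Two-sided z-score, about 1.96 for a 95% confidence level.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    se = sqrt(p_hat * (1 - p_hat) / n)
    m = z * se  # margin of error
    return p_hat - m, p_hat + m


# Made-up example: 100 users clicked out of 1000.
low, high = proportion_confidence_interval(100, 1000)
print(round(low, 4), round(high, 4))
```

With these made-up counts the interval is roughly 0.081 to 0.119, so a later measurement outside that range would be surprising if nothing had changed.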

  • Hypothesis testing: Null & alternative hypothesis

Null hypothesis (H0): p_exp - p_cont = 0

Alternative hypothesis (H1): p_exp - p_cont ≠ 0

When comparing two samples, we need a standard error that gives a fair comparison of both, so we calculate the pooled standard error.

p̂_pool = (X_cont + X_exp) / (N_cont + N_exp)

SE_pool = sqrt(p̂_pool*(1 - p̂_pool)*(1/N_cont + 1/N_exp))

d̂ (estimated difference) = p̂_exp - p̂_cont

Under H0, d = 0 and d̂ is approximately normally distributed: d̂ ~ N(0, SE_pool)

If d̂ > 1.96*SE_pool or d̂ < -1.96*SE_pool, we can reject the null hypothesis at the 95% confidence level: the difference between the two samples is statistically significant, and we could consider launching. Otherwise we cannot reject the null hypothesis.
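A minimal Python sketch of this comparison follows. The function name `pooled_difference_test` and the click and user counts are illustrative assumptions, not figures quoted in this summary. It computes the pooled probability, the pooled standard error and the estimated difference, then applies the 1.96 cutoff described above.

```python
from math import sqrt


def pooled_difference_test(x_cont, n_cont, x_exp, n_exp, z=1.96):
    """Return (d_hat, se_pool, reject_null) for two click-through samples."""
    p_pool = (x_cont + x_exp) / (n_cont + n_exp)
    se_pool = sqrt(p_pool * (1 - p_pool) * (1 / n_cont + 1 / n_exp))
    d_hat = x_exp / n_exp - x_cont / n_cont
    # Reject H0 when d_hat falls outside [-z * se_pool, +z * se_pool].
    reject_null = abs(d_hat) > z * se_pool
    return d_hat, se_pool, reject_null


# Illustrative counts: control 974 clicks / 10072 users,
# experiment 1242 clicks / 9886 users.
d_hat, se_pool, reject_null = pooled_difference_test(974, 10072, 1242, 9886)
print(round(d_hat, 4), round(se_pool, 5), reject_null)
```

Here the estimated difference (about 0.029) is far larger than 1.96 times the pooled standard error (about 0.0087), so the null hypothesis is rejected for these counts.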

If the confidence interval for the difference only partly lies above d_min (the smallest difference that is practically significant), we cannot be sure the change is large enough, and we might need an additional test. In practice, we might not have enough time and resources to run another test. We then need to communicate with the decision makers and sometimes take risks, because the data is uncertain. Decision makers might weigh other factors besides the data, such as strategic business issues.

Note that the experiment result might be statistically significant but not practically significant. That is why a launch decision also needs to be checked from a business standpoint.
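One possible way to express this decision rule in Python is sketched below. The function name `launch_recommendation`, the three labels it returns and the example numbers are my own assumptions; the real decision still involves the business judgment described above, so treat this only as a starting point for the conversation.

```python
def launch_recommendation(d_hat, margin, d_min):
    """Compare the CI [d_hat - margin, d_hat + margin] with d_min.

    d_min is the practical significance boundary: the smallest
    difference that is worth launching for.
    """
    low, high = d_hat - margin, d_hat + margin
    if low > d_min:
        return "launch"          # whole interval is above d_min
    if high < d_min:
        return "do not launch"   # even the best case is below d_min
    return "inconclusive"        # interval straddles d_min


# Made-up numbers for illustration.
print(launch_recommendation(0.0289, 0.0087, 0.02))   # launch
print(launch_recommendation(0.0289, 0.0087, 0.025))  # inconclusive
# Statistically significant (interval excludes 0) but too small to matter:
print(launch_recommendation(0.005, 0.004, 0.02))     # do not launch
```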

Practical example:

Before running an experiment, we need to decide how many samples to collect. The larger the sample size, the higher the statistical power.

Alpha, or significance level, is the conditional probability of rejecting the null hypothesis when it is actually true. Beta is the conditional probability of failing to reject the null hypothesis when it is actually false. 1 - beta is the sensitivity, or statistical power.

Alpha or significance level: P(reject null | null true) = 1 - confidence level. Alpha = 5% is a common choice.

Beta: P(fail to reject null | null false)

1 - beta: sensitivity or statistical power. 1 - beta = 80% is a common choice.

Small sample: alpha low, beta high

Large sample: alpha stays the same, beta low
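To get a feel for how alpha, power and sample size interact, here is a hedged Python sketch. It uses a common textbook approximation for comparing two proportions, which is not necessarily the formula behind the calculator used in the course. The function name `sample_size_per_group` and the baseline rate of 10% with d_min = 2% are assumptions for illustration.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_group(p_baseline, d_min, alpha=0.05, power=0.80):
    """Approximate users needed per group to detect a lift of d_min.

    Textbook normal approximation for a two-sided test on two proportions.
    """
    p_new = p_baseline + d_min
    if d_min <= 0 or not 0 < p_baseline < 1 or not 0 < p_new < 1:
        raise ValueError("need 0 < p_baseline < p_baseline + d_min < 1")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 5%
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for power = 80%
    sd_null = sqrt(2 * p_baseline * (1 - p_baseline))
    sd_alt = sqrt(p_baseline * (1 - p_baseline) + p_new * (1 - p_new))
    return ceil(((z_alpha * sd_null + z_beta * sd_alt) / d_min) ** 2)


n_80 = sample_size_per_group(0.10, 0.02)              # 80% power
n_90 = sample_size_per_group(0.10, 0.02, power=0.90)  # lower beta
n_small_effect = sample_size_per_group(0.10, 0.01)    # smaller d_min
print(n_80, n_90, n_small_effect)
```

Asking for higher power (lower beta) or a smaller detectable difference both push the required sample size up, while alpha stays at whatever level we chose.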

Part 2: Policy & ethics

Experiments involve real people, so it is important to protect participants and follow ethical standards. When designing and running an experiment, we should consider 4 principles:

  • Risk: What risk is the participant being exposed to?
  • Benefit: What benefit might come out of the study?
  • Choice: What other choices do participants have?
  • Privacy: What expectation of privacy and confidentiality do participants have?

These are also the principles that an IRB (Institutional Review Board) looks for. An IRB reviews proposed experiments and ensures that participants are adequately protected. Not every case needs to go through IRB review, but we should at least have all proposed studies reviewed internally by experts, with questions such as:

  • Do participants understand what data is being gathered?
  • Is that data identifiable?
  • How is the data handled?

If enough flags are raised, an external review should happen.