Choosing & Characterizing Metrics

Summary of A/B test course by Google on Udacity (Part 2)

In the previous post, I summarized the first 2 lectures of the A/B testing course by Google on Udacity. Those lectures review the statistics background, specifically the binomial distribution, and point out that policy and ethical concerns should be considered when designing and running an experiment. In this post, I continue with the 3rd lecture, which is about choosing and characterizing metrics.

There are 3 steps in choosing & characterizing metrics: define, build intuition, and characterize.

1. Define

There are 2 kinds of metrics: invariant metrics and evaluation metrics.

1.1. Invariant or sanity-checking metrics: Metrics that should not change between the experiment and control groups. We can use multiple invariant metrics.
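As a rough illustration of what a sanity check on an invariant metric might look like, the sketch below tests whether the number of users in each group is consistent with an intended 50/50 split, using the normal approximation to the binomial distribution. The function name, the 50/50 assumption, and the counts are made up for illustration and do not come from the course.

```python
import math

def invariant_count_check(n_control, n_experiment, p_expected=0.5, z=1.96):
    """Check whether the share of units in control is consistent with the expected split.

    Assumes each unit is assigned independently with probability p_expected
    (hypothetical setup) and uses a 95% interval by default (z = 1.96).
    """
    n_total = n_control + n_experiment
    # Standard error of a binomial proportion under the expected split.
    se = math.sqrt(p_expected * (1 - p_expected) / n_total)
    lower = p_expected - z * se
    upper = p_expected + z * se
    observed = n_control / n_total
    return observed, (lower, upper), lower <= observed <= upper

# Made-up counts for illustration only.
observed, bounds, passed = invariant_count_check(n_control=5020, n_experiment=4980)
print(f"observed share in control: {observed:.4f}")
print(f"expected range: {bounds[0]:.4f} to {bounds[1]:.4f}")
print("sanity check passed" if passed else "sanity check failed")
```

If the observed share falls outside the expected range, the invariant metric has moved and the experiment setup is worth investigating before looking at any evaluation metric.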

1.2. Evaluation: These are the metrics we expect to change. They are usually business metrics, such as market share or number of users, or user experience metrics. There can be one or several, depending on the company’s culture. If you have multiple metrics, you can combine them into one composite metric, the overall evaluation criterion (OEC). The term OEC comes from Microsoft and refers to a weighted function that combines different metrics (Paper detail). Sometimes a metric is difficult because we do not have access to the data or because it takes too long to collect. Two examples of difficult metrics: whether a student lands a job after completing an online course, and the average happiness of shoppers. Difficult metrics require special techniques such as:

  • External data: can also be used to validate metrics
  • User experience research (UER): good for brainstorming, can use special equipment, but cannot validate results
  • Focus group: get feedback on hypotheticals, but runs the risk of groupthink
  • Survey: useful for metrics you cannot directly measure, but results cannot be directly compared to other results and there is a risk of bias
  • Retrospective analysis: use your historical data. It can reveal correlations; to establish causation, you need to run experiments.

2. Build intuition about our data: To decide whether to filter and segment the data.

Some common distributions in real user data: Poisson, exponential, power-law, etc. Sometimes it is hard to identify the distribution. The key is not necessarily to find a matching distribution when the answer is not clear (although that can be helpful), but to choose summary statistics that make the most sense for the data you do have. If the distribution is lopsided with a very long tail, the mean probably will not work well, and for something like the Pareto distribution the mean may even be infinite.
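To make the long-tail point concrete, here is a small sketch that draws simulated values from a Pareto distribution and compares the mean with the median. The shape parameter 1.1, the sample size, and the seed are arbitrary choices for illustration. For a shape parameter of 1 or below, the theoretical mean is infinite.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Simulated heavy-tailed data (illustrative only): Pareto with shape 1.1.
samples = [random.paretovariate(1.1) for _ in range(100_000)]

mean_value = statistics.mean(samples)
median_value = statistics.median(samples)
largest = max(samples)

print(f"mean:   {mean_value:.2f}")
print(f"median: {median_value:.2f}")
print(f"max:    {largest:.2f}")
```

The mean is pulled upward by a handful of extreme values, while the median stays close to the typical value, which is why a median or another percentile is often a safer summary for this kind of data.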

Four simple categories of summary metrics:

  • Sums and counts
  • Means, medians and percentiles
  • Probabilities and rates
  • Ratios

When we choose summary metrics, we need to think about the sensitivity and robustness of each metric. The idea is to choose a metric that picks up the changes you care about and is robust against the changes you don’t care about. There are two ways to measure sensitivity and robustness: run simple experiments or do a retrospective analysis.
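The sketch below is one way to explore sensitivity and robustness on simulated data. It assumes a hypothetical load-time metric, a change we care about (the slowest 20% of loads become 50% slower), and a change we do not care about (five absurd values, for example from a logging glitch). All numbers and names are invented for illustration.

```python
import math
import random
import statistics

def percentile(values, q):
    """Return the q-th percentile (0-100) using the nearest-rank method."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]

def summarize(values):
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "p90": percentile(values, 90),
    }

def relative_shift(before, after):
    """Relative change of each summary metric compared with the baseline."""
    return {name: (after[name] - before[name]) / before[name] for name in before}

random.seed(1)
# Simulated baseline load times (illustrative lognormal data).
baseline = [random.lognormvariate(0, 0.5) for _ in range(10_000)]

# Change we care about: the slowest 20% of loads become 50% slower.
cutoff = percentile(baseline, 80)
regression = [t * 1.5 if t > cutoff else t for t in baseline]

# Change we do not care about: five absurd values added to the data.
glitch = baseline + [1000.0] * 5

base = summarize(baseline)
shift_regression = relative_shift(base, summarize(regression))
shift_glitch = relative_shift(base, summarize(glitch))

for name in base:
    print(f"{name:>6}: regression {shift_regression[name]:+.1%}, "
          f"glitch {shift_glitch[name]:+.1%}")
```

In this setup the median ignores the regression (robust but not sensitive), the mean reacts strongly to the five glitch values (sensitive but not robust), and the 90th percentile picks up the regression while barely moving for the glitch. Which metric is right still depends on which changes matter for the product.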

3. Characterize: Measure of spread.

We want to know the range of conditions under which a metric can be used, which requires knowing its variability, or measure of spread.

There are 2 ways to estimate variability. The first assumes an underlying distribution for the data, so the confidence interval can be computed analytically. When the distribution is unusual or the sample size is too small (less than 30), we can use the second kind: an empirical, non-parametric estimate. A non-parametric method analyzes data without making assumptions about the distribution. But often we cannot use the empirical estimate because it underestimates the change. One alternative is A/A testing, which uses the same technique as A/B testing to compare 2 identical versions against each other.
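For the analytical case, a minimal sketch for a probability metric such as click-through probability is shown below. It assumes the binomial distribution from the previous post and the normal approximation, which needs a reasonably large sample. The function name and the counts are illustrative.

```python
import math

def binomial_confidence_interval(successes, trials, z=1.96):
    """Analytical confidence interval for a binomial probability.

    Uses the normal approximation; z = 1.96 corresponds to 95% confidence.
    """
    p_hat = successes / trials
    standard_error = math.sqrt(p_hat * (1 - p_hat) / trials)
    margin = z * standard_error
    return p_hat - margin, p_hat + margin

# Made-up example: 100 clicks out of 1000 page views.
low, high = binomial_confidence_interval(successes=100, trials=1000)
print(f"95% confidence interval: {low:.4f} to {high:.4f}")
```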

A/A tests are used to:

  • Compare results to what you expect (sanity check)
  • Estimate variance and calculate confidence intervals
  • Directly estimate the confidence interval when no assumption is made about the distribution

It is better to run multiple A/A experiments, so A/A testing usually requires a large sample size. If that is not possible for some reason, we can run one big A/A test and then use the bootstrap: take the one big sample, randomly chunk it up into small samples, and compare the random subsets against each other. The bootstrap is not always usable: if its estimate does not agree with the analytical variance, we have to run a bigger experiment instead.
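The resampling idea described above can be sketched roughly as follows. The code simulates one big A/A sample of click outcomes, repeatedly draws two disjoint random groups from it, and compares the spread of the observed differences with the analytical standard deviation. The 10% click rate, the group size, the number of splits, and the function name are all assumptions made for illustration.

```python
import math
import random
import statistics

random.seed(42)

# Hypothetical "one big A/A sample": 1 = click, 0 = no click (assumed 10% rate).
big_sample = [1 if random.random() < 0.10 else 0 for _ in range(20_000)]

def aa_differences(data, group_size, n_splits):
    """Repeatedly draw two disjoint random groups and record the difference in their rates."""
    diffs = []
    for _ in range(n_splits):
        drawn = random.sample(data, 2 * group_size)
        rate_a = sum(drawn[:group_size]) / group_size
        rate_b = sum(drawn[group_size:]) / group_size
        diffs.append(rate_b - rate_a)
    return diffs

group_size = 1_000
diffs = aa_differences(big_sample, group_size, n_splits=500)

# Empirical spread of the A/A differences and an empirical 95% interval.
empirical_sd = statistics.stdev(diffs)
ordered = sorted(diffs)
low = ordered[int(0.025 * len(ordered))]
high = ordered[int(0.975 * len(ordered)) - 1]

# Analytical standard deviation of the difference between two binomial rates.
p_hat = sum(big_sample) / len(big_sample)
analytical_sd = math.sqrt(2 * p_hat * (1 - p_hat) / group_size)

print(f"empirical sd:  {empirical_sd:.4f}")
print(f"analytical sd: {analytical_sd:.4f}")
print(f"empirical 95% interval for the difference: {low:.4f} to {high:.4f}")
```

For a simple click-through probability like this one, the two estimates should be close. A large gap between them would be the signal mentioned above that the resampled estimate should not be trusted on its own.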