'''Name:''' Sample size calculation for A/B test<br>
'''Author:''' Vaso Dzhinchvelashvili<br>
'''Method:''' Monte Carlo<br>
'''Tool:''' Python<br><br>
=Some theory=

An A/B test is an experiment that lets you see how a feature influenced performance (some target metric).

α (alpha) is the probability of a Type I error in a hypothesis test: incorrectly rejecting the null hypothesis.

β (beta) is the probability of a Type II error in a hypothesis test: incorrectly failing to reject the null hypothesis. (1 − β is the power of the test.)

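To make the relationship between α, β and sample size concrete, the classical normal-approximation formula for comparing two means can be written as a short Python function. This is only an illustrative sketch with made-up numbers; it is not part of the calculator itself.

<syntaxhighlight lang="python">
from scipy.stats import norm

def sample_size_per_group(sigma, delta, alpha=0.05, beta=0.20):
    """Normal-approximation sample size per group for a two-sample test of means.

    sigma: standard deviation of the metric (assumed equal in both groups)
    delta: smallest difference in means we want to detect
    alpha: Type I error rate; beta: Type II error rate (power = 1 - beta)
    """
    z_alpha = norm.ppf(1 - alpha / 2)  # two-sided test
    z_beta = norm.ppf(1 - beta)
    return 2 * sigma ** 2 * (z_alpha + z_beta) ** 2 / delta ** 2

# Example: metric with standard deviation 10, detect a difference of 1 with 80% power
print(round(sample_size_per_group(sigma=10, delta=1)))  # about 1570 users per group
</syntaxhighlight>
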
=Problem definition=

Imagine you are an analyst facing a real problem: you need to understand how many observations you need and how long you should run an experiment.

==Limitations==

Normally, many experiments run within the same company at the same time. To keep the experiments clean, their audiences should not overlap (a user cannot participate in two experiments at once). So the longer one experiment runs, the more other experiments get postponed, and product development is stopped or slowed down.

For that reason I hardcoded 3 months as the maximum length of an experiment; this also lowers the chance that the calculation takes an unreasonably long time.

Let's assume there is an 'old' feature and a 'new' feature developed to replace it. The company needs to decide which feature should be kept and which one should be sunsetted (excluded from the product). The two features cannot coexist outside the experiment.

==Goal==

Your goal (as the analyst using the calculator) is to understand whether the new feature increases or decreases the target metric, and you want that conclusion to be statistically significant.

==What you have==

In a real-life situation you could have the following inputs:

1. Historical information on how users performed with the old feature: the expectation (mean) of the metric and the variation (variance) of the metric.

2. How many users access the feature monthly.

3. Some assumptions: you expect the new feature to increase or decrease the metric by 0-x% (in either direction). Of course you would like the metric to skyrocket (+10000%), but in reality you do not expect more than, say, a 20% rise. As mentioned, you cannot run an experiment longer than 3 months. You could, of course, but for the sake of evaluating this work in a reasonable time (the calculation takes a while) I hardcoded the maximum length; it can be changed in the code.

4. Lastly, there is the risk you are willing to accept of being wrong when you decide whether a difference was random or not random (i.e. caused by the new feature), normally 5%.

These parameters should be entered in the UI (explained below); a hypothetical example of such inputs is sketched right after this paragraph.

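For concreteness, this is what one set of inputs could look like. Every number below is made up purely for illustration; none of them come from the original calculator.

<syntaxhighlight lang="python">
# Hypothetical example of the inputs listed above (all values are made up)
inputs = {
    "historical_mean": 100.0,      # expectation of the metric for the old feature
    "historical_variance": 900.0,  # variation of the metric (standard deviation 30)
    "monthly_users": 50_000,       # users who access the feature each month
    "max_expected_effect": 0.20,   # the new feature may move the metric by up to +/-20%
    "max_duration_months": 3,      # hardcoded maximum experiment length
    "alpha": 0.05,                 # acceptable chance of calling a random difference real
}
</syntaxhighlight>
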
=Model=

The calculator uses the Monte Carlo method to estimate the chance of seeing a non-random difference in the sample means.

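The notebook contains the actual implementation; the sketch below only illustrates the general Monte Carlo idea, under the assumption of a normally distributed metric compared with a two-sample t-test, and with made-up parameter values.

<syntaxhighlight lang="python">
import numpy as np
from scipy import stats

def mc_power(n, mu, sigma, uplift, alpha=0.05, n_sims=5000, seed=42):
    """Estimate power by simulation: the share of simulated experiments in which
    a two-sample t-test detects the assumed uplift at significance level alpha."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_sims):
        control = rng.normal(mu, sigma, n)                    # users on the 'old' feature
        treatment = rng.normal(mu * (1 + uplift), sigma, n)   # users on the 'new' feature
        if stats.ttest_ind(control, treatment).pvalue < alpha:
            hits += 1
    return hits / n_sims

# Example with made-up numbers: mean 100, standard deviation 30, expected uplift 5%
print(mc_power(n=1000, mu=100, sigma=30, uplift=0.05))
</syntaxhighlight>
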
In addition, the script uses two different formulas and one Python function to calculate the sample size needed to reach a given confidence level. (As it turned out, all of them are tuned for 80% power; however, from the literature review it is not obvious where β enters, because the Z/T and other statistics contain only α in their formulas.)

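The text does not say which Python function is meant, so purely as an example: statsmodels provides TTestIndPower.solve_power, a commonly used sample-size function that does take both α and power (1 − β) explicitly. The numbers below are hypothetical.

<syntaxhighlight lang="python">
from statsmodels.stats.power import TTestIndPower

# Effect size expressed in standard deviations (Cohen's d): delta / sigma
effect_size = 5 / 30   # hypothetical: detect an uplift of 5 on a metric with sd 30

n_per_group = TTestIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,    # Type I error rate
    power=0.80,    # 1 - beta
    ratio=1.0,     # equal group sizes
)
print(round(n_per_group))  # roughly 566 users per group
</syntaxhighlight>
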
=How to run the simulation=

1) Add the UI-Copy1 clean file to a Jupyter notebook.

2) Run the first block of code. It should take no more than 20-30 seconds.

3) Run the second block of code. It should also finish quickly.

As a result, the following UI should be visible. (I did not have to install any additional software/libraries, but I read online that some people had problems.)

As can be seen, all the parameters from the real-life problem can be entered into the calculator.
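
The screenshots of the notebook and of the UI are not reproduced here. Purely as an illustration of what such a parameter form inside a Jupyter notebook could look like (this is an assumption, not the code from the UI-Copy1 clean file), it could be put together with ipywidgets:

<syntaxhighlight lang="python">
import ipywidgets as widgets
from IPython.display import display

# Assumed sketch only: input widgets for the parameters described above,
# not the actual UI shipped in the notebook.
mean_w   = widgets.FloatText(value=100.0, description="Mean")
var_w    = widgets.FloatText(value=900.0, description="Variance")
users_w  = widgets.IntText(value=50_000, description="Users/month")
effect_w = widgets.FloatSlider(value=0.20, min=0.0, max=0.5, step=0.01, description="Max effect")
alpha_w  = widgets.FloatSlider(value=0.05, min=0.01, max=0.10, step=0.01, description="Alpha")

display(widgets.VBox([mean_w, var_w, users_w, effect_w, alpha_w]))
</syntaxhighlight>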
