Using confidence intervals is one approach to assessing whether the proposed version in an A/B test is achieving business objectives relative to the baseline. This post discusses what confidence intervals are, how to interpret them, and how sample size affects them.

**Background Context**

Let us suppose that you are running an A/B test for an online ad. The goal of the test is to see if a change you are considering will improve click-through rates (CTR). You run the test until you have served \(1000\) impressions of each version, and you obtain the results shown in the table below:

| Version | Impressions | Clicks | CTR |
|---------|-------------|--------|--------|
| A       | 1000        | 20     | 2.0% |
| B       | 1000        | 37     | 3.7% |

How do you know if version B is *really* better than version A? It is possible that version B received \(17\) more clicks simply due to random chance and that, in reality, version B is no better than version A at generating clicks.

One approach to answering this question involves the use of confidence intervals. Running `prop.test(c(37, 20), c(1000, 1000))` in R gives the \(95\%\) confidence interval for the difference in CTR as \((0.14\%, 3.26\%)\). Other software packages may give you slightly different answers depending on how they choose to compute confidence intervals. In this post, we will try to understand what confidence intervals are and how to interpret the confidence interval that we obtained above.
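The post does not spell out the formula behind this interval, but by default R's `prop.test` reports a Wald-style interval for the difference of two proportions with a continuity correction. As a sketch (assuming that construction), the same numbers can be reproduced in Python using only the standard library:

```python
from math import sqrt
from statistics import NormalDist

def diff_ci(clicks_b, clicks_a, n_b, n_a, level=0.95):
    """Wald CI (with continuity correction) for the difference in CTR, p_b - p_a."""
    p_b, p_a = clicks_b / n_b, clicks_a / n_a
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)   # ~1.96 for a 95% interval
    se = sqrt(p_b * (1 - p_b) / n_b + p_a * (1 - p_a) / n_a)
    cc = (1 / n_b + 1 / n_a) / 2                    # continuity correction
    margin = z * se + cc
    diff = p_b - p_a
    return diff - margin, diff + margin

lo, hi = diff_ci(37, 20, 1000, 1000)
print(f"95% CI: ({lo:.2%}, {hi:.2%})")  # (0.14%, 3.26%), matching prop.test
```

Dropping the continuity correction gives a slightly narrower interval, which is one reason different packages report slightly different answers.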

**Definition of Confidence Interval**

The first two sentences of the Wikipedia entry for a confidence interval are given below:

In statistics, a confidence interval (CI) is a type of interval estimate (of a population parameter) that is computed from the observed data. The confidence level is the frequency (i.e., the proportion) of possible confidence intervals that contain the true value of their corresponding parameter.

Let us focus on the first sentence (paraphrased): ‘A confidence interval is an interval estimate for a population parameter that is computed from observed data.’ In our context, the population parameter is the true improvement in CTR of version B over version A. Thus, a confidence interval gives us an estimate of where the true level of improvement in CTR is likely to be. In our specific context, our confidence interval of \((0.14\%, 3.26\%)\) suggests that the true level of improvement could be between \(0.14\%\) and \(3.26\%\).

However, the confidence interval we computed is based on observed data which is a random quantity. Our observed clicks could potentially have been different depending on which consumers see our ads. Therefore, the confidence interval itself is random. In order to better understand the randomness of confidence intervals, let us simulate some data and see what happens when we compute confidence intervals.

**The Simulation**

Let us pretend for a moment that the true CTRs for versions A and B are \(2.00\%\) and \(3.50\%\) respectively, so that the true improvement in CTR is \(1.50\%\). We use these values to simulate \(1000\) impressions per version and calculate what proportion of those impressions are clicks. In our simulation, we will assume that whether a consumer clicks on our ad is random, with the caveat that the long-run proportion of consumers who click the version A ad is \(2.00\%\) and the long-run proportion who click the version B ad is \(3.50\%\). Specifically, we simulate our A/B experiment \(10,000\) times, where in each experimental run we simulate \(1000\) impressions for each version, count the clicks each version receives, and compute the corresponding \(95\%\) confidence interval.
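The original simulation was presumably done in R; a minimal Python sketch of the same idea is below (stdlib only, with fewer runs than the post's \(10,000\) to keep it quick, and the same Wald-with-continuity-correction interval assumed above):

```python
import random
from math import sqrt
from statistics import NormalDist

random.seed(7)
P_A, P_B, N, RUNS = 0.02, 0.035, 1000, 2000  # true CTRs; impressions; runs
TRUE_DIFF = P_B - P_A                         # true improvement: 1.50%
Z = NormalDist().inv_cdf(0.975)

def clicks(n, p):
    """Simulate n impressions; each is a click with probability p."""
    return sum(random.random() < p for _ in range(n))

covered = 0
for _ in range(RUNS):
    p_a, p_b = clicks(N, P_A) / N, clicks(N, P_B) / N
    se = sqrt(p_a * (1 - p_a) / N + p_b * (1 - p_b) / N)
    margin = Z * se + 1 / N                   # continuity correction
    covered += abs((p_b - p_a) - TRUE_DIFF) <= margin

print(f"coverage: {covered / RUNS:.1%}")      # close to 95%
```

Each run plays the role of one repetition of the A/B test; `covered / RUNS` is the fraction of intervals that capture the true improvement.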

**Simulation Results**

The figure below shows the computed confidence intervals for \(20\) runs out of the \(10,000\) simulations.

Some comments about the figure follow:

- The true level of improvement, \(1.50\%\), is shown in blue. The confidence intervals shown in green contain the true value, whereas the confidence interval shown in red does *not* contain the true value.
- In the simulation, \(94.95\%\) of the computed confidence intervals contain the true value, whereas the remaining \(5.05\%\) do *not* contain the true value.

These simulation findings help us understand the second sentence of the Wikipedia definition: ‘The confidence level is the frequency (i.e., the proportion) of possible confidence intervals that contain the true value of their corresponding parameter.’ In our simulation, we computed \(95\%\) confidence intervals, and indeed the proportion of intervals containing the true value, \(94.95\%\), is close to \(95\%\). If we had run more simulations (say \(100,000\) instead of \(10,000\)), the proportion of confidence intervals containing the true value would be even closer to \(95\%\).

**Correct Interpretations of Confidence Interval**

In reality, we do not know the true value of improvement in CTR and we run the A/B test just once. Thus, we do not know if the confidence interval we obtain contains the true value or not. However, the simulation study suggests the following interpretation of the confidence interval:

There is a \(0.95\) probability that the confidence interval \((0.14\%, 3.26\%)\) contains the true value of improvement in CTR.

All of the following are incorrect interpretations of the confidence intervals:

- There is a \(0.95\) probability that the true value of improvement is in the confidence interval \((0.14\%, 3.26\%)\).
- The true value of improvement is in the confidence interval.
- The higher the sample size the greater is our confidence that the true value of improvement is in the confidence interval.

If you look at the figure above, the true value of improvement (shown by the blue line) is not random at all. In contrast, the confidence intervals are the ones that are random as they fluctuate depending on the data that we obtain.

There is no way to know whether the true value is in the confidence interval or not. If we assume that it is, then there is a \(5\%\) chance that we are mistaken, since \(5\%\) of computed confidence intervals do not in fact contain the true value.

In order to better understand the impact of sample size on our conclusions, let us repeat the simulations with different sample sizes. The figure below shows the percentage of \(95\%\) confidence intervals that contain the true mean at different sample sizes.
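A sketch of this repeated simulation in Python (same assumed interval construction and stdlib-only setup as before; the specific sample sizes below are illustrative, not the ones from the post's figure):

```python
import random
from math import sqrt
from statistics import NormalDist

random.seed(7)
P_A, P_B, RUNS = 0.02, 0.035, 1000
TRUE_DIFF = P_B - P_A
Z = NormalDist().inv_cdf(0.975)

def clicks(n, p):
    """Simulate n impressions; each is a click with probability p."""
    return sum(random.random() < p for _ in range(n))

def coverage(n):
    """Fraction of 95% CIs containing the true difference at sample size n."""
    hits = 0
    for _ in range(RUNS):
        p_a, p_b = clicks(n, P_A) / n, clicks(n, P_B) / n
        se = sqrt(p_a * (1 - p_a) / n + p_b * (1 - p_b) / n)
        margin = Z * se + 1 / n               # continuity correction
        hits += abs((p_b - p_a) - TRUE_DIFF) <= margin
    return hits / RUNS

for n in (500, 1000, 4000):
    print(f"n = {n:5d}: coverage = {coverage(n):.1%}")  # all near 95%
```

Whatever the sample size, the coverage hovers around \(95\%\): a larger sample changes the intervals, not how often the procedure captures the truth.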

Given our understanding of confidence intervals, it should not come as a surprise that the percentage of confidence intervals that contain the true mean stays at \(95\%\) irrespective of sample size. Therefore, if we assume that the true value of improvement in CTR is in the confidence interval then the chances of making a mistake stay at \(5\%\) irrespective of sample size.

**Impact of Sample Size on Confidence Intervals**

While higher sample sizes do not influence the chances of reaching an incorrect conclusion, they do have an impact on the range of plausible values for the true improvement in CTR. The figure below shows the margin of error (which equals half the width of the confidence interval) as a function of the sample size.

The figure shows that the margin of error decreases as sample size increases. Thus, the benefit of a higher sample size is clear. If we are willing to assume that the true value of improvement in CTR is in the confidence interval then we obtain a narrower range of plausible values for the true value of improvement at higher sample sizes.
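The shrinking margin of error can be seen directly from the interval formula, without any simulation. A sketch (again assuming the Wald-with-continuity-correction construction, with the margin evaluated at the assumed true CTRs):

```python
from math import sqrt
from statistics import NormalDist

P_A, P_B = 0.02, 0.035                # assumed true CTRs from the simulation
Z = NormalDist().inv_cdf(0.975)

def margin_of_error(n):
    """Half-width of the 95% CI for the CTR difference, n impressions per version."""
    se = sqrt(P_A * (1 - P_A) / n + P_B * (1 - P_B) / n)
    return Z * se + 1 / n             # includes continuity correction

for n in (500, 1000, 4000, 16000):
    # shrinks roughly like 1/sqrt(n): quadrupling n about halves the margin
    print(f"n = {n:5d}: margin of error = {margin_of_error(n):.2%}")
```

Since the standard error scales as \(1/\sqrt{n}\), quadrupling the sample size roughly halves the margin of error.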

**Conclusion**

In summary, the following are the key takeaways:

- A \(95\%\) confidence interval contains the true value with \(95\%\) probability.
- If we assume that the true value is in the interval then there is a \(5\%\) chance that we are mistaken. The chances of being mistaken do not change with sample size.
- We cannot say that the probability of the true value being in the confidence interval is \(95\%\) as the true value is assumed to be an unknown constant and is not a random variable.
- As sample size increases, the width of confidence interval (i.e., the margin of error) decreases and hence we have a narrower range of plausible values for the true value.