Statistical Power of Coin Flips

August 29, 2015
python, statistics

There are a few topics I wish were taught in an undergraduate introduction to statistics course. One of those topics is Bayesian statistics; another is statistical power. In this post, I work through the analysis of flipping coins and how to calculate the statistical power for determining whether a coin is biased or fair.

When we flip a coin $n$ times with probability $p$ of getting heads, the number of heads follows a binomial distribution. We can apply hypothesis testing, where the null hypothesis states that the coin is fair ($p=0.5$) and the alternative hypothesis states that the coin is not fair ($p\neq0.5$). Additionally, we fix $\alpha=0.05$, meaning that when the null hypothesis is true, we will incorrectly reject it 5% of the time (a Type I error). We can find the corresponding confidence interval:

from scipy.stats import binom

alpha = 0.05
n = 100 # number of flips
p = 0.5 # fair
binom.interval(1 - alpha, n, p)
Out[4]:
(40.0, 60.0)

For 100 coin flips, if the number of heads falls between 40 and 60 (inclusive), we "fail to reject the null hypothesis"; otherwise, we "reject the null hypothesis." With $\alpha=0.05$, we would incorrectly "reject the null hypothesis" 5% of the time.
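Since the binomial distribution is discrete, the actual Type I error rate of this rule comes out a bit below the nominal 5%; as a quick sanity check, we can compute it directly:

# probability of falling outside the 40-60 acceptance region when the
# coin is actually fair (p = 0.5); this is the actual Type I error rate,
# slightly below alpha because the binomial distribution is discrete
1 - (binom.cdf(60, 100, 0.5) - binom.cdf(39, 100, 0.5))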

We may further ask: what is the probability that we "reject the null hypothesis" when $p\neq0.5$? Let's say the coin is biased with $p=0.55$; the probability of not getting 40 to 60 heads is:

# cdf = cumulative distribution function
1 - (binom.cdf(60, 100, 0.55) - binom.cdf(39, 100, 0.55))
Out[5]:
0.13519227295357183

If $p=0.55$, then there is a 13.5% chance that we will "reject the null hypothesis". In other words, 13.5% of the time we would be able to distinguish a biased coin from a fair one. This probability is commonly known as statistical power. Let's write a function to calculate the statistical power for different values of $p$ and numbers of flips, $n$:

def coin_power(n, p, alpha=0.05):
    """Calculates the statistical power of detecting a biased coin.
    Arguments:
    n -- number of flips
    p -- actual probability of heads
    alpha -- significance level (default 0.05)
    """
    # acceptance region under the null hypothesis (p = 0.5) for n flips
    lower, upper = binom.interval(1 - alpha, n, 0.5)
    # probability of landing outside the acceptance region when the
    # actual heads probability is p
    return 1 - (binom.cdf(upper, n, p) - binom.cdf(lower - 1, n, p))
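As a quick check, the function reproduces the 13.5% value computed above for 100 flips and $p=0.55$:

# should match the value computed directly above
coin_power(100, 0.55)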

For 100 flips, we plot the statistical power as a function of the actual heads probability, $p$:

import matplotlib.pyplot as plt
import numpy as np

fig, ax = plt.subplots(1, 1)
ax.set_title("Power vs Actual Heads Probability for n = 100")

def plot_power(n, ax, **kwargs):
    """Plots power vs actual heads
    Arguments:
    n -- number of flips
    ax -- matplotlib axes
    kwargs -- keyword arguments for plotting
    """
    p_values = np.linspace(0,1,1000)
    ax.plot(p_values, coin_power(n, p_values), **kwargs)
    ax.set_xticks(np.arange(0,1.1, 0.1))
    ax.set_yticks(np.arange(0,1.1, 0.1))
    ax.set_ybound((0, 1.05))
    ax.set_ylabel("Power")
    ax.set_xlabel("Actual Heads Probability")
    
plot_power(100, ax, color='r')

For 100 flips, if the actual heads probability is 0.4, then the power is about 0.5, which means we would not be able to tell the difference between a biased coin and a fair coin 50% of the time. One way to increase the power is to increase the number of flips, $n$:

fig, ax = plt.subplots(1, 1)

def plot_powers(ax):
    ax.set_title("Power vs Actual Heads Probability")
    plot_power(100, ax, color='r', label="flips = 100")
    plot_power(250, ax, color='b', label="flips = 250")
    plot_power(1000, ax, color='g', label="flips = 1000")
    plot_power(3000, ax, color='y', label="flips = 3000")
    ax.legend(bbox_to_anchor=(0.75, 0.6), loc=2, prop={'size':16})
    
plot_powers(ax)
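Reading a single point off these curves makes the trend concrete. For example, holding the actual heads probability fixed at 0.4, the power climbs toward 1 as the number of flips grows:

# power at a fixed actual heads probability of 0.4 for increasing n;
# the values increase toward 1 as the number of flips grows
[coin_power(n, 0.4) for n in (100, 250, 1000, 3000)]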

As we increase the number of flips, the power is high across a broader range of $p$ values. Let's focus on the region where the power is greater than 0.9:

fig, ax = plt.subplots(1, 1)
plot_powers(ax)
ax.annotate('Power > 0.9', xy=(0.4758, 0.95), size=16)
ax.fill_between([0, 1.0], [0.9, 0.9], [1.0, 1.0], alpha=0.2)
ax.set_ybound((0.85, 1.01))
ax.set_xbound((0.3, 0.7))
ax.set_xticks(np.arange(0.30,0.70,0.04));

Recall that power is the probability that we can distinguish a biased coin from a fair coin. By inspecting the plot above, we can make the following approximations (a numerical check follows the list):

  • If we flip a coin 100 times, we get at least 0.9 power for a biased coin where $p<0.34$ or $p>0.66$.
  • If we flip a coin 250 times, we get at least 0.9 power for a biased coin where $p<0.40$ or $p>0.60$.
  • If we flip a coin 1000 times, we get at least 0.9 power for a biased coin where $p<0.45$ or $p>0.55$.
  • If we flip a coin 3000 times, we get at least 0.9 power for a biased coin where $p<0.47$ or $p>0.53$.
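These cutoffs are read off the plot by eye, but we can also locate them numerically. Below is a rough sketch that scans a grid of $p$ values above 0.5 and reports the first one where the power reaches 0.9; by symmetry, the lower cutoff is the mirror image around 0.5:

# scan a grid of p values above 0.5 and report the first one where the
# power reaches 0.9; the lower cutoff is the mirror image around p = 0.5
p_grid = np.linspace(0.5, 1.0, 2001)
for n in (100, 250, 1000, 3000):
    powers = coin_power(n, p_grid)
    threshold = p_grid[np.argmax(powers >= 0.9)]
    print("flips = {}: power >= 0.9 for p > {:.3f}".format(n, threshold))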

With 100 flips, we can only distinguish a very biased coin, where $p<0.34$ or $p>0.66$, from a fair coin 90% of the time. With 10 times more flips (1000), we can distinguish a less biased coin, where $p<0.45$ or $p>0.55$, from a fair coin 90% of the time.

It is important that experiments have a large enough sample size so that there is enough statistical power to detect differences. If the sample size is too small, we can only detect a difference when it is very large.
