import random random.seed(1) import matplotlib.pyplot as plt from collections import defaultdict import scipy import scipy.optimize import numpy as np values = [0, 1] ans = defaultdict(int) for N in range(100000): tot = 0 for i in range(1000): tot += random.choice(values) ans[tot] += 1 start, end = 350, 650 x = np.arange(start, end) y = [] for i in range(start, end): if i in ans: y.append(ans[i]) else: if i > start: y.append(y[-1]) else: y.append(0) n = len(x) mean = 500. sigma = 10. def gaus(x,a,sigma): return a*scipy.exp(-(x-mean)**2/(2*sigma**2)) popt,pcov = scipy.optimize.curve_fit(gaus,x,y,p0=[1.0,sigma]) print popt plt.plot(x,y,'b+:',label='data') plt.plot(x,gaus(x,*popt),'ro:',label='fit') plt.legend() plt.show()PMC points out that a slightly easier way of getting the same value would be to use an equation: variance = n(1-p)p where n = 1000 and p is 0.5. The standard deviation is then the square root of this (i.e. sqrt of 250), 15.8.
However, the proper way of handling the original question is a power calculation. Typically power calculations are associated with calculating the necessary sample size. Here I have a fixed sample size (n=1000) and I want to figure out, given that the alternative hypothesis is true (biased coin), what number of heads must I observe to be 99.9% sure (power=0.999) that it would be show up as significant at the 0.1% level (alpha=0.001). The y = 0.5 below refers to the proportion of heads expected under the null hypothesis (unbiased coin). Since n=1000, we can use the normal approximation to the binomial distribution.
import math import scipy.stats def qnorm(p): return scipy.stats.norm.ppf(p) def solvequadratic(a, b, c): return [(-b+math.sqrt((b**2)-(4*(a*c))))/(2.*a), (-b-math.sqrt((b**2)-(4*(a*c))))/(2.*a)] if __name__ == "__main__": alpha = 0.001 # Significance level gamma = 0.999 # Power qgamma = qnorm(gamma) qalpha = qnorm(1.-(alpha/2.)) p = (qalpha+qgamma)**2 n = 1000 y = 0.5 a = 2*n+p b = (2*p-4*n)*y - 2*p # Note that if y is 0.5, then, b = -2*n-p (i.e. -a) c = -2*p*y + (2*n+p)*y*y print solvequadratic(a, b, c)...and the answer? 642. Thanks to ALC for help with this, though all errors are my own.
Notes: (added 04/12/2015)
Both the power and the significance level are conditional probabilities:
- The significance level (alpha) is the probability of rejecting the null hypothesis given that it is true, ie. prob(rejecting null hypothesis | null hypothesis is correct).
- The power (gamma) is the probability of detecting a difference, if indeed the difference does exist, ie. prob(rejecting null hypothesis | null hypothesis is incorrect).
2 comments:
Nice post.
As an aside you may want to reconsider your quadratic solver. Here's a link to explain the danger of the simple approach. http://rosettacode.org/wiki/Roots_of_a_quadratic_function
Good to know.
Post a Comment