1. A different hypothesis test for the mammography experiment
In the last lecture, we covered the basics of hypothesis testing with the HIP mammography study as our example. The study's aim is to determine whether offering mammographies for breast cancer detection reduces the rate of death due to breast cancer. There are 31000 individuals in each of the treatment and control groups; only those in the treatment group are offered mammographies.
We recap the elements of the hypothesis testing framework. In the mammography study, they are:
- The parametric model. We can write indicator variables for whether each patient in the treatment group dies of breast cancer as X_1, \ldots , X_{31000} \stackrel{i.i.d.}{\sim } \text {Bernoulli}(\pi ). Because \pi is small and the group is large, we can also approximate the total number of deaths as Y = X_1 + \ldots + X_{31000} \sim \text {Poisson}(\lambda ), with \lambda = 31000\pi.
- The null hypothesis H_0: \pi = 0.00203 (equivalently \lambda = 63, since 31000 \times 0.00203 \approx 63), and the alternative hypothesis H_ A: \pi < 0.00203 (equivalently \lambda < 63). We then decide whether or not to reject the null hypothesis based on a test.
- The test statistic T. We define T to simply be the number of deaths Y in the treatment group. Under H_0, it is distributed as T \sim \text {Binomial}(31000, 0.00203). This distribution can also be approximated as T \sim \text {Poisson}(63). The role of T is to distinguish between H_0 and H_ A: small values of T are evidence in favor of H_ A.
- The significance level \alpha = 0.05. This is the probability of rejecting the null hypothesis H_0 when it is in fact true (type I error), that is, the probability of concluding there is an effect when there is none. Generally, the threshold of the test statistic for rejecting the null hypothesis is set based on a chosen significance level.
- The p-value p. This is the probability, under the null hypothesis, that the test statistic takes a value at least as extreme (in the direction of the alternative hypothesis) as the one observed. It is computed from the distribution of T under H_0, which is given by the parametric model. The p-value varies with the observed data, and H_0 is rejected when p<\alpha.
- The power of the test. This is the probability of rejecting H_0 when H_ A is true, that is, 1 - P(\text {type II error}). When H_ A covers more than one parameter value, it is useful to write the power as a function of the parameter.
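The elements listed above can be tied together numerically. The following is a minimal sketch, assuming the Poisson approximation T \sim \text {Poisson}(63) under H_0 and a one-sided test that rejects for small T. The function names and the observed count `t_obs = 45` are hypothetical and chosen only for illustration; they are not taken from the study.

```python
import math

def poisson_pmf(k, lam):
    """P(T = k) for T ~ Poisson(lam), computed in log space for stability."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

def poisson_cdf(k, lam):
    """P(T <= k) for T ~ Poisson(lam)."""
    return sum(poisson_pmf(j, lam) for j in range(k + 1))

def p_value_lower(t_obs, lam0=63.0):
    """One-sided p-value P(T <= t_obs) under H_0: lambda = lam0."""
    return poisson_cdf(t_obs, lam0)

def critical_value(alpha=0.05, lam0=63.0):
    """Largest c with P(T <= c) <= alpha under H_0; reject H_0 when T <= c.

    Returns -1 if even T = 0 is not extreme enough to reject.
    """
    c, cum = -1, 0.0
    while cum + poisson_pmf(c + 1, lam0) <= alpha:
        c += 1
        cum += poisson_pmf(c, lam0)
    return c

def power(lam, alpha=0.05, lam0=63.0):
    """P(reject H_0) when the true Poisson parameter is lam."""
    return poisson_cdf(critical_value(alpha, lam0), lam)

# Hypothetical observed count, for illustration only (not from the text).
t_obs = 45
print("p-value:", p_value_lower(t_obs))
print("reject H_0 when T <=", critical_value())
print("power at lambda = 50:", power(50.0))
```

Evaluating `power` at \lambda = 63 gives the actual type I error probability of the test, which is at most \alpha because T is discrete. Evaluating it at smaller values of \lambda shows how the power grows as the true parameter moves further into H_ A.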
Throughout the hypothesis test, we treated the number of deaths in the treatment group as the only random variable, and compared the resulting death rate to \pi = 0.00203, the observed death rate in the control group, which we took as a fixed value. The question below examines the validity of this approach.
2. Hypergeometric probability distribution
The hypergeometric distribution is a discrete distribution based on the following probability problem:
“Suppose there are N balls in a bowl, K of which are red and the remaining N-K of which are blue. From the bowl, n balls are drawn without replacement. What is the probability that among the n balls drawn, exactly x are red?”
The solution to this problem is given by the following pmf:
\displaystyle \mathbb {P}(X = x) = \frac{\left(\text {Number of ways to choose } x \text { out of } K \text { red balls} \right) \cdot \left(\text {Number of ways to choose } n-x \text { out of } N-K \text { blue balls} \right)}{\text {Number of ways to choose } n \text { balls out of } N}
\displaystyle = \frac{\dbinom {K}{x}\dbinom {N-K}{n-x}}{\dbinom {N}{n}}.
This pmf defines the hypergeometric distribution \text {Hypergeometric}(N, K, n) with the three parameters:
- N, size of population (number of balls in bowl)
- K, size of sub-population of interest (number of red balls in bowl)
- n, the sample size (total number of balls drawn).
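The pmf above can be evaluated directly from binomial coefficients. The following is a minimal sketch using only the Python standard library; the function name `hypergeom_pmf` and the small bowl used in the example (N = 10, K = 4, n = 3) are hypothetical choices made for illustration.

```python
from math import comb

def hypergeom_pmf(x, N, K, n):
    """P(X = x) for X ~ Hypergeometric(N, K, n).

    N: population size, K: number of red balls, n: number of balls drawn.
    """
    # x red balls need n - x blue balls, so x must satisfy
    # max(0, n - (N - K)) <= x <= min(n, K); otherwise the probability is 0.
    if x < max(0, n - (N - K)) or x > min(n, K):
        return 0.0
    return comb(K, x) * comb(N - K, n - x) / comb(N, n)

# Illustrative bowl: 10 balls, 4 red, draw 3 without replacement.
# P(exactly 1 red) = C(4,1) * C(6,2) / C(10,3) = 4 * 15 / 120
print(hypergeom_pmf(1, N=10, K=4, n=3))  # 0.5

# The probabilities over all possible x sum to 1 (up to rounding).
print(sum(hypergeom_pmf(x, 10, 4, 3) for x in range(0, 4)))
```

Working with exact integers from `comb` and dividing only once at the end avoids accumulating rounding error, which matters when N is in the tens of thousands, as in the mammography study.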