Hypothesis Testing
Board Coverage
| Board | Paper | Notes |
|---|---|---|
| AQA | Paper 1, 2 | Binomial tests in P1; normal tests in P2 |
| Edexcel | P1, P2 | Similar |
| OCR (A) | Paper 1, 2 | Includes critical regions |
| CIE (9709) | P1, P6 | Basic hypothesis testing in P6 |
Hypothesis testing requires clear, structured answers. Always state your hypotheses, test statistic, critical value/region, comparison, and conclusion in context.
1. Hypotheses
1.1 Null and alternative hypotheses
Definition.
- The null hypothesis $H_0$ is the default assumption (usually "no effect" or "no change").
- The alternative hypothesis $H_1$ is what we are trying to find evidence for.
1.2 One-tailed and two-tailed tests
- One-tailed: $H_1: \theta > \theta_0$ (right-tailed) or $H_1: \theta < \theta_0$ (left-tailed).
- Two-tailed: $H_1: \theta \neq \theta_0$.
Here $\theta$ is the parameter being tested and $\theta_0$ is its value under $H_0$. The choice depends on the research question. Use a one-tailed test only when you have a specific directional prediction before seeing the data.
Choosing a one-tailed test after seeing the data (because the results happen to go in one direction) is a form of $p$-hacking and is statistically invalid. The tail direction must be decided before the experiment.
2. Critical Values and Significance Levels
2.1 Significance level
Definition. The significance level $\alpha$ is the maximum probability of incorrectly rejecting $H_0$ when it is true. Common values: 1%, 5%, 10%.
2.2 Critical value
The critical value is the boundary between the acceptance and rejection regions.
2.3 Critical region
The critical region (rejection region) is the set of values of the test statistic that lead to rejection of $H_0$.
2.4 Actual significance level
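For discrete distributions, the actual significance level may differ from the nominal level because we cannot achieve exactly $\alpha$.
Example. For $X \sim B(15, 0.5)$, a right-tailed test at $\alpha = 0.05$:
$H_0: p = 0.5$, $H_1: p > 0.5$. We find the smallest $c$ such that $P(X \geq c) \leq 0.05$.
$P(X \geq 11) = 0.0592 > 0.05$, $P(X \geq 12) = 0.0176 \leq 0.05$.
Critical region: $X \geq 12$. Actual significance level: 1.76%.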
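The search for a critical region can be sketched in a few lines of Python. This is an illustration only: the helper names are made up for this sketch, and $B(15, 0.5)$ is taken as the distribution because it reproduces the 1.76% quoted above.

```python
from math import comb

def upper_tail(c, n, p):
    """P(X >= c) for X ~ B(n, p), summed directly from the binomial pmf."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(c, n + 1))

def right_tailed_critical_region(n, p, alpha):
    """Return (c, actual_level): the smallest c with P(X >= c) <= alpha."""
    for c in range(n + 2):          # c = n + 1 gives an empty tail (probability 0)
        tail = upper_tail(c, n, p)
        if tail <= alpha:
            return c, tail

c, actual = right_tailed_critical_region(15, 0.5, 0.05)
print(c, round(actual, 4))  # 12 0.0176
```

The actual significance level (1.76%) is well below the nominal 5% because moving the boundary down by one, to $X \geq 11$, would push the tail probability above 5%.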
3. Type I and Type II Errors
3.1 Definitions
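Definition.
- Type I error: Rejecting $H_0$ when $H_0$ is true (false positive). $P(\text{Type I error}) = \alpha$.
- Type II error: Failing to reject $H_0$ when $H_0$ is false (false negative). $P(\text{Type II error}) = \beta$.
- The power of a test is $1 - \beta$, the probability of correctly rejecting a false $H_0$.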
3.2 Relationship
For a fixed sample size, decreasing $\alpha$ (making the test stricter) generally increases $\beta$ (more false negatives), so there is a trade-off between Type I and Type II errors.
Intuition. Think of a courtroom: a Type I error is convicting an innocent person; a Type II error is acquitting a guilty person. Making the standard of proof higher (beyond reasonable doubt) reduces Type I errors but increases Type II errors. You cannot eliminate both simultaneously.
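The trade-off can be made concrete with a small calculation. The sketch below assumes a right-tailed $z$-test in which the true mean lies `shift` standard errors above the null mean; both the function name and the value `shift = 2.0` are hypothetical choices for illustration.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal

def type_ii_error(alpha, shift):
    """beta for a right-tailed z-test when the true mean is `shift`
    standard errors above the null mean (an assumed setup)."""
    z_crit = Z.inv_cdf(1 - alpha)      # reject H0 when z > z_crit
    return Z.cdf(z_crit - shift)       # P(fail to reject | true mean)

for alpha in (0.10, 0.05, 0.01):
    print(alpha, round(type_ii_error(alpha, shift=2.0), 3))
```

As $\alpha$ falls from 10% to 1%, the printed $\beta$ rises, which is the courtroom trade-off in numerical form.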
4. Hypothesis Testing Procedure
4.1 Standard method
1. Define the random variable and its distribution under $H_0$.
2. State $H_0$ and $H_1$.
3. State the significance level $\alpha$.
4. Calculate the critical region (or critical value).
5. Determine the test statistic from the data.
6. Compare the test statistic to the critical value.
7. Conclude in context.
4.2 Using $p$-values
Alternatively:
1–3. Same as above.
4. Calculate the $p$-value: the probability of obtaining a result at least as extreme as the observed value, assuming $H_0$ is true.
5. If $p$-value $\leq \alpha$, reject $H_0$. Otherwise, do not reject $H_0$.
6. Conclude in context.
5. Binomial Hypothesis Tests
5.1 Single proportion test
Example. A coin is tossed 20 times and lands heads 15 times. Test at the 5% significance level whether the coin is biased towards heads.
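Let $X$ be the number of heads in 20 tosses, so $X \sim B(20, p)$.
$H_0: p = 0.5$, $H_1: p > 0.5$. One-tailed, $\alpha = 0.05$.
Under $H_0$: $X \sim B(20, 0.5)$.
Find the smallest $c$ such that $P(X \geq c) \leq 0.05$.
$P(X \geq 14) = 0.0577 > 0.05$. $P(X \geq 15) = 0.0207 \leq 0.05$.
Critical region: $X \geq 15$. Since $x = 15$ is in the critical region, we reject $H_0$.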
There is sufficient evidence at the 5% level that the coin is biased towards heads.
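The same coin example can be checked with the $p$-value method of Section 4.2. The sketch below is only a cross-check under the null model $X \sim B(20, 0.5)$; the helper name `binom_upper_tail` is invented for this illustration.

```python
from math import comb

def binom_upper_tail(x, n, p):
    """P(X >= x) for X ~ B(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(x, n + 1))

alpha = 0.05
p_value = binom_upper_tail(15, 20, 0.5)    # P(X >= 15) assuming H0 is true
print(round(p_value, 4))                    # 0.0207
print("reject H0" if p_value <= alpha else "do not reject H0")
```

Both routes agree: 15 heads lies in the critical region, and equivalently the $p$-value is below 5%.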
6. Normal Hypothesis Tests
6.1 Test for a mean (known variance)
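Example. A machine fills bags with mean weight 500 g. A sample of 30 bags gives sample mean $\bar{x}$ g. Test at the 5% level whether the mean weight has decreased, given population standard deviation $\sigma$ g.
$H_0: \mu = 500$, $H_1: \mu < 500$. $\alpha = 0.05$.
Under $H_0$: $\bar{X} \sim N\left(500, \dfrac{\sigma^2}{30}\right)$.
$z = \dfrac{\bar{x} - 500}{\sigma / \sqrt{30}}$.
Critical value: $z = -1.645$.
Since the observed $z < -1.645$, we reject $H_0$.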
There is sufficient evidence that the mean weight has decreased.
6.2 Large sample test for a proportion
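For large $n$: $\hat{p} \sim N\left(p, \dfrac{p(1-p)}{n}\right)$ approximately.
Test statistic: $z = \dfrac{\hat{p} - p_0}{\sqrt{p_0(1 - p_0)/n}}$, where $p_0$ is the value of $p$ under $H_0$.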
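A minimal sketch of this test statistic follows. It uses the figures from Problem 2 in the problem set (170 passes out of 200, claimed proportion 0.9); the function name `proportion_z` is an assumption of this sketch, and no continuity correction is applied.

```python
from math import sqrt
from statistics import NormalDist

def proportion_z(x, n, p0):
    """z statistic for a sample proportion, using the null value p0
    (not the sample proportion) in the standard error."""
    p_hat = x / n
    return (p_hat - p0) / sqrt(p0 * (1 - p0) / n)

z = proportion_z(170, 200, 0.9)
p_left = NormalDist().cdf(z)       # left-tailed p-value, P(Z <= z)
print(round(z, 3))                 # -2.357
print(p_left < 0.05)               # True
```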
7. Interpreting Results
"Failing to reject $H_0$" is not the same as "proving $H_0$ is true." It means the data does not provide sufficient evidence against $H_0$. The test may lack power (sample too small, effect too weak).
8. One-Tailed vs Two-Tailed Tests in Depth
8.1 Choosing between one-tailed and two-tailed
Use a one-tailed test when:
- The research question has a specific directional prediction established before data collection.
- Only one direction of deviation is practically meaningful.
- The consequence of missing an effect in the unexpected direction is negligible.
Use a two-tailed test when:
- You are interested in any difference from the null value, regardless of direction.
- You want a more conservative test, in which significance is harder to reach.
- There is no strong prior reason to expect the effect in one specific direction.
Example. Testing whether a new teaching method changes exam scores:
- One-tailed ($H_1: \mu > \mu_0$): justified only if prior research strongly suggests the method improves scores, and you would not act on a decrease.
- Two-tailed ($H_1: \mu \neq \mu_0$): appropriate if the method is new and could either help or harm, and either outcome matters.
8.2 Critical region comparison
For a test at significance level $\alpha$, the allocation of the significance level differs:
- One-tailed: The entire $\alpha$ goes into one tail. The critical value is at the $1 - \alpha$ quantile (right-tailed) or the $\alpha$ quantile (left-tailed).
- Two-tailed: $\alpha/2$ goes into each tail. The critical values are at the $\alpha/2$ and $1 - \alpha/2$ quantiles.
This means the two-tailed test has a higher bar for each individual tail.
Example. Standard normal test at $\alpha = 0.05$:
- One-tailed ($H_1: \mu > \mu_0$): reject $H_0$ if $z > 1.645$.
- Two-tailed ($H_1: \mu \neq \mu_0$): reject $H_0$ if $z < -1.96$ or $z > 1.96$.
An observed $z$ between 1.645 and 1.96 is significant for the one-tailed test ($z > 1.645$) but not for the two-tailed test ($|z| < 1.96$).
A two-tailed test at level $\alpha$ requires a more extreme test statistic than a one-tailed test at the same $\alpha$, because the significance "budget" is split between two tails. A two-tailed test at $\alpha = 0.05$ corresponds roughly to two one-tailed tests each at $\alpha = 0.025$.
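The quantiles quoted here can be reproduced with Python's standard library. This is a small sketch; `critical_z` is a hypothetical helper name, not part of any library.

```python
from statistics import NormalDist

def critical_z(alpha, two_tailed):
    """Positive critical z-value: all of alpha in one tail,
    or alpha/2 in each tail for a two-tailed test."""
    tail_area = alpha / 2 if two_tailed else alpha
    return NormalDist().inv_cdf(1 - tail_area)

one = critical_z(0.05, two_tailed=False)
two = critical_z(0.05, two_tailed=True)
print(round(one, 3), round(two, 3))  # 1.645 1.96
```

Calling `critical_z(0.025, two_tailed=False)` returns the same value as the two-tailed 5% case, which is the "two one-tailed tests at 2.5%" reading.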
8.3 Effect on power
For the same $\alpha$, a one-tailed test has greater power than a two-tailed test against an alternative in the predicted direction, because the critical value is closer to the null value. However, a one-tailed test has essentially no power (less than $\alpha$) to detect an effect in the opposite direction.
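The comparison can be illustrated numerically. The sketch below assumes a $z$-test in which the true mean sits `shift` standard errors from the null mean; the function names and the shift of 2 are hypothetical values chosen for illustration.

```python
from statistics import NormalDist

Z = NormalDist()

def power_one_tailed(alpha, shift):
    """Power of a right-tailed z-test; shift = (true mean - null mean) / SE."""
    return 1 - Z.cdf(Z.inv_cdf(1 - alpha) - shift)

def power_two_tailed(alpha, shift):
    """Power of a two-tailed z-test: probability of landing in either tail."""
    z = Z.inv_cdf(1 - alpha / 2)
    return (1 - Z.cdf(z - shift)) + Z.cdf(-z - shift)

for shift in (2.0, -2.0):
    print(shift,
          round(power_one_tailed(0.05, shift), 4),
          round(power_two_tailed(0.05, shift), 4))
```

For the positive shift the one-tailed power is the larger of the two; for the negative shift it collapses to nearly zero while the two-tailed power is unchanged.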
9. Binomial Tests with Normal Approximation
9.1 When to use the normal approximation
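When $n$ is sufficiently large, the binomial distribution can be approximated by a normal distribution. The standard conditions are:
$$np \geq 5 \quad \text{and} \quad n(1 - p) \geq 5$$
(some textbooks use 10 rather than 5 as the threshold).
Under these conditions:
$$X \sim B(n, p) \approx N\big(np,\ np(1 - p)\big)$$
Equivalently, for the sample proportion $\hat{p} = X/n$:
$$\hat{p} \approx N\left(p, \frac{p(1 - p)}{n}\right)$$
Warning. In a hypothesis test, compute the mean and variance using the null value $p_0$ (the value of $p$ under $H_0$), not the observed sample proportion $\hat{p}$.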
9.2 Continuity correction
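Since the binomial distribution is discrete and the normal distribution is continuous, a continuity correction improves the accuracy of the approximation. Writing $X$ for the binomial variable and $Y$ for the approximating normal variable:
- For $P(X \leq k)$, use $P(Y \leq k + 0.5)$.
- For $P(X \geq k)$, use $P(Y \geq k - 0.5)$.
- For $P(X = k)$, use $P(k - 0.5 \leq Y \leq k + 0.5)$ in the normal.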
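To see how much the correction helps, the sketch below compares the exact binomial tail with the normal approximation, with and without the half-unit shift. It borrows $n = 120$, $p = 0.4$ and $k = 58$ from the bus survey example in these notes; the variable names are this sketch's own.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p, k = 120, 0.4, 58

# Exact P(X >= k) for X ~ B(n, p)
exact = sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

# Approximating normal Y ~ N(np, np(1-p))
Y = NormalDist(n * p, sqrt(n * p * (1 - p)))
uncorrected = 1 - Y.cdf(k)         # P(Y >= k)
corrected = 1 - Y.cdf(k - 0.5)     # P(Y >= k - 0.5)

print(round(exact, 4), round(uncorrected, 4), round(corrected, 4))
```

In this case the corrected value is noticeably closer to the exact tail probability than the uncorrected one.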
9.3 Worked example
Example. Historically, 40% of students at a school take the bus. In a survey of 120 students, 58 take the bus. Test at the 5% level whether the proportion has changed.
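Let $X$ be the number of students in the sample who take the bus, so $X \sim B(120, p)$. $H_0: p = 0.4$, $H_1: p \neq 0.4$. Two-tailed, $\alpha = 0.05$.
Check conditions using $p_0$: $np_0 = 48 \geq 5$ and $n(1 - p_0) = 72 \geq 5$. Conditions satisfied.
Under $H_0$: $X \approx N(48, 28.8)$, so $\sigma = \sqrt{28.8} \approx 5.367$.
Using continuity correction, the observed 58 lies above the mean, so $P(X \geq 58)$ is approximated by $P(Y \geq 57.5)$:
$$z = \frac{57.5 - 48}{5.367} \approx 1.77$$
Two-tailed critical values: $z = \pm 1.96$. Since $1.77 < 1.96$, do not reject $H_0$.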
There is insufficient evidence at the 5% level that the proportion of bus users has changed.
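The steps of this worked example can be collected into one function. This is a sketch under the assumptions of Section 9 (normal approximation with a half-unit continuity correction towards the mean); the function name and return layout are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def two_tailed_binomial_z_test(x, n, p0, alpha=0.05):
    """Normal-approximation test of H0: p = p0 against H1: p != p0.
    Returns (z, critical z, reject?)."""
    mean = n * p0
    sd = sqrt(n * p0 * (1 - p0))       # null value p0, not the sample proportion
    if x > mean:
        corrected = x - 0.5            # upper tail: P(X >= x) ~ P(Y >= x - 0.5)
    elif x < mean:
        corrected = x + 0.5            # lower tail: P(X <= x) ~ P(Y <= x + 0.5)
    else:
        corrected = x
    z = (corrected - mean) / sd
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return z, z_crit, abs(z) > z_crit

z, z_crit, reject = two_tailed_binomial_z_test(58, 120, 0.4)
print(round(z, 2), round(z_crit, 2), reject)  # 1.77 1.96 False
```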
10. Confidence Intervals
10.1 Definition
A confidence interval gives a range of plausible values for a population parameter, together with a specified level of confidence.
Definition. A $100(1 - \alpha)\%$ confidence interval for a parameter $\theta$ is an interval constructed from sample data such that, in repeated sampling, $100(1 - \alpha)\%$ of such intervals would contain the true value of $\theta$.
A 95% confidence interval does not mean there is a 95% probability that $\theta$ lies in the interval. The parameter is fixed; it either is or is not in the interval. The 95% refers to the long-run proportion of intervals (across many repeated samples) that capture $\theta$.
10.2 95% confidence interval for a population proportion
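For large $n$ where $n\hat{p} \geq 5$ and $n(1 - \hat{p}) \geq 5$, the sample proportion is approximately normal. The $100(1 - \alpha)\%$ confidence interval for $p$ is:
$$\hat{p} \pm z_{\alpha/2}\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$$
For a 95% confidence interval, $z_{0.025} = 1.96$:
$$\hat{p} \pm 1.96\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$$
The margin of error is $1.96\sqrt{\hat{p}(1 - \hat{p})/n}$, which decreases as $n$ increases.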
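The repeated-sampling interpretation of the 95% figure can be checked by simulation. The sketch below is illustrative only: the true proportion 0.4, sample size 120, number of trials and seed are all arbitrary choices made for this example.

```python
import random
from math import sqrt

def coverage(p_true, n, trials, seed=0):
    """Fraction of simulated 95% intervals for a proportion that contain p_true."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        x = sum(rng.random() < p_true for _ in range(n))    # one binomial sample
        p_hat = x / n
        half_width = 1.96 * sqrt(p_hat * (1 - p_hat) / n)   # margin of error
        if p_hat - half_width <= p_true <= p_hat + half_width:
            hits += 1
    return hits / trials

print(coverage(p_true=0.4, n=120, trials=2000))
```

The printed fraction should be close to 0.95. Each individual interval either contains 0.4 or does not; the 95% describes the procedure over many samples.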
10.3 Connection to hypothesis testing
There is a direct and important link between confidence intervals and two-tailed hypothesis tests:
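- A $100(1 - \alpha)\%$ confidence interval contains exactly those values of $\theta_0$ that would not be rejected by a two-tailed test of $H_0: \theta = \theta_0$ at level $\alpha$.
- If $\theta_0$ falls outside the confidence interval, then $H_0$ is rejected at level $\alpha$.
- If $\theta_0$ falls inside the confidence interval, then $H_0$ is not rejected at level $\alpha$.
Example. Using the bus survey data: $\hat{p} = 58/120 \approx 0.483$, $n = 120$.
$$0.483 \pm 1.96\sqrt{\frac{0.483 \times 0.517}{120}} = 0.483 \pm 0.089, \quad \text{giving } (0.394, 0.573)$$
Since $p_0 = 0.4$ lies inside $(0.394, 0.573)$, we do not reject $H_0: p = 0.4$ at the 5% level. This is consistent with the hypothesis test result in Section 9.3.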
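A short sketch of the interval calculation, applied to the bus survey figures (58 out of 120). The function name `proportion_ci` is hypothetical, and the critical value comes from the standard normal quantile function rather than a rounded table value.

```python
from math import sqrt
from statistics import NormalDist

def proportion_ci(x, n, level=0.95):
    """Normal-approximation confidence interval for a population proportion."""
    p_hat = x / n
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)   # about 1.96 for 95%
    half_width = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

low, high = proportion_ci(58, 120)
print(round(low, 3), round(high, 3))   # 0.394 0.573
print(low < 0.4 < high)                # True: p0 = 0.4 would not be rejected
```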
11. Interpreting p-Values
11.1 Formal definition
Definition. The $p$-value is the probability of obtaining a test statistic at least as extreme as the observed value, assuming $H_0$ is true.
For a two-tailed test, "at least as extreme" means at least as far from the null value in either direction, so for a symmetric distribution the one-tail probability is doubled.
11.2 Decision rule
- If $p \leq \alpha$: reject $H_0$. The result is statistically significant.
- If $p > \alpha$: do not reject $H_0$. The result is not statistically significant.
11.3 Strength of evidence
The smaller the $p$-value, the stronger the evidence against $H_0$:
| $p$-value range | Strength of evidence against $H_0$ |
|---|---|
| $p > 0.10$ | Little to no evidence |
| $0.05 < p \leq 0.10$ | Weak evidence |
| $0.01 < p \leq 0.05$ | Moderate evidence |
| $0.001 < p \leq 0.01$ | Strong evidence |
| $p \leq 0.001$ | Very strong evidence |
11.4 Common misinterpretations
- The $p$-value is not the probability that $H_0$ is true.
- The $p$-value is not the probability that the observed result occurred by chance.
- A large $p$-value does not prove $H_0$ is true; it only means the data is consistent with $H_0$.
- Statistical significance does not imply practical or scientific importance.
- The $p$-value depends on sample size: with a very large sample, even trivially small effects can produce tiny $p$-values.
11.5 Worked example
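Example. A factory produces components with mean length 50 mm. A sample of 40 components gives $\bar{x} = 50.8$ mm. Given $\sigma = 3$ mm, find the $p$-value for testing $H_0: \mu = 50$ vs $H_1: \mu \neq 50$.
Under $H_0$: $\bar{X} \sim N\left(50, \dfrac{3^2}{40}\right)$.
$$z = \frac{50.8 - 50}{3/\sqrt{40}} \approx 1.687, \qquad p = 2P(Z > 1.687) \approx 2 \times 0.0458 = 0.092$$
Since $0.092 > 0.05$, we do not reject $H_0$ at the 5% level.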
Interpretation: If the true mean were 50 mm, there would be approximately a 9.2% chance of observing a sample mean at least as far from 50 mm as 50.8 mm. This is not unusual enough to provide convincing evidence against .
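The calculation can be reproduced as follows. This sketch takes $\sigma = 3$ mm as an assumption (it is the value consistent with the 9.2% quoted above), and the function name is made up for this example.

```python
from math import sqrt
from statistics import NormalDist

def two_tailed_p_value(x_bar, mu0, sigma, n):
    """z statistic and two-tailed p-value for a mean with known sigma."""
    z = (x_bar - mu0) / (sigma / sqrt(n))
    p = 2 * (1 - NormalDist().cdf(abs(z)))   # double the one-tail probability
    return z, p

z, p = two_tailed_p_value(x_bar=50.8, mu0=50, sigma=3, n=40)
print(round(z, 3), round(p, 3))  # 1.687 0.092
```

Rerunning with a larger `n` and the same 0.8 mm gap gives a smaller $p$-value, which is the sample-size dependence noted in Section 11.4.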
Problem Set
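Problem 1
A die is rolled 60 times and a 6 appears 16 times. Test at the 5% level whether the die is biased.
Solution 1
Let $X$ be the number of sixes, so $X \sim B(60, p)$. $H_0: p = \frac{1}{6}$, $H_1: p \neq \frac{1}{6}$. Two-tailed, $\alpha = 0.05$.
Under $H_0$: $X \sim B\left(60, \frac{1}{6}\right)$. $E(X) = 10$, $\text{Var}(X) = 60 \times \frac{1}{6} \times \frac{5}{6} \approx 8.33$.
Using normal approximation: $z = \dfrac{16 - 10}{\sqrt{8.33}} \approx 2.08$.
Two-tailed: critical values $\pm 1.96$. $2.08 > 1.96$, so reject $H_0$.
There is evidence at the 5% level that the die is biased.
This calculation omits the continuity correction. With it, $z = \dfrac{15.5 - 10}{\sqrt{8.33}} \approx 1.91 < 1.96$, so the result is borderline and sensitive to that choice.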
If you get this wrong, revise: Binomial Hypothesis Tests — Section 5.
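Problem 2
A manufacturer claims that 90% of their products pass quality control. In a sample of 200, 170 pass. Test the claim at the 5% significance level.
Solution 2
Let $X$ be the number of products that pass, so $X \sim B(200, p)$. $H_0: p = 0.9$, $H_1: p < 0.9$. Left-tailed, $\alpha = 0.05$.
Under $H_0$: $\hat{p} \approx N\left(0.9, \dfrac{0.9 \times 0.1}{200}\right)$.
$z = \dfrac{0.85 - 0.9}{\sqrt{0.9 \times 0.1/200}} \approx -2.36$.
Critical value: $z = -1.645$. Since $-2.36 < -1.645$, reject $H_0$.
There is evidence that the proportion passing quality control is less than 90%.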
If you get this wrong, revise: Normal Hypothesis Tests — Section 6.
Problem 3
Explain the difference between a Type I error and a Type II error in the context of medical testing.
Solution 3
Type I error: The test says a healthy person is sick (false positive). This leads to unnecessary treatment and anxiety.
Type II error: The test says a sick person is healthy (false negative). This means the person goes untreated, potentially with serious consequences.
If you get this wrong, revise: Type I and Type II Errors — Section 3.
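Problem 4
Find the critical region for a test of $H_0: p = 0.3$ vs $H_1: p > 0.3$ using $X \sim B(10, p)$ at the 5% level.
Solution 4
Under $H_0$: $X \sim B(10, 0.3)$. $P(X \geq 5) = 0.1503 > 0.05$. $P(X \geq 6) = 0.0473 \leq 0.05$.
Critical region: $X \geq 6$. Actual significance level: 4.73%.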
If you get this wrong, revise: Critical Region — Section 2.3.
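Problem 5
The mean lifetime of a bulb is claimed to be 1000 hours. A sample of 50 bulbs gives sample mean $\bar{x}$ hours with sample standard deviation $s$ hours. Test at the 1% level whether the mean lifetime is less than 1000 hours.
Solution 5
$H_0: \mu = 1000$, $H_1: \mu < 1000$. $\alpha = 0.01$. $\bar{X} \sim N\left(1000, \dfrac{s^2}{50}\right)$ approximately.
$z = \dfrac{\bar{x} - 1000}{s/\sqrt{50}}$.
Critical value at 1%: $z = -2.326$. Since the observed $z < -2.326$, reject $H_0$.
There is evidence at the 1% level that the mean lifetime is less than 1000 hours.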
If you get this wrong, revise: Normal Hypothesis Tests — Section 6.
Problem 6
For , find the critical region for a two-tailed test at the 10% significance level.
Solution 6
Under $H_0$: . For each tail, we need $P(X \leq c_1) \leq 0.05$ and $P(X \geq c_2) \leq 0.05$.
Lower: , . So . Upper: , . So .
Critical region: or . Actual significance level: .
If you get this wrong, revise: Critical Values and Significance Levels — Section 2.
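Problem 7
A teacher claims that the average score on a test is 70%. In a class of 25, the mean score is 66% with standard deviation 12%. Test at the 5% level.
Solution 7
$H_0: \mu = 70$, $H_1: \mu \neq 70$. Two-tailed, $\alpha = 0.05$. $\bar{X} \sim N\left(70, \dfrac{12^2}{25}\right)$ approximately.
$z = \dfrac{66 - 70}{12/\sqrt{25}} = \dfrac{-4}{2.4} \approx -1.67$.
Two-tailed critical values: $\pm 1.96$. $|-1.67| < 1.96$, so do not reject $H_0$.
There is insufficient evidence at the 5% level that the mean score differs from 70%.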
If you get this wrong, revise: Normal Hypothesis Tests — Section 6.
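Problem 8
A drug is effective for 60% of patients. After a new treatment, 18 out of 25 patients are cured. Test whether the new treatment is more effective at the 5% level.
Solution 8
Let $X$ be the number of patients cured, so $X \sim B(25, p)$. $H_0: p = 0.6$, $H_1: p > 0.6$. Right-tailed, $\alpha = 0.05$. Under $H_0$: $X \sim B(25, 0.6)$.
$P(X \geq 19) = 0.0736 > 0.05$. $P(X \geq 20) = 0.0294 \leq 0.05$.
Critical region: $X \geq 20$. Since $18 < 20$, do not reject $H_0$.
Insufficient evidence that the new treatment is more effective.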
If you get this wrong, revise: Binomial Hypothesis Tests — Section 5.
Problem 9
Explain why failing to reject $H_0$ does not mean $H_0$ is true.
Solution 9
Failing to reject $H_0$ means the data is consistent with $H_0$ but does not prove it. The test may lack sufficient power to detect a real effect. For example, if a drug has a small but real benefit, a small sample may not detect it, leading us to fail to reject $H_0$ even though the drug is effective. The absence of evidence is not evidence of absence.
If you get this wrong, revise: Interpreting Results — Section 7.
Problem 10
For a test of vs at the 5% level with and , find the probability of a Type II error if the true mean is .
Solution 10
Under $H_0$: . Critical value: .
Type II error = failing to reject $H_0$ when .
under the true distribution.
.
So and the power is .
If you get this wrong, revise: Type I and Type II Errors — Section 3.
Problem 11
A researcher tests whether a new drug changes recovery time. She uses a two-tailed test of $H_0: \mu = \mu_0$ vs $H_1: \mu \neq \mu_0$ at $\alpha = 0.05$ and obtains a test statistic $z$ with $1.645 < z < 1.96$. (a) What is her conclusion? (b) If she had instead used a right-tailed test at the same level, would her conclusion change? Explain.
Solution 11
(a) Two-tailed test: critical values $\pm 1.96$. $|z| < 1.96$, so do not reject $H_0$. There is insufficient evidence that recovery time has changed.
(b) One-tailed test: critical value $1.645$. Since $z > 1.645$, we reject $H_0$. There is sufficient evidence that recovery time has increased.
The conclusion changes because a one-tailed test allocates the entire 5% significance level to one tail, making the critical value less extreme. This illustrates why the choice between one-tailed and two-tailed must be made before seeing the data.
If you get this wrong, revise: One-Tailed vs Two-Tailed Tests in Depth — Section 8.
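Problem 12
A survey of 200 households in a town finds that 45 regularly recycle. The national recycling rate is 20%. Test at the 5% level whether the recycling rate in this town differs from the national rate, using a normal approximation with continuity correction.
Solution 12
Let $X$ be the number of households that recycle, so $X \sim B(200, p)$. $H_0: p = 0.2$, $H_1: p \neq 0.2$. Two-tailed, $\alpha = 0.05$.
Check conditions using $p_0$: $np_0 = 40 \geq 5$ and $n(1 - p_0) = 160 \geq 5$. Conditions satisfied.
Under $H_0$: $X \approx N(40, 32)$, $\sigma = \sqrt{32} \approx 5.657$.
Using continuity correction:
$$z = \frac{44.5 - 40}{5.657} \approx 0.80$$
Two-tailed critical values: $\pm 1.96$. $0.80 < 1.96$, so do not reject $H_0$.
There is insufficient evidence at the 5% level that the recycling rate differs from 20%.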
If you get this wrong, revise: Binomial Tests with Normal Approximation — Section 9.
Problem 13
In a random sample of 150 voters, 87 support a new policy. (a) Construct a 95% confidence interval for the true proportion of support. (b) Since the interval does not contain 0.5, a politician claims "a majority of voters support the policy." Is this claim justified?
Solution 13
(a) $\hat{p} = 87/150 = 0.58$. Check: $n\hat{p} = 87 \geq 5$ and $n(1 - \hat{p}) = 63 \geq 5$.
$$0.58 \pm 1.96\sqrt{\frac{0.58 \times 0.42}{150}} = 0.58 \pm 0.079, \quad \text{giving } (0.501, 0.659)$$
(b) The 95% CI is $(0.501, 0.659)$. Since the entire interval lies above 0.5, we can reject $H_0: p = 0.5$ at the 5% level. However, the lower bound is only 0.501, so the evidence for a majority is borderline. The claim is technically supported by the test, but the narrow margin should be communicated carefully.
If you get this wrong, revise: Confidence Intervals — Section 10.
Problem 14
A 95% confidence interval for a population mean $\mu$ is . State whether $H_0$ would be rejected or not rejected at the 5% level for each of the following null values: (a) , (b) , (c) . Justify using the connection between confidence intervals and hypothesis tests.
Solution 14
A 95% confidence interval contains exactly those values of $\mu_0$ that would not be rejected by a two-tailed test at the 5% level.
(a) The null value lies inside the interval, so do not reject $H_0$. (b) The null value lies outside the interval, so reject $H_0$. (c) The null value lies outside the interval, so reject $H_0$.
If you get this wrong, revise: Confidence Intervals — Section 10.
Problem 15
A sample of 35 students has mean score 62.4 with known population standard deviation $\sigma = 8$. (a) Find the $p$-value for testing $H_0: \mu = 60$ vs $H_1: \mu > 60$. (b) State your conclusion at the 5% significance level and interpret the $p$-value.
Solution 15
(a) Under $H_0$: $\bar{X} \sim N\left(60, \dfrac{64}{35}\right)$.
$$z = \frac{62.4 - 60}{8/\sqrt{35}} \approx 1.775, \qquad p = P(Z > 1.775) \approx 0.038$$
(b) Since $0.038 < 0.05$, reject $H_0$ at the 5% level. There is sufficient evidence that the true mean score exceeds 60. The $p$-value of 0.038 means that if the true mean were 60, there would be a 3.8% chance of observing a sample mean of 62.4 or higher. This provides moderate evidence against $H_0$.
If you get this wrong, revise: Interpreting p-Values — Section 11.
Problem 16
For a test of vs with , , and : (a) Find the critical value in terms of $\bar{X}$. (b) Find the probability of a Type II error and the power of the test if the true mean is . (c) How would the power change if $\alpha$ were increased to 0.10?
Solution 16
(a) Under $H_0$: , so . Critical value: . Reject $H_0$ if .
(b) Type II error when : .
under the true distribution.
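So $\beta \approx 0.153$ and power $= 1 - \beta \approx 0.847$.
(c) If $\alpha = 0.10$, the critical $z$-value falls from 1.645 to 1.282, so the critical value for $\bar{X}$ moves closer to the null mean.
Power $\approx 0.917$. Increasing $\alpha$ from 0.05 to 0.10 increases the power (from 0.847 to 0.917) but also increases the probability of a Type I error. This illustrates the trade-off between Type I and Type II errors.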
If you get this wrong, revise: Type I and Type II Errors — Section 3.
Diagnostic Test
Ready to test your understanding of Hypothesis Testing? The diagnostic test contains the hardest questions within the A-Level specification for this topic, each with a full worked solution.
Unit tests probe edge cases and common misconceptions. Integration tests combine Hypothesis Testing with other topics to test synthesis under exam conditions.
See Diagnostic Guide for instructions on self-marking and building a personal test matrix.