Hypothesis Testing — Diagnostic Tests
Unit Tests
Tests edge cases, boundary conditions, and common misconceptions for hypothesis testing.
UT-1: One-Tailed vs Two-Tailed Tests and Critical Region Identification
Question:
A machine fills bags of sugar. The weight of sugar in each bag is normally distributed with mean $\mu$ grams and standard deviation 0.5 g. The machine is set to produce bags with mean weight 500 g. A quality control inspector takes a random sample of 16 bags and measures their mean weight.
(a) The inspector wishes to test whether the machine is underfilling the bags. State the null and alternative hypotheses for a suitable test.
(b) Find the critical region for this test at the 5% significance level. Give your answer in terms of the sample mean $\bar{X}$.
(c) A different inspector believes the machine could be overfilling or underfilling, and proposes a two-tailed test. Find the critical region for the two-tailed test at the 5% significance level.
(d) The sample mean is found to be g. Carry out both the one-tailed test from part (a) and the two-tailed test from part (c). Show that they lead to different conclusions, and explain why.
(e) Explain why the one-tailed test has a larger critical region in the tail of interest compared to either tail of the two-tailed test, and discuss the implications for Type I and Type II errors.
[Difficulty: hard. Tests the critical distinction between one-tailed and two-tailed tests and their practical consequences.]
Solution:
(a) Let $\mu$ be the population mean weight of sugar in bags.
$H_0: \mu = 500$ (the machine is filling correctly)
$H_1: \mu < 500$ (the machine is underfilling)
This is a one-tailed test (left-tailed).
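(b) Under $H_0$: $\bar{X} \sim N\left(500, \frac{0.5^2}{16}\right)$.
So $\bar{X} \sim N(500, 0.125^2)$.
The critical region is where $P(\bar{X} < c) = 0.05$ under $H_0$: $c = 500 - 1.645 \times 0.125 = 499.794$.
Critical region: $\bar{X} < 499.79$ g (to 2 d.p.).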
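(c) For the two-tailed test at the 5% level, each tail has significance $2.5\%$.
$H_0: \mu = 500$, $H_1: \mu \neq 500$, with critical $z$-values $\pm 1.96$.
Lower critical value: $500 - 1.96 \times 0.125 = 499.755$
Upper critical value: $500 + 1.96 \times 0.125 = 500.245$
Critical region: $\bar{X} < 499.755$ or $\bar{X} > 500.245$ g (to 3 d.p.).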
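The critical values in parts (b) and (c) can be checked numerically. The following is a minimal sketch using Python's standard-library `statistics.NormalDist`; the function name `critical_values` is illustrative and not part of the original question.

```python
from statistics import NormalDist


def critical_values(mu0, sigma, n, alpha=0.05):
    """Critical values for the sample mean under H0: mu = mu0.

    Returns (one-tailed lower, two-tailed lower, two-tailed upper).
    """
    se = sigma / n ** 0.5          # standard error of the sample mean
    z = NormalDist()               # standard normal
    one_lower = mu0 + z.inv_cdf(alpha) * se
    two_lower = mu0 + z.inv_cdf(alpha / 2) * se
    two_upper = mu0 + z.inv_cdf(1 - alpha / 2) * se
    return one_lower, two_lower, two_upper


one_lower, two_lower, two_upper = critical_values(500, 0.5, 16)
print(f"one-tailed: reject H0 if mean < {one_lower:.3f}")
print(f"two-tailed: reject H0 if mean < {two_lower:.3f} or mean > {two_upper:.3f}")
```

The one-tailed boundary sits closer to 500 than the lower two-tailed boundary, which is the gap that part (d) explores.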
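(d) One-tailed test: $H_0: \mu = 500$, $H_1: \mu < 500$.
Test statistic: $z = \dfrac{\bar{x} - 500}{0.125}$.
Since $z < -1.645$, the test statistic falls in the critical region.
Conclusion (one-tailed): Reject $H_0$. There is sufficient evidence that the machine is underfilling.
Two-tailed test: $H_0: \mu = 500$, $H_1: \mu \neq 500$.
Since $z < -1.96$, the test statistic falls in the lower critical region.
Conclusion (two-tailed): Reject $H_0$. There is sufficient evidence that the mean fill weight differs from 500 g.
In this case, both tests lead to rejection because the evidence is very strong ($\bar{x}$ is far in the tail). However, consider instead a sample mean lying between the two lower critical values, $499.755 < \bar{x} < 499.794$:
- One-tailed: $z = (\bar{x} - 500)/0.125$ satisfies $z < -1.645$, so we reject $H_0$.
- Two-tailed: the same $z$ satisfies $z > -1.96$, so we do not reject $H_0$.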
This shows that the one-tailed test is more powerful for detecting deviations in the specified direction, but it should only be used when there is a genuine prior reason to test in one direction only.
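To make the disagreement between the two tests concrete, here is a small sketch. The sample mean `X_BAR = 499.78` is a hypothetical value chosen only because it falls between the one-tailed and two-tailed lower critical values; it is not taken from the question.

```python
from statistics import NormalDist

MU0, SIGMA, N = 500.0, 0.5, 16
X_BAR = 499.78  # hypothetical sample mean, for illustration only

se = SIGMA / N ** 0.5                  # 0.125
z = (X_BAR - MU0) / se                 # observed test statistic

z_one = NormalDist().inv_cdf(0.05)     # about -1.645
z_two = NormalDist().inv_cdf(0.025)    # about -1.960

reject_one_tailed = z < z_one
reject_two_tailed = abs(z) > abs(z_two)

print(f"z = {z:.2f}")
print("one-tailed test rejects H0:", reject_one_tailed)
print("two-tailed test rejects H0:", reject_two_tailed)
```

With this assumed sample mean the one-tailed test rejects $H_0$ while the two-tailed test does not.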
(e) The one-tailed test allocates the entire 5% significance level to one tail, giving a critical value of $z = -1.645$ for a left-tailed test. The two-tailed test splits the 5% equally between both tails, giving critical values of $z = \pm 1.96$.
In the tail of interest (e.g., the left tail for $H_1: \mu < 500$):
- One-tailed: critical value $z = -1.645$
- Two-tailed: critical value $z = -1.96$
The one-tailed critical region starts closer to the centre of the distribution: it includes values between $z = -1.96$ and $z = -1.645$ that the two-tailed test would not catch.
Implications:
- Type I error rate: Both tests have the same overall Type I error rate (5%). The one-tailed test concentrates this 5% in one direction.
- Type II error rate: The one-tailed test has a lower Type II error rate (higher power) for detecting deviations in the specified direction, because its critical region is larger in that direction.
- Risk: The one-tailed test cannot detect deviations in the opposite direction. If the machine is actually overfilling, the one-tailed test (testing for underfilling) will almost certainly fail to reject $H_0$, committing a Type II error.
UT-2: Type I and Type II Errors — Calculation and Interpretation
Question:
A doctor uses a blood test to diagnose a condition. The test is based on a biomarker level which is normally distributed with mean and standard deviation 8. For healthy individuals, . For individuals with the condition, .
The doctor diagnoses the condition if .
(a) Calculate the probability of a Type I error (diagnosing the condition in a healthy individual).
(b) Calculate the probability of a Type II error (failing to diagnose the condition in an affected individual).
(c) The doctor wants both the Type I error rate and the Type II error rate to be at most 5%. Show that a single threshold value cannot achieve both simultaneously with the given distributions.
(d) The doctor increases the sample size to 4 independent blood tests and uses the mean biomarker level as the test statistic. Find the threshold value such that both the Type I and Type II error rates are at most 5%. If this is not possible, find the threshold that minimises the sum of the two error rates.
[Difficulty: hard. Tests precise calculation of both types of error and the trade-off between them.]
Solution:
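(a) A Type I error occurs when a healthy person is diagnosed with the condition.
For healthy individuals: $X \sim N(\mu_0, 8^2)$, where $\mu_0$ is the healthy mean. The diagnostic threshold lies one standard deviation above $\mu_0$, so $P(\text{Type I error}) = P(Z > 1) = 1 - 0.8413 = 0.1587$.
The probability of a Type I error is approximately 15.9%.
(b) A Type II error occurs when an affected person is not diagnosed.
For affected individuals: $X \sim N(\mu_1, 8^2)$, where $\mu_1$ is the affected mean. The threshold lies half a standard deviation below $\mu_1$, so $P(\text{Type II error}) = P(Z \le -0.5) = 0.3085$.
The probability of a Type II error is approximately 30.9%.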
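(c) Write $c$ for the threshold and $\mu_0$, $\mu_1$ for the healthy and affected means. For the Type I error rate to be at most 5%, we need: $P(X > c \mid \mu = \mu_0) \le 0.05$, i.e. $\dfrac{c - \mu_0}{8} \ge 1.645$.
So $c \ge \mu_0 + 13.16$.
For the Type II error rate to be at most 5%, we need: $P(X \le c \mid \mu = \mu_1) \le 0.05$, i.e. $\dfrac{c - \mu_1}{8} \le -1.645$.
So $c \le \mu_1 - 13.16$.
Both conditions can hold only if $\mu_1 - \mu_0 \ge 26.32$. The error rates in parts (a) and (b) place the threshold 8 above $\mu_0$ and 4 below $\mu_1$, so the means are only 12 apart and the requirements are contradictory: increasing $c$ reduces the Type I error rate but increases the Type II error rate, and vice versa.
Check: at $c = \mu_0 + 13.16$, Type I error $= 0.05$ and Type II error $= P(Z \le 0.145) \approx 0.558$. At $c = \mu_1 - 13.16$, Type II error $= 0.05$ but Type I error $= P(Z > -0.145) \approx 0.558$.
As $c$ increases, Type I error decreases and Type II error increases. There is no value of $c$ where both are at most 5% simultaneously.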
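(d) With $n = 4$ independent tests, $\bar{X} \sim N\left(\mu, \frac{8^2}{4}\right)$, so the standard error is $4$. Write $c$ for the threshold and $\mu_0$, $\mu_1$ for the healthy and affected means; the error rates in parts (a) and (b) imply $\mu_1 - \mu_0 = 12$.
For Type I error $\le 0.05$: $c \ge \mu_0 + 1.645 \times 4 = \mu_0 + 6.58$.
For Type II error $\le 0.05$: $c \le \mu_1 - 1.645 \times 4 = \mu_1 - 6.58 = \mu_0 + 5.42$.
These two conditions are incompatible.
At $c = \mu_0 + 6.58$: Type I error $= 0.05$, and Type II error $= P(Z \le -1.355) \approx 0.0877$.
At $c = \mu_1 - 6.58$: Type I error $= P(Z > 1.355) \approx 0.0877$, Type II error $= 0.05$.
So there is still no value of $c$ where both are $\le 0.05$. Even with 4 samples, the distributions overlap too much.
To minimise the sum of error rates, we set the threshold where the two PDFs cross, which for equal variances is the midpoint of the two means:
At $c = \dfrac{\mu_0 + \mu_1}{2}$:
Type I error $= P(Z > 1.5) = 0.0668$
Type II error $= P(Z < -1.5) = 0.0668$
Sum $= 0.1336$. This is the threshold that minimises the total error probability, giving equal error rates of 6.68% each.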
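The trade-off in parts (c) and (d) can be explored numerically. The sketch below measures the threshold as an offset above the healthy mean and assumes the two means are 12 units apart, which is the separation implied by the quoted error rates of 15.9% and 30.9%; the names `error_rates`, `SEPARATION` and the grid search are illustrative choices, not part of the original solution.

```python
from statistics import NormalDist

SIGMA = 8.0
SEPARATION = 12.0  # assumed gap between affected and healthy means


def error_rates(offset, n=1):
    """Type I and Type II error rates for a threshold `offset` units
    above the healthy mean, using the mean of n independent tests."""
    se = SIGMA / n ** 0.5
    healthy = NormalDist(0.0, se)          # healthy mean taken as origin
    affected = NormalDist(SEPARATION, se)
    type1 = 1 - healthy.cdf(offset)        # healthy person diagnosed
    type2 = affected.cdf(offset)           # affected person missed
    return type1, type2


single = error_rates(8.0)            # single test, threshold from (a) and (b)
midpoint = error_rates(6.0, n=4)     # four tests, threshold at the midpoint

# Grid search over offsets 0.00, 0.01, ..., 12.00 for the smallest total error.
best_sum, best_offset = min(
    (sum(error_rates(k / 100, n=4)), k / 100) for k in range(0, 1201)
)

print("single test (I, II):", single)
print("n = 4 midpoint (I, II):", midpoint)
print("offset minimising the sum:", best_offset)
```

The grid search is only a check on the midpoint argument; for two normal distributions with equal variance the crossing point of the PDFs is the midpoint of the means.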
UT-3: P-value Interpretation and "Accepting" the Null Hypothesis
Question:
A researcher tests whether a new drug reduces blood pressure. Under the null hypothesis $H_0$: the drug has no effect, so the change in blood pressure $D \sim N(0, 15^2)$. The alternative hypothesis is $H_1$: the drug reduces blood pressure, i.e., $D \sim N(\mu, 15^2)$ with $\mu < 0$.
The researcher tests the drug on 25 patients and finds the mean reduction in blood pressure is $5.2$ mmHg.
(a) Calculate the p-value for this test.
(b) The researcher writes in her report: "The p-value is greater than 0.05, so we accept the null hypothesis. The drug has no effect." Identify two errors in this statement and rewrite it correctly.
(c) A colleague argues: "Since the p-value is greater than 0.05, the null hypothesis is probably true." Explain why this is incorrect, making reference to what the p-value actually measures.
(d) The researcher doubles the sample size to 50 patients and finds the same mean reduction of $5.2$ mmHg. Calculate the new p-value and explain why it is smaller even though the effect size (mean reduction) is the same.
[Difficulty: hard. Tests the most commonly misunderstood aspects of hypothesis testing: p-value interpretation and the language of conclusions.]
Solution:
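(a) Under $H_0$: $D \sim N(0, 15^2)$, so $\bar{D} \sim N\left(0, \frac{15^2}{25}\right) = N(0, 3^2)$.
This is a one-tailed test (left-tailed), so the p-value is: $P(\bar{D} \le -5.2) = P\left(Z \le \frac{-5.2}{3}\right) = P(Z \le -1.733)$.
The p-value is approximately 0.0415.
Since $p < 0.05$, the result is statistically significant at the 5% level.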
(b) The researcher's statement contains two errors:
- "Accept the null hypothesis": We never "accept" $H_0$. The correct language is "there is insufficient evidence to reject $H_0$" or "we fail to reject $H_0$." The distinction matters because a lack of evidence against $H_0$ is not the same as evidence for $H_0$. The drug may have a small effect that the test was not powerful enough to detect.
- "The drug has no effect": Failing to reject $H_0$ does not prove that $H_0$ is true. The correct statement is that "the data does not provide sufficient evidence to conclude that the drug reduces blood pressure." The drug might still have an effect; we simply cannot detect it with this sample size.
Corrected statement: "The p-value is [greater/less] than 0.05, so there is [insufficient/sufficient] evidence at the 5% significance level to reject the null hypothesis that the drug has no effect on blood pressure."
(c) The colleague's statement is incorrect because the p-value does not measure the probability that $H_0$ is true. The p-value is:
The probability of obtaining a test statistic at least as extreme as the one observed, assuming $H_0$ is true.
The p-value is a conditional probability: $P(\text{data at least this extreme} \mid H_0 \text{ true})$, not $P(H_0 \text{ true} \mid \text{data})$.
A large p-value means the observed data is consistent with $H_0$, but it does not mean $H_0$ is probably true. The data could also be consistent with a small but non-zero effect. For example, if the true effect is a reduction of 2 mmHg (which is clinically meaningful), a small sample might still produce a large p-value.
To determine $P(H_0 \text{ true} \mid \text{data})$ would require Bayesian methods (prior probabilities), which go beyond the scope of classical hypothesis testing.
(d) With $n = 50$: $\bar{D} \sim N\left(0, \frac{15^2}{50}\right) = N(0, 4.5)$, so $z = \frac{-5.2}{\sqrt{4.5}} = -2.451$ and the p-value is $P(Z \le -2.451)$.
The new p-value is approximately 0.0071, which is much smaller than the original 0.0415.
The p-value decreases because with a larger sample, the standard error $\sigma/\sqrt{n}$ is smaller. The same mean reduction of 5.2 mmHg represents a larger number of standard errors away from the null hypothesis value of 0. This makes the evidence against $H_0$ stronger, even though the observed effect size is the same.
This illustrates a fundamental principle: statistical significance depends on both the effect size and the sample size. With a large enough sample, even a very small effect can produce a statistically significant result.
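The effect of sample size on the p-value can be reproduced with a short calculation. This sketch assumes a population standard deviation of 15 mmHg, which is the value consistent with the p-values quoted in parts (a) and (d); the function name `left_tail_p` is illustrative.

```python
from statistics import NormalDist

SIGMA = 15.0        # assumed population SD (mmHg), inferred from the quoted p-values
MEAN_CHANGE = -5.2  # observed mean change in blood pressure (a reduction)


def left_tail_p(mean_change, sigma, n):
    """One-tailed (left) p-value for H0: mu = 0 with known sigma."""
    se = sigma / n ** 0.5
    return NormalDist().cdf(mean_change / se)


p25 = left_tail_p(MEAN_CHANGE, SIGMA, 25)
p50 = left_tail_p(MEAN_CHANGE, SIGMA, 50)

print(f"n = 25: p = {p25:.4f}")
print(f"n = 50: p = {p50:.4f}")
```

Doubling $n$ shrinks the standard error by a factor of $\sqrt{2}$, so the same observed mean sits further out in the tail.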
Integration Tests
Tests synthesis of hypothesis testing with other topics. Requires combining concepts from multiple units.
IT-1: Choosing the Correct Distribution for a Test (with Statistical Distributions)
Question:
A traffic engineer monitors a busy junction and records the number of vehicles passing through per 10-second interval. Over a long period, the mean number of vehicles per 10-second interval is 8.
After a new traffic light system is installed, the engineer records the number of vehicles in 20 randomly selected 10-second intervals:
(a) The engineer proposes to use a Poisson distribution to model the data. Before the new system, the number of vehicles $X \sim \text{Po}(8)$. State the mean and variance of $X$, and explain why the Poisson distribution is a reasonable model for this scenario.
(b) The engineer wants to test whether the new traffic light system has changed the mean number of vehicles per interval. She decides to use the sample mean. State why using the sample mean directly with a Poisson test is difficult, and explain why a normal approximation would be appropriate.
(c) Using the sample data, calculate the sample mean and carry out a hypothesis test at the 5% significance level to determine whether the mean number of vehicles has decreased. Use the normal approximation to the Poisson distribution.
(d) An alternative approach is to use the total count $T = \sum X_i$ and model $T \sim \text{Po}(160)$ under the null hypothesis. Carry out this test and show that it gives the same conclusion as part (c).
[Difficulty: hard. Combines distribution selection with hypothesis testing using two equivalent approaches.]
Solution:
(a) For $X \sim \text{Po}(8)$: $E(X) = 8$ and $\text{Var}(X) = 8$.
The Poisson distribution is appropriate because:
- Vehicles arrive independently at the junction.
- The rate is approximately constant over time.
- Multiple vehicles can pass through in the same interval (not Bernoulli trials).
- The events (vehicle arrivals) are rare relative to the time scale (individual vehicle arrivals are instantaneous events in a continuous time process).
These are the standard assumptions of the Poisson process.
(b) Testing the sample mean directly with a Poisson distribution is difficult because the sample mean is not itself Poisson-distributed (it can take non-integer values). The sum of independent Poisson random variables is Poisson-distributed, but the Poisson distribution is discrete and the critical values must be found from Poisson cumulative probability tables. With a mean of 8 per interval and 20 intervals, the total is Po(160), which is large and awkward to handle with tables.
The normal approximation is appropriate because:
- The sum of independent Poisson variables is Poisson: $T = \sum_{i=1}^{20} X_i \sim \text{Po}(160)$.
- Since $\lambda$ is large ($\lambda = 160$), the normal approximation is very accurate.
- The sample mean is then approximately $\bar{X} \sim N\left(8, \frac{8}{20}\right)$.
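(c) From the data: $\bar{x} = \frac{134}{20} = 6.7$.
$H_0: \lambda = 8$ (mean number of vehicles per interval is 8)
$H_1: \lambda < 8$ (mean has decreased)
Under $H_0$: $\bar{X} \approx N\left(8, \frac{8}{20}\right) = N(8, 0.4)$, $z = \frac{6.7 - 8}{\sqrt{0.4}} = -2.055$.
p-value $= P(Z \le -2.055) = 0.0199$
Since $0.0199 < 0.05$, we reject $H_0$.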
Conclusion: There is sufficient evidence at the 5% significance level to conclude that the new traffic light system has reduced the mean number of vehicles per 10-second interval.
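(d) The total count is $T = \sum x_i = 134$.
Under $H_0$: $T \sim \text{Po}(160)$, approximated by $Y \sim N(160, 160)$, $\sigma = \sqrt{160} \approx 12.65$.
With continuity correction (since $T$ is discrete and we want $P(T \le 134)$): $P(T \le 134) \approx P(Y < 134.5) = P\left(Z < \frac{134.5 - 160}{12.65}\right) = P(Z < -2.016)$
p-value $= 0.0219$
Since $0.0219 < 0.05$, we reject $H_0$, the same conclusion as part (c).
The slight difference in p-values (0.0199 vs 0.0219) is due to the continuity correction in part (d). Without the continuity correction, part (d) gives $z = \frac{134 - 160}{\sqrt{160}} = -2.055$, exactly matching part (c). This confirms the equivalence: testing the sample mean is mathematically identical to testing the total count, since $T = 20\bar{X}$ and dividing by a constant does not change the significance of the result.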
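The equivalence of the two approaches can be checked directly. The raw counts are not reproduced here, so this sketch assumes a total of 134 vehicles over the 20 intervals, which is the total consistent with the quoted p-values of 0.0199 and 0.0219; variable names are illustrative.

```python
from statistics import NormalDist

LAMBDA = 8    # mean per interval under H0
N = 20        # number of intervals
TOTAL = 134   # assumed total count, inferred from the quoted p-values

z = NormalDist()

# Approach 1: sample mean, approximately N(lambda, lambda / n) under H0.
se_mean = (LAMBDA / N) ** 0.5
p_mean = z.cdf((TOTAL / N - LAMBDA) / se_mean)

# Approach 2: total count, Po(n * lambda) approximated by N(n*lambda, n*lambda).
sd_total = (LAMBDA * N) ** 0.5
p_total = z.cdf((TOTAL - LAMBDA * N) / sd_total)              # no correction
p_total_cc = z.cdf((TOTAL + 0.5 - LAMBDA * N) / sd_total)     # continuity correction

print(f"sample mean:           p = {p_mean:.4f}")
print(f"total (no correction): p = {p_total:.4f}")
print(f"total (corrected):     p = {p_total_cc:.4f}")
```

The first two p-values agree because the total is just the sample mean scaled by 20; only the continuity correction moves the result.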
IT-2: Calculating the Probability of Type II Error (with Probability)
Question:
A manufacturer claims that the mean lifetime of their batteries is 500 hours. A consumer group believes the mean lifetime is less than 500 hours and tests a random sample of 36 batteries. The lifetimes are normally distributed with standard deviation 40 hours.
(a) Find the critical region for a one-tailed test at the 5% significance level.
(b) If the true mean lifetime is actually 480 hours, calculate the probability of a Type II error.
(c) Calculate the power of the test when the true mean is 480 hours.
(d) The consumer group wants the power of the test to be at least 90% when the true mean is 480 hours. Find the minimum sample size required.
[Difficulty: hard. Combines hypothesis testing with probability calculations and optimisation of test power.]
Solution:
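(a) $H_0: \mu = 500$, $H_1: \mu < 500$.
Under $H_0$: $\bar{X} \sim N\left(500, \frac{40^2}{36}\right)$.
Standard error $= \frac{40}{\sqrt{36}} = 6.667$.
Critical value: $500 - 1.645 \times 6.667 = 489.03$
Critical region: $\bar{X} < 489.0$ hours (to 1 d.p.).
We reject $H_0$ if the sample mean is below approximately 489 hours.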
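(b) A Type II error occurs when we fail to reject $H_0$ even though $H_1$ is true (the true mean is 480).
When the true mean is $\mu = 480$: $\bar{X} \sim N(480, 6.667^2)$.
We fail to reject $H_0$ when $\bar{X} \ge 489.03$: $P(\bar{X} \ge 489.03) = P\left(Z \ge \frac{489.03 - 480}{6.667}\right) = P(Z \ge 1.355) = 0.0877$
The probability of a Type II error is approximately 8.8%.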
(c) The power of the test is the probability of correctly rejecting $H_0$ when $H_1$ is true: $\text{Power} = 1 - P(\text{Type II error}) = 1 - 0.0877 = 0.9123$
The power is approximately 91.2% when the true mean is 480 hours.
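Parts (a) to (c) follow one chain of calculation: find the critical value under $H_0$, then evaluate the probability of landing in the critical region under the true mean. A minimal sketch, with the illustrative function name `power_lower_tail`:

```python
from statistics import NormalDist


def power_lower_tail(mu0, mu_true, sigma, n, alpha=0.05):
    """Power of a left-tailed z-test of H0: mu = mu0 when the true mean is mu_true."""
    se = sigma / n ** 0.5
    critical = mu0 + NormalDist().inv_cdf(alpha) * se   # reject H0 if mean < critical
    return NormalDist(mu_true, se).cdf(critical)


power = power_lower_tail(500, 480, 40, 36)
beta = 1 - power   # probability of a Type II error

print(f"power = {power:.4f}, Type II error = {beta:.4f}")
```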
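(d) We need Power $\ge 0.90$ when $\mu = 480$, i.e., $P(\text{Type II error}) \le 0.10$.
Let $n$ be the sample size. Under $H_0$: $\bar{X} \sim N\left(500, \frac{1600}{n}\right)$.
Critical value: $c = 500 - 1.645 \times \frac{40}{\sqrt{n}}$ under $H_0$.
Under $H_1$ ($\mu = 480$): $\bar{X} \sim N\left(480, \frac{1600}{n}\right)$, and the Type II error probability is $P(\bar{X} \ge c)$.
For this probability to be at most 0.10, we need: $\dfrac{c - 480}{40/\sqrt{n}} \ge 1.2816$, i.e. $\dfrac{\sqrt{n}}{2} - 1.645 \ge 1.2816$, so $\sqrt{n} \ge 5.853$ and $n \ge 34.26$.
Since $n$ must be an integer, the minimum sample size is $n = 35$.
Verification: With $n = 35$: standard error $= \frac{40}{\sqrt{35}} = 6.761$.
Critical value $= 500 - 1.645 \times 6.761 = 488.88$.
Power $= P(Z < 1.313) = 0.905 \ge 0.90$.
With $n = 34$: standard error $= \frac{40}{\sqrt{34}} = 6.860$.
Critical value $= 500 - 1.645 \times 6.860 = 488.72$.
Power $= P(Z < 1.271) = 0.898 < 0.90$.
So $n = 35$ is the minimum sample size.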
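The sample-size calculation can be cross-checked in two ways: by the closed-form bound $n \ge \left(\sigma (z_{\alpha} + z_{\beta}) / \delta\right)^2$, and by a direct search over $n$. The sketch below does both; the function names are illustrative.

```python
import math
from statistics import NormalDist


def power_lower_tail(mu0, mu_true, sigma, n, alpha=0.05):
    """Power of a left-tailed z-test of H0: mu = mu0 when the true mean is mu_true."""
    se = sigma / n ** 0.5
    critical = mu0 + NormalDist().inv_cdf(alpha) * se
    return NormalDist(mu_true, se).cdf(critical)


def min_sample_size(mu0, mu_true, sigma, target_power, alpha=0.05):
    """Smallest n whose power reaches target_power (power increases with n)."""
    n = 1
    while power_lower_tail(mu0, mu_true, sigma, n, alpha) < target_power:
        n += 1
    return n


# Closed form: n >= (sigma * (z_alpha + z_beta) / delta)^2
z = NormalDist()
z_alpha = z.inv_cdf(0.95)   # about 1.645
z_beta = z.inv_cdf(0.90)    # about 1.2816
n_bound = (40 * (z_alpha + z_beta) / (500 - 480)) ** 2

n_min = min_sample_size(500, 480, 40, 0.90)
print(f"bound: n >= {n_bound:.2f}, so n = {math.ceil(n_bound)}")
print(f"search: n = {n_min}")
```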
IT-3: Testing Whether a Correlation Coefficient is Significant (with Correlation)
Question:
A psychologist investigates the relationship between hours of sleep ($x$) and reaction time ($y$, in ms) for a random sample of 15 adults. She calculates the product moment correlation coefficient to be $r = -0.52$.
(a) Stating your hypotheses clearly, test at the 5% significance level whether there is evidence of a negative correlation between hours of sleep and reaction time. The critical value for a one-tailed test with at the 5% level is .
(b) Calculate the coefficient of determination $r^2$ and interpret it in the context of this study.
(c) The psychologist fits a regression line of on and obtains . Using this regression line, predict the reaction time for someone who sleeps 8 hours. Explain why this prediction might not be reliable.
(d) A colleague collects data from 40 adults and obtains . Using the fact that the critical value for a one-tailed test with at the 5% level is , test whether there is evidence of negative correlation. Compare the p-values (qualitatively) for the two studies and explain the role of sample size in hypothesis testing for correlation.
[Difficulty: hard. Combines hypothesis testing for correlation with regression and interpretation of $r^2$.]
Solution:
(a) Let $\rho$ be the population correlation coefficient.
$H_0: \rho = 0$ (no linear correlation between sleep and reaction time)
$H_1: \rho < 0$ (negative linear correlation)
Test statistic: $r = -0.52$
Critical value: the value quoted in the question (for a one-tailed test at the 5% level with $n = 15$)
Since $r = -0.52$ is less than the critical value, the test statistic falls in the critical region.
Conclusion: There is sufficient evidence at the 5% significance level to reject $H_0$ and conclude that there is a negative correlation between hours of sleep and reaction time in the population.
This means that as hours of sleep increase, reaction time tends to decrease (i.e., reaction time improves with more sleep), which is consistent with established psychological research.
(b) The coefficient of determination: $r^2 = (-0.52)^2 = 0.2704$
Interpretation: approximately 27.0% of the variation in reaction time can be explained by the linear relationship with hours of sleep. The remaining 73.0% of the variation is due to other factors (e.g., age, caffeine intake, natural ability, time of day).
Note: $r^2$ is always non-negative, regardless of the sign of $r$. The sign of $r$ tells us the direction of the relationship, while $r^2$ tells us the strength.
(c) For hours:
This prediction might not be reliable because:
- Extrapolation: The range of sleep durations in the sample is not stated. If 8 hours lies near or beyond the upper end of the observed range, the regression line may not predict accurately there.
- Weak correlation: With $r^2 = 0.27$, only 27% of the variation is explained. The remaining 73% of variation means there is substantial scatter around the regression line, making individual predictions imprecise.
- Confounding variables: Reaction time depends on many factors other than sleep. The prediction assumes all other factors are at their average values, which may not be the case for any specific individual.
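(d) $H_0: \rho = 0$, $H_1: \rho < 0$.
Test statistic: the sample correlation coefficient $r$ obtained from the 40 adults.
Critical value: the one-tailed 5% value for $n = 40$ quoted in the question.
Since $r$ is less than this critical value, the test statistic falls in the critical region.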
Conclusion: There is sufficient evidence at the 5% level to conclude there is a negative correlation.
Comparison:
Study 1 ($r = -0.52$, $n = 15$): Stronger correlation but smaller sample. The p-value is small but not extremely so, because with only 15 observations a correlation of $-0.52$ could still arise by chance with non-negligible probability.
Study 2 ($n = 40$): Weaker correlation but larger sample. Despite the weaker correlation, the larger sample provides more evidence against $H_0$, and the p-value may be comparable or even smaller.
The relationship between $r$, $n$, and the p-value is governed by the test statistic $t = \dfrac{r\sqrt{n - 2}}{\sqrt{1 - r^2}}$, which follows a $t$-distribution with $n - 2$ degrees of freedom under $H_0$.
For Study 1: $t = \dfrac{-0.52\sqrt{13}}{\sqrt{1 - 0.2704}} \approx -2.195$
For Study 2: the same formula applies with $n = 40$ and that study's observed $r$.
The $t$-values are comparable, confirming that both studies provide similar strength of evidence against $H_0$. The larger sample in Study 2 compensates for the weaker correlation.
This demonstrates that statistical significance depends on both the strength of the effect (correlation) and the sample size. A large sample can detect even a weak but real correlation, while a small sample may fail to detect even a strong correlation.
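The role of sample size can be seen by evaluating the statistic $t = r\sqrt{n - 2} / \sqrt{1 - r^2}$ for a fixed $r$ and different $n$. This sketch uses $r = -0.52$, the value consistent with the 27.0% coefficient of determination in part (b); holding the same $r$ at $n = 40$ is purely illustrative and is not the second study's result.

```python
import math


def correlation_t(r, n):
    """t statistic for testing H0: rho = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


t_small = correlation_t(-0.52, 15)   # Study 1 (r inferred from r^2 = 27.0%)
t_large = correlation_t(-0.52, 40)   # same r at n = 40, for illustration only

print(f"n = 15: t = {t_small:.3f}")
print(f"n = 40: t = {t_large:.3f}")
```

For a fixed correlation, $|t|$ grows roughly like $\sqrt{n}$, which is why a weaker correlation in a larger sample can carry similar evidential weight.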