Skip to main content

Hypothesis Testing — Diagnostic Tests

Unit Tests

Tests edge cases, boundary conditions, and common misconceptions for hypothesis testing.

UT-1: One-Tailed vs Two-Tailed Tests and Critical Region Identification

Question:

A machine fills bags of sugar. The weight of sugar in each bag is normally distributed with mean μ\mu grams and standard deviation 0.5 g. The machine is set to produce bags with mean weight 500 g. A quality control inspector takes a random sample of 16 bags and measures their mean weight.

(a) The inspector wishes to test whether the machine is underfilling the bags. State the null and alternative hypotheses for a suitable test.

(b) Find the critical region for this test at the 5% significance level. Give your answer in terms of the sample mean xˉ\bar{x}.

(c) A different inspector believes the machine could be overfilling or underfilling, and proposes a two-tailed test. Find the critical region for the two-tailed test at the 5% significance level.

(d) The sample mean is found to be xˉ=499.4\bar{x} = 499.4 g. Carry out both the one-tailed test from part (a) and the two-tailed test from part (c). Show that they lead to different conclusions, and explain why.

(e) Explain why the one-tailed test has a larger critical region in the tail of interest compared to either tail of the two-tailed test, and discuss the implications for Type I and Type II errors.

[Difficulty: hard. Tests the critical distinction between one-tailed and two-tailed tests and their practical consequences.]

Solution:

(a) Let μ\mu be the population mean weight of sugar in bags.

H0:μ=500H_0: \mu = 500 (the machine is filling correctly)

H1:μ<500H_1: \mu \lt 500 (the machine is underfilling)

This is a one-tailed test (left-tailed).

(b) Under H0H_0: XˉN(500,0.5216)=N(500,0.2516)=N(500,0.015625)\bar{X} \sim N(500, \frac{0.5^2}{16}) = N(500, \frac{0.25}{16}) = N(500, 0.015625).

So σXˉ=0.015625=0.125\sigma_{\bar{X}} = \sqrt{0.015625} = 0.125.

The critical region is where P(Xˉc)=0.05\mathrm{P}(\bar{X} \leq c) = 0.05 under H0H_0:

P(Zc5000.125)=0.05\mathrm{P}\left(Z \leq \frac{c - 500}{0.125}\right) = 0.05

c5000.125=1.6449\frac{c - 500}{0.125} = -1.6449

c=5001.6449×0.125=5000.2056=499.794c = 500 - 1.6449 \times 0.125 = 500 - 0.2056 = 499.794

Critical region: Xˉ499.79\bar{X} \leq 499.79 g (to 2 d.p.).

(c) For the two-tailed test at the 5% level, each tail has significance 0.052=0.025\frac{0.05}{2} = 0.025.

H0:μ=500H_0: \mu = 500, H1:μ500H_1: \mu \neq 500.

Lower critical value:

P(Zc15000.125)=0.025    c15000.125=1.96\mathrm{P}\left(Z \leq \frac{c_1 - 500}{0.125}\right) = 0.025 \implies \frac{c_1 - 500}{0.125} = -1.96

c1=5001.96×0.125=5000.245=499.755c_1 = 500 - 1.96 \times 0.125 = 500 - 0.245 = 499.755

Upper critical value:

c25000.125=1.96    c2=500+0.245=500.245\frac{c_2 - 500}{0.125} = 1.96 \implies c_2 = 500 + 0.245 = 500.245

Critical region: Xˉ499.76\bar{X} \leq 499.76 or Xˉ500.24\bar{X} \geq 500.24 g (to 2 d.p.).

(d) One-tailed test: xˉ=499.4\bar{x} = 499.4.

Test statistic: z=499.45000.125=0.60.125=4.8z = \frac{499.4 - 500}{0.125} = \frac{-0.6}{0.125} = -4.8.

Since 4.8<1.6449-4.8 \lt -1.6449, the test statistic falls in the critical region.

Conclusion (one-tailed): Reject H0H_0. There is sufficient evidence that the machine is underfilling.

Two-tailed test: z=4.8z = -4.8.

Since 4.8<1.96-4.8 \lt -1.96, the test statistic falls in the lower critical region.

Conclusion (two-tailed): Reject H0H_0. There is sufficient evidence that the mean fill weight differs from 500 g.

In this case, both tests lead to rejection because the evidence is very strong (z=4.8z = -4.8 is far in the tail). However, consider if xˉ=499.77\bar{x} = 499.77 instead:

  • One-tailed: z=499.775000.125=1.84z = \frac{499.77 - 500}{0.125} = -1.84. Since 1.84<1.6449-1.84 \lt -1.6449, we reject H0H_0.
  • Two-tailed: z=1.84z = -1.84. Since 1.96<1.84<1.96-1.96 \lt -1.84 \lt 1.96, we do not reject H0H_0.

This shows that the one-tailed test is more powerful for detecting deviations in the specified direction, but it should only be used when there is a genuine prior reason to test in one direction only.

(e) The one-tailed test allocates the entire 5% significance level to one tail, giving a critical value of z=±1.645z = \pm 1.645. The two-tailed test splits the 5% equally between both tails, giving critical values of z=±1.96z = \pm 1.96.

In the tail of interest (e.g., the left tail for H1:μ<500H_1: \mu \lt 500):

  • One-tailed: critical value z=1.645z = -1.645
  • Two-tailed: critical value z=1.96z = -1.96

The one-tailed critical region extends further into the tail (includes values between z=1.96z = -1.96 and z=1.645z = -1.645 that the two-tailed test would not catch).

Implications:

  • Type I error rate: Both tests have the same overall Type I error rate (5%). The one-tailed test concentrates this 5% in one direction.
  • Type II error rate: The one-tailed test has a lower Type II error rate (higher power) for detecting deviations in the specified direction, because its critical region is larger in that direction.
  • Risk: The one-tailed test cannot detect deviations in the opposite direction. If the machine is actually overfilling, the one-tailed test (testing for underfilling) will almost certainly fail to reject H0H_0, committing a Type II error.

UT-2: Type I and Type II Errors — Calculation and Interpretation

Question:

A doctor uses a blood test to diagnose a condition. The test is based on a biomarker level BB which is normally distributed with mean μ\mu and standard deviation 8. For healthy individuals, μ=50\mu = 50. For individuals with the condition, μ=62\mu = 62.

The doctor diagnoses the condition if B>58B > 58.

(a) Calculate the probability of a Type I error (diagnosing the condition in a healthy individual).

(b) Calculate the probability of a Type II error (failing to diagnose the condition in an affected individual).

(c) The doctor wants both the Type I error rate and the Type II error rate to be at most 5%. Show that a single threshold value cannot achieve both simultaneously with the given distributions.

(d) The doctor increases the sample size to 4 independent blood tests and uses the mean biomarker level Bˉ\bar{B} as the test statistic. Find the threshold value cc such that both the Type I and Type II error rates are at most 5%. If this is not possible, find the threshold that minimises the sum of the two error rates.

[Difficulty: hard. Tests precise calculation of both types of error and the trade-off between them.]

Solution:

(a) A Type I error occurs when a healthy person is diagnosed with the condition.

For healthy individuals: BN(50,82)=N(50,64)B \sim N(50, 8^2) = N(50, 64).

P(Type I error)=P(B>58μ=50)=P(Z>58508)=P(Z>1)\mathrm{P}(\text{Type I error}) = \mathrm{P}(B > 58 \mid \mu = 50) = \mathrm{P}\left(Z > \frac{58 - 50}{8}\right) = \mathrm{P}(Z > 1)

=1Φ(1)=10.8413=0.1587= 1 - \Phi(1) = 1 - 0.8413 = 0.1587

The probability of a Type I error is approximately 15.9%.

(b) A Type II error occurs when an affected person is not diagnosed.

For affected individuals: BN(62,82)=N(62,64)B \sim N(62, 8^2) = N(62, 64).

P(Type II error)=P(B58μ=62)=P(Z58628)=P(Z0.5)\mathrm{P}(\text{Type II error}) = \mathrm{P}(B \leq 58 \mid \mu = 62) = \mathrm{P}\left(Z \leq \frac{58 - 62}{8}\right) = \mathrm{P}(Z \leq -0.5)

=Φ(0.5)=1Φ(0.5)=10.6915=0.3085= \Phi(-0.5) = 1 - \Phi(0.5) = 1 - 0.6915 = 0.3085

The probability of a Type II error is approximately 30.9%.

(c) For the Type I error rate to be at most 5%, we need:

P(B>cμ=50)0.05\mathrm{P}(B > c \mid \mu = 50) \leq 0.05

P(Z>c508)0.05    c5081.6449    c50+13.16=63.16\mathrm{P}\left(Z > \frac{c - 50}{8}\right) \leq 0.05 \implies \frac{c - 50}{8} \geq 1.6449 \implies c \geq 50 + 13.16 = 63.16

So c63.16c \geq 63.16.

For the Type II error rate to be at most 5%, we need:

P(Bcμ=62)0.05\mathrm{P}(B \leq c \mid \mu = 62) \leq 0.05

P(Zc628)0.05    c6281.6449    c62+13.16=75.16\mathrm{P}\left(Z \leq \frac{c - 62}{8}\right) \leq 0.05 \implies \frac{c - 62}{8} \geq 1.6449 \implies c \geq 62 + 13.16 = 75.16

So c75.16c \geq 75.16.

But for the Type I error rate to be at most 5%, we need c63.16c \geq 63.16, while for the Type II error rate to be at most 5%, we need c75.16c \geq 75.16. These are contradictory requirements in the sense that increasing cc reduces the Type I error rate but increases the Type II error rate, and vice versa.

Let me verify: at c=63.16c = 63.16, Type I error =5%= 5\% and Type II error =P(B63.16μ=62)=P(Z0.145)=0.558=55.8%= \mathrm{P}(B \leq 63.16 \mid \mu = 62) = \mathrm{P}(Z \leq 0.145) = 0.558 = 55.8\%. At c=75.16c = 75.16, Type II error =5%= 5\% but Type I error =P(B>75.16μ=50)=P(Z>3.145)0.08%= \mathrm{P}(B > 75.16 \mid \mu = 50) = \mathrm{P}(Z > 3.145) \approx 0.08\%.

As cc increases, Type I error decreases and Type II error increases. There is no value of cc where both are at most 5% simultaneously.

(d) With n=4n = 4 independent tests, BˉN(μ,644)=N(μ,16)\bar{B} \sim N(\mu, \frac{64}{4}) = N(\mu, 16), so σBˉ=4\sigma_{\bar{B}} = 4.

For Type I error 5%\leq 5\%: c50+1.6449×4=56.58c \geq 50 + 1.6449 \times 4 = 56.58

For Type II error 5%\leq 5\%: c62+1.6449×4=68.58c \geq 62 + 1.6449 \times 4 = 68.58

Wait --- for Type II error 5%\leq 5\%: we need P(Bˉcμ=62)0.05\mathrm{P}(\bar{B} \leq c \mid \mu = 62) \leq 0.05:

P(Zc624)0.05    c6241.6449    c68.58\mathrm{P}\left(Z \leq \frac{c - 62}{4}\right) \leq 0.05 \implies \frac{c - 62}{4} \geq 1.6449 \implies c \geq 68.58

But at c=68.58c = 68.58: Type I error =P(Bˉ>68.58μ=50)=P(Z>4.645)0= \mathrm{P}(\bar{B} > 68.58 \mid \mu = 50) = \mathrm{P}(Z > 4.645) \approx 0, and Type II error =5%= 5\%.

At c=56.58c = 56.58: Type I error =5%= 5\%, Type II error =P(Bˉ56.58μ=62)=P(Z1.355)=8.8%= \mathrm{P}(\bar{B} \leq 56.58 \mid \mu = 62) = \mathrm{P}(Z \leq -1.355) = 8.8\%.

Still, there is no value where both are 5%\leq 5\%. Even with 4 samples, the distributions overlap too much.

To minimise the sum of error rates, we set the threshold where the two PDFs cross:

LB1RB◆◆LB4LB2πRB◆◆RBexp((c50)232)=LB1RB◆◆LB4LB2πRB◆◆RBexp((c62)232)\frac◆LB◆1◆RB◆◆LB◆4\sqrt◆LB◆2\pi◆RB◆◆RB◆\exp\left(-\frac{(c-50)^2}{32}\right) = \frac◆LB◆1◆RB◆◆LB◆4\sqrt◆LB◆2\pi◆RB◆◆RB◆\exp\left(-\frac{(c-62)^2}{32}\right)

(c50)2=(c62)2(c - 50)^2 = (c - 62)^2

c50=(c62)    2c=112    c=56c - 50 = -(c - 62) \implies 2c = 112 \implies c = 56

At c=56c = 56:

Type I error =P(Bˉ>56μ=50)=P(Z>1.5)=0.0668= \mathrm{P}(\bar{B} > 56 \mid \mu = 50) = \mathrm{P}(Z > 1.5) = 0.0668

Type II error =P(Bˉ56μ=62)=P(Z1.5)=0.0668= \mathrm{P}(\bar{B} \leq 56 \mid \mu = 62) = \mathrm{P}(Z \leq -1.5) = 0.0668

Sum =0.1336= 0.1336. This is the threshold that minimises the total error probability, giving equal error rates of 6.68% each.


UT-3: P-value Interpretation and "Accepting" the Null Hypothesis

Question:

A researcher tests whether a new drug reduces blood pressure. Under the null hypothesis H0H_0: the drug has no effect, so the change in blood pressure XN(0,152)X \sim N(0, 15^2). The alternative hypothesis is H1H_1: the drug reduces blood pressure, i.e., XN(μ,152)X \sim N(\mu, 15^2) with μ<0\mu \lt 0.

The researcher tests the drug on 25 patients and finds the mean reduction in blood pressure is xˉ=5.2\bar{x} = -5.2 mmHg.

(a) Calculate the p-value for this test.

(b) The researcher writes in her report: "The p-value is greater than 0.05, so we accept the null hypothesis. The drug has no effect." Identify two errors in this statement and rewrite it correctly.

(c) A colleague argues: "Since the p-value is greater than 0.05, the null hypothesis is probably true." Explain why this is incorrect, making reference to what the p-value actually measures.

(d) The researcher doubles the sample size to 50 patients and finds the same mean reduction of xˉ=5.2\bar{x} = -5.2 mmHg. Calculate the new p-value and explain why it is smaller even though the effect size (mean reduction) is the same.

[Difficulty: hard. Tests the most commonly misunderstood aspects of hypothesis testing: p-value interpretation and the language of conclusions.]

Solution:

(a) Under H0H_0: XˉN(0,15225)=N(0,9)\bar{X} \sim N\left(0, \frac{15^2}{25}\right) = N(0, 9), so σXˉ=3\sigma_{\bar{X}} = 3.

z=LBxˉ0RB◆◆LB3RB=5.23=1.733z = \frac◆LB◆\bar{x} - 0◆RB◆◆LB◆3◆RB◆ = \frac{-5.2}{3} = -1.733

This is a one-tailed test (left-tailed), so the p-value is:

p-value=P(Z1.733)=Φ(1.733)=1Φ(1.733)=10.9586=0.0414\text{p-value} = \mathrm{P}(Z \leq -1.733) = \Phi(-1.733) = 1 - \Phi(1.733) = 1 - 0.9586 = 0.0414

The p-value is approximately 0.0414.

Since 0.0414<0.050.0414 \lt 0.05, the result is statistically significant at the 5% level.

(b) The researcher's statement contains two errors:

  1. "Accept the null hypothesis": We never "accept" H0H_0. The correct language is "there is insufficient evidence to reject H0H_0" or "we fail to reject H0H_0." The distinction matters because a lack of evidence against H0H_0 is not the same as evidence for H0H_0. The drug may have a small effect that the test was not powerful enough to detect.

  2. "The drug has no effect": Failing to reject H0H_0 does not prove that H0H_0 is true. The correct statement is that "the data does not provide sufficient evidence to conclude that the drug reduces blood pressure." The drug might still have an effect; we simply cannot detect it with this sample size.

Corrected statement: "The p-value is [greater/less] than 0.05, so there is [insufficient/sufficient] evidence at the 5% significance level to reject the null hypothesis that the drug has no effect on blood pressure."

(c) The colleague's statement is incorrect because the p-value does not measure the probability that H0H_0 is true. The p-value is:

The probability of obtaining a test statistic at least as extreme as the one observed, assuming H0H_0 is true.

The p-value is a conditional probability: P(dataH0)\mathrm{P}(\text{data} \mid H_0), not P(H0data)\mathrm{P}(H_0 \mid \text{data}).

A large p-value means the observed data is consistent with H0H_0, but it does not mean H0H_0 is probably true. The data could also be consistent with a small but non-zero effect. For example, if the true effect is a reduction of 2 mmHg (which is clinically meaningful), a small sample might still produce a large p-value.

To determine P(H0data)\mathrm{P}(H_0 \mid \text{data}) would require Bayesian methods (prior probabilities), which go beyond the scope of classical hypothesis testing.

(d) With n=50n = 50: XˉN(0,15250)=N(0,4.5)\bar{X} \sim N\left(0, \frac{15^2}{50}\right) = N(0, 4.5), so σXˉ=4.52.121\sigma_{\bar{X}} = \sqrt{4.5} \approx 2.121.

z=LB5.2RB◆◆LB4.5RB=5.22.121=2.451z = \frac◆LB◆-5.2◆RB◆◆LB◆\sqrt{4.5}◆RB◆ = \frac{-5.2}{2.121} = -2.451

p-value=P(Z2.451)=1Φ(2.451)=10.9929=0.0071\text{p-value} = \mathrm{P}(Z \leq -2.451) = 1 - \Phi(2.451) = 1 - 0.9929 = 0.0071

The new p-value is approximately 0.0071, which is much smaller than the original 0.0414.

The p-value decreases because with a larger sample, the standard error LBσRB◆◆LBnRB\frac◆LB◆\sigma◆RB◆◆LB◆\sqrt{n}◆RB◆ is smaller. The same mean reduction of 5.2 mmHg represents a larger number of standard errors away from the null hypothesis value of 0. This makes the evidence against H0H_0 stronger, even though the observed effect size is the same.

This illustrates a fundamental principle: statistical significance depends on both the effect size and the sample size. With a large enough sample, even a very small effect can produce a statistically significant result.


Integration Tests

Tests synthesis of hypothesis testing with other topics. Requires combining concepts from multiple units.

IT-1: Choosing the Correct Distribution for a Test (with Statistical Distributions)

Question:

A traffic engineer monitors a busy junction and records the number of vehicles passing through per 10-second interval. Over a long period, the mean number of vehicles per 10-second interval is 8.

After a new traffic light system is installed, the engineer records the number of vehicles in 20 randomly selected 10-second intervals:

6,7,5,8,4,9,6,7,5,8,6,7,4,9,5,8,6,7,5,86, 7, 5, 8, 4, 9, 6, 7, 5, 8, 6, 7, 4, 9, 5, 8, 6, 7, 5, 8

(a) The engineer proposes to use a Poisson distribution to model the data. Before the new system, the number of vehicles XPo(8)X \sim \text{Po}(8). State the mean and variance of XX, and explain why the Poisson distribution is a reasonable model for this scenario.

(b) The engineer wants to test whether the new traffic light system has changed the mean number of vehicles per interval. She decides to use the sample mean. State why using the sample mean directly with a Poisson test is difficult, and explain why a normal approximation would be appropriate.

(c) Using the sample data, calculate the sample mean xˉ\bar{x} and carry out a hypothesis test at the 5% significance level to determine whether the mean number of vehicles has decreased. Use the normal approximation to the Poisson distribution.

(d) An alternative approach is to use the total count T=xiT = \sum x_i and model TPo(20λ)T \sim \text{Po}(20\lambda) under the null hypothesis. Carry out this test and show that it gives the same conclusion as part (c).

[Difficulty: hard. Combines distribution selection with hypothesis testing using two equivalent approaches.]

Solution:

(a) For XPo(8)X \sim \text{Po}(8):

E(X)=8,Var(X)=8\mathrm{E}(X) = 8, \quad \mathrm{Var}(X) = 8

The Poisson distribution is appropriate because:

  • Vehicles arrive independently at the junction.
  • The rate is approximately constant over time.
  • Multiple vehicles can pass through in the same interval (not Bernoulli trials).
  • The events (vehicle arrivals) are rare relative to the time scale (individual vehicle arrivals are instantaneous events in a continuous time process).

These are the standard assumptions of the Poisson process.

(b) Testing the sample mean directly with a Poisson distribution is difficult because the sum (or mean) of independent Poisson random variables is also Poisson-distributed, but the Poisson distribution is discrete and the critical values must be found from Poisson cumulative probability tables. With a mean of 8 per interval and 20 intervals, the total is Po(160), which is large and would require interpolation in tables.

The normal approximation is appropriate because:

  • The sum of independent Poisson variables is Poisson: TPo(160)T \sim \text{Po}(160).
  • Since λ=160\lambda = 160 is large (λ>10\lambda > 10), the normal approximation TN(160,160)T \approx N(160, 160) is very accurate.
  • The sample mean Xˉ=Tn\bar{X} = \frac{T}{n} is then approximately N(λ,LBλRB◆◆LBnRB)=N(8,160400)=N(8,0.4)N\left(\lambda, \frac◆LB◆\lambda◆RB◆◆LB◆n◆RB◆\right) = N\left(8, \frac{160}{400}\right) = N(8, 0.4).

(c) From the data:

xˉ=6+7+5+8+4+9+6+7+5+8+6+7+4+9+5+8+6+7+5+820=13420=6.7\bar{x} = \frac{6+7+5+8+4+9+6+7+5+8+6+7+4+9+5+8+6+7+5+8}{20} = \frac{134}{20} = 6.7

H0:λ=8H_0: \lambda = 8 (mean number of vehicles per interval is 8)

H1:λ<8H_1: \lambda \lt 8 (mean has decreased)

Under H0H_0: XˉN(8,0.4)\bar{X} \approx N(8, 0.4), σ=0.40.6325\sigma = \sqrt{0.4} \approx 0.6325.

z=6.780.6325=1.30.6325=2.056z = \frac{6.7 - 8}{0.6325} = \frac{-1.3}{0.6325} = -2.056

p-value =P(Z<2.056)=1Φ(2.056)=10.9801=0.0199= \mathrm{P}(Z \lt -2.056) = 1 - \Phi(2.056) = 1 - 0.9801 = 0.0199

Since 0.0199<0.050.0199 \lt 0.05, we reject H0H_0.

Conclusion: There is sufficient evidence at the 5% significance level to conclude that the new traffic light system has reduced the mean number of vehicles per 10-second interval.

(d) The total count is T=134T = 134.

Under H0H_0: TPo(160)T \sim \text{Po}(160), approximated by N(160,160)N(160, 160), σ=16012.649\sigma = \sqrt{160} \approx 12.649.

With continuity correction (since TT is discrete and we want P(T134)\mathrm{P}(T \leq 134)):

z=LB134.5160RB◆◆LB160RB=25.512.649=2.016z = \frac◆LB◆134.5 - 160◆RB◆◆LB◆\sqrt{160}◆RB◆ = \frac{-25.5}{12.649} = -2.016

p-value =P(Z<2.016)=1Φ(2.016)=10.9781=0.0219= \mathrm{P}(Z \lt -2.016) = 1 - \Phi(2.016) = 1 - 0.9781 = 0.0219

Since 0.0219<0.050.0219 \lt 0.05, we reject H0H_0 --- the same conclusion as part (c).

The slight difference in p-values (0.0199 vs 0.0219) is due to the continuity correction in part (d). Without the continuity correction, part (d) gives z=LB134160RB◆◆LB160RB=2.056z = \frac◆LB◆134 - 160◆RB◆◆LB◆\sqrt{160}◆RB◆ = -2.056, exactly matching part (c). This confirms the equivalence: testing the sample mean is mathematically identical to testing the total count, since Xˉ=Tn\bar{X} = \frac{T}{n} and dividing by a constant nn does not change the significance of the result.


IT-2: Calculating the Probability of Type II Error (with Probability)

Question:

A manufacturer claims that the mean lifetime of their batteries is 500 hours. A consumer group believes the mean lifetime is less than 500 hours and tests a random sample of 36 batteries. The lifetimes are normally distributed with standard deviation 40 hours.

(a) Find the critical region for a one-tailed test at the 5% significance level.

(b) If the true mean lifetime is actually 480 hours, calculate the probability of a Type II error.

(c) Calculate the power of the test when the true mean is 480 hours.

(d) The consumer group wants the power of the test to be at least 90% when the true mean is 480 hours. Find the minimum sample size required.

[Difficulty: hard. Combines hypothesis testing with probability calculations and optimisation of test power.]

Solution:

(a) H0:μ=500H_0: \mu = 500, H1:μ<500H_1: \mu \lt 500.

Under H0H_0: XˉN(500,40236)=N(500,160036)=N(500,44.44)\bar{X} \sim N\left(500, \frac{40^2}{36}\right) = N(500, \frac{1600}{36}) = N(500, 44.44).

σXˉ=406=203\sigma_{\bar{X}} = \frac{40}{6} = \frac{20}{3}.

Critical value:

P(Xˉc)=0.05    c50020/3=1.6449\mathrm{P}(\bar{X} \leq c) = 0.05 \implies \frac{c - 500}{20/3} = -1.6449

c=5001.6449×203=50010.966=489.03c = 500 - 1.6449 \times \frac{20}{3} = 500 - 10.966 = 489.03

Critical region: Xˉ489.0\bar{X} \leq 489.0 hours (to 1 d.p.).

We reject H0H_0 if the sample mean is at most approximately 489 hours.

(b) A Type II error occurs when we fail to reject H0H_0 even though H1H_1 is true (the true mean is 480).

When the true mean is μ=480\mu = 480:

XˉN(480,160036)=N(480,44.44)\bar{X} \sim N\left(480, \frac{1600}{36}\right) = N(480, 44.44)

We fail to reject H0H_0 when Xˉ>489.03\bar{X} > 489.03:

P(Type II error)=P(Xˉ>489.03μ=480)=P(Z>489.0348020/3)\mathrm{P}(\text{Type II error}) = \mathrm{P}(\bar{X} > 489.03 \mid \mu = 480) = \mathrm{P}\left(Z > \frac{489.03 - 480}{20/3}\right)

=P(Z>LB9.03×3RB◆◆LB20RB)=P(Z>1.355)= \mathrm{P}\left(Z > \frac◆LB◆9.03 \times 3◆RB◆◆LB◆20◆RB◆\right) = \mathrm{P}(Z > 1.355)

=1Φ(1.355)=10.9123=0.0877= 1 - \Phi(1.355) = 1 - 0.9123 = 0.0877

The probability of a Type II error is approximately 8.8%.

(c) The power of the test is the probability of correctly rejecting H0H_0 when H1H_1 is true:

Power=1P(Type II error)=10.0877=0.9123\text{Power} = 1 - \mathrm{P}(\text{Type II error}) = 1 - 0.0877 = 0.9123

The power is approximately 91.2% when the true mean is 480 hours.

(d) We need Power 0.90\geq 0.90 when μ=480\mu = 480, i.e., P(Type II error)0.10\mathrm{P}(\text{Type II error}) \leq 0.10.

Let nn be the sample size. Under H0H_0: XˉN(500,1600n)\bar{X} \sim N(500, \frac{1600}{n}).

Critical value: P(Xˉc)=0.05\mathrm{P}(\bar{X} \leq c) = 0.05 under H0H_0:

c=5001.6449×LB40RB◆◆LBnRBc = 500 - 1.6449 \times \frac◆LB◆40◆RB◆◆LB◆\sqrt{n}◆RB◆

Under H1H_1 (μ=480\mu = 480): XˉN(480,1600n)\bar{X} \sim N(480, \frac{1600}{n}).

P(Type II error)=P(Xˉ>5001.6449×LB40RB◆◆LBnRB  |  μ=480)0.10\mathrm{P}(\text{Type II error}) = \mathrm{P}\left(\bar{X} > 500 - 1.6449 \times \frac◆LB◆40◆RB◆◆LB◆\sqrt{n}◆RB◆ \;\middle|\; \mu = 480\right) \leq 0.10

P(Z>LB5001.6449×40/n480RB◆◆LB40/nRB)0.10\mathrm{P}\left(Z > \frac◆LB◆500 - 1.6449 \times 40/\sqrt{n} - 480◆RB◆◆LB◆40/\sqrt{n}◆RB◆\right) \leq 0.10

P(Z>LB20n1.6449×40RB◆◆LB40RB)0.10\mathrm{P}\left(Z > \frac◆LB◆20\sqrt{n} - 1.6449 \times 40◆RB◆◆LB◆40◆RB◆\right) \leq 0.10

P(Z>LB20n65.796RB◆◆LB40RB)0.10\mathrm{P}\left(Z > \frac◆LB◆20\sqrt{n} - 65.796◆RB◆◆LB◆40◆RB◆\right) \leq 0.10

For this probability to be at most 0.10, we need:

LB20n65.796RB◆◆LB40RB1.2816\frac◆LB◆20\sqrt{n} - 65.796◆RB◆◆LB◆40◆RB◆ \geq 1.2816

20n65.79651.26420\sqrt{n} - 65.796 \geq 51.264

20n117.0620\sqrt{n} \geq 117.06

n5.853\sqrt{n} \geq 5.853

n34.26n \geq 34.26

Since nn must be an integer, the minimum sample size is n=35n = 35.

Verification: With n=35n = 35: σXˉ=LB40RB◆◆LB35RB=6.761\sigma_{\bar{X}} = \frac◆LB◆40◆RB◆◆LB◆\sqrt{35}◆RB◆ = 6.761.

Critical value =5001.6449×6.761=50011.12=488.88= 500 - 1.6449 \times 6.761 = 500 - 11.12 = 488.88.

P(Type II errorμ=480)=P(Z>488.884806.761)=P(Z>1.314)=0.0944\mathrm{P}(\text{Type II error} \mid \mu = 480) = \mathrm{P}\left(Z > \frac{488.88 - 480}{6.761}\right) = \mathrm{P}(Z > 1.314) = 0.0944

Power =10.0944=0.9056=90.6%90%= 1 - 0.0944 = 0.9056 = 90.6\% \geq 90\%.

With n=34n = 34: σXˉ=LB40RB◆◆LB34RB=6.860\sigma_{\bar{X}} = \frac◆LB◆40◆RB◆◆LB◆\sqrt{34}◆RB◆ = 6.860.

Critical value =5001.6449×6.860=488.71= 500 - 1.6449 \times 6.860 = 488.71.

P(Type II errorμ=480)=P(Z>488.714806.860)=P(Z>1.270)=0.1020\mathrm{P}(\text{Type II error} \mid \mu = 480) = \mathrm{P}\left(Z > \frac{488.71 - 480}{6.860}\right) = \mathrm{P}(Z > 1.270) = 0.1020

Power =89.8%<90%= 89.8\% \lt 90\%.

So n=35n = 35 is the minimum sample size.


IT-3: Testing Whether a Correlation Coefficient is Significant (with Correlation)

Question:

A psychologist investigates the relationship between hours of sleep (SS) and reaction time (RR, in ms) for a random sample of 15 adults. She calculates the product moment correlation coefficient to be r=0.52r = -0.52.

(a) Stating your hypotheses clearly, test at the 5% significance level whether there is evidence of a negative correlation between hours of sleep and reaction time. The critical value for a one-tailed test with n=15n = 15 at the 5% level is 0.441-0.441.

(b) Calculate the coefficient of determination r2r^2 and interpret it in the context of this study.

(c) The psychologist fits a regression line of RR on SS and obtains R=58028SR = 580 - 28S. Using this regression line, predict the reaction time for someone who sleeps 8 hours. Explain why this prediction might not be reliable.

(d) A colleague collects data from 40 adults and obtains r=0.31r = -0.31. Using the fact that the critical value for a one-tailed test with n=40n = 40 at the 5% level is 0.257-0.257, test whether there is evidence of negative correlation. Compare the p-values (qualitatively) for the two studies and explain the role of sample size in hypothesis testing for correlation.

[Difficulty: hard. Combines hypothesis testing for correlation with regression and interpretation of r2r^2.]

Solution:

(a) Let ρ\rho be the population correlation coefficient.

H0:ρ=0H_0: \rho = 0 (no linear correlation between sleep and reaction time)

H1:ρ<0H_1: \rho \lt 0 (negative linear correlation)

Test statistic: r=0.52r = -0.52

Critical value: 0.441-0.441 (for a one-tailed test at the 5% level with n=15n = 15)

Since r=0.52<0.441r = -0.52 \lt -0.441, the test statistic falls in the critical region.

Conclusion: There is sufficient evidence at the 5% significance level to reject H0H_0 and conclude that there is evidence of a negative correlation between hours of sleep and reaction time in the population.

This means that as hours of sleep increase, reaction time tends to decrease (i.e., reaction time improves with more sleep), which is consistent with established psychological research.

(b) The coefficient of determination:

r2=(0.52)2=0.2704r^2 = (-0.52)^2 = 0.2704

Interpretation: approximately 27.0% of the variation in reaction time can be explained by the linear relationship with hours of sleep. The remaining 73.0% of the variation is due to other factors (e.g., age, caffeine intake, natural ability, time of day).

Note: r2r^2 is always non-negative, regardless of the sign of rr. The sign of rr tells us the direction of the relationship, while r2r^2 tells us the strength.

(c) For S=8S = 8 hours:

R=58028(8)=580224=356 msR = 580 - 28(8) = 580 - 224 = 356 \text{ ms}

This prediction might not be reliable because:

  1. Extrapolation: If the original data was collected from adults with, say, 4--9 hours of sleep, then 8 hours might be near the upper end of the data range. The regression line may not accurately predict beyond the observed range.

  2. Weak correlation: With r2=0.27r^2 = 0.27, only 27% of the variation is explained. The remaining 73% of variation means there is substantial scatter around the regression line, making individual predictions imprecise.

  3. Confounding variables: Reaction time depends on many factors other than sleep. The prediction assumes all other factors are at their average values, which may not be the case for any specific individual.

(d) H0:ρ=0H_0: \rho = 0, H1:ρ<0H_1: \rho \lt 0.

Test statistic: r=0.31r = -0.31.

Critical value: 0.257-0.257.

Since 0.31<0.257-0.31 \lt -0.257, the test statistic falls in the critical region.

Conclusion: There is sufficient evidence at the 5% level to conclude there is a negative correlation.

Comparison:

Study 1 (n=15n = 15, r=0.52r = -0.52): Stronger correlation but smaller sample. The p-value is small but not extremely so, because with only 15 observations, a correlation of 0.52-0.52 could plausibly occur by chance with moderate probability.

Study 2 (n=40n = 40, r=0.31r = -0.31): Weaker correlation but larger sample. Despite the weaker correlation, the larger sample provides more evidence against H0H_0, and the p-value may be comparable or even smaller.

The relationship between rr, nn, and the p-value is governed by the test statistic t=rLBn21r2RBt = r\sqrt◆LB◆\frac{n-2}{1-r^2}◆RB◆, which follows a tt-distribution with n2n-2 degrees of freedom under H0H_0.

For Study 1: t=0.52LB1310.2704RB=0.5217.81=0.52×4.22=2.19t = -0.52\sqrt◆LB◆\frac{13}{1-0.2704}◆RB◆ = -0.52\sqrt{17.81} = -0.52 \times 4.22 = -2.19

For Study 2: t=0.31LB3810.0961RB=0.3142.06=0.31×6.49=2.01t = -0.31\sqrt◆LB◆\frac{38}{1-0.0961}◆RB◆ = -0.31\sqrt{42.06} = -0.31 \times 6.49 = -2.01

The tt-values are comparable (2.19-2.19 vs 2.01-2.01), confirming that both studies provide similar strength of evidence against H0H_0. The larger sample in Study 2 compensates for the weaker correlation.

This demonstrates that statistical significance depends on both the strength of the effect (correlation) and the sample size. A large sample can detect even a weak but real correlation, while a small sample may fail to detect even a strong correlation.