Statistical Distributions — Diagnostic Tests
Unit Tests
Tests edge cases, boundary conditions, and common misconceptions for statistical distributions.
UT-1: Binomial Distribution — Identifying and Applying the Correct Model
Question:
For each of the following scenarios, determine whether the binomial distribution is appropriate. If it is, state the values of $n$ and $p$. If it is not, identify which binomial condition is violated and name the correct distribution (if one has been covered).
(a) A bag contains 5 red and 3 blue balls. Balls are drawn one at a time without replacement until a red ball is drawn. $X$ is the number of blue balls drawn before the first red ball.
(b) A machine produces components, and 2% are defective. Components are packed in boxes of 50. $X$ is the number of defective components in a randomly selected box.
(c) A fair coin is tossed repeatedly until 3 heads have been obtained. $X$ is the total number of tosses required.
(d) A biased die is rolled 100 times. The probability of rolling a 6 is . $X$ is the number of times a 6 is rolled in the first 50 rolls.
(e) A student answers 10 multiple choice questions, each with 4 options. For the first 5 questions, she knows the answer. For the last 5, she guesses randomly. $X$ is the total number of correct answers.
[Difficulty: hard. Tests precise identification of binomial conditions, which is the most common source of error in distribution questions.]
Solution:
(a) The binomial distribution is NOT appropriate. The condition violated is independence of trials: since balls are drawn without replacement, the probability of drawing a red ball changes after each draw. The probability of red on the first draw is $\frac{5}{8}$, but if the first ball is blue, the probability of red on the second draw becomes $\frac{5}{7}$.
The geometric distribution is not appropriate either, because it also requires independent trials with a constant probability of success. The number of failures before the first success when sampling without replacement follows a negative hypergeometric distribution; in practice, "number of blue balls before the first red" without replacement is best handled by a direct probability calculation for each value of $X$.
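The direct calculation mentioned above can be sketched in a few lines. This is a minimal illustration, not part of the original solution; the function name `blue_before_first_red` is hypothetical, and exact fractions are used so the probabilities can be checked by hand.

```python
from fractions import Fraction

def blue_before_first_red(red, blue):
    """PMF of X = number of blue balls drawn before the first red ball,
    drawing one ball at a time without replacement."""
    pmf = {}
    p_all_blue_so_far = Fraction(1)  # probability that the first k draws were all blue
    for k in range(blue + 1):
        remaining = red + blue - k
        # k blues already drawn, then a red ball on draw k + 1
        pmf[k] = p_all_blue_so_far * Fraction(red, remaining)
        # extend the run of blues by one more draw for the next iteration
        p_all_blue_so_far *= Fraction(blue - k, remaining)
    return pmf

pmf = blue_before_first_red(red=5, blue=3)
for k, prob in pmf.items():
    print(k, prob)
```

The probabilities are 5/8, 15/56, 5/56 and 1/56 for $X = 0, 1, 2, 3$, and they sum to 1. The success probability changes from draw to draw (5/8, then 5/7, then 5/6), which is exactly why neither the binomial nor the geometric model applies.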
(b) The binomial distribution IS appropriate.
- Fixed number of trials: $n = 50$ components per box.
- Two outcomes: defective or not defective.
- Constant probability: $p = 0.02$ (assuming the defect rate is constant and independent between components, which is reasonable for a large production process).
- Independent trials: the defect status of one component does not affect another.
So $X \sim B(50, 0.02)$.
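As a quick illustration of what this model gives, the sketch below evaluates the binomial probability mass function for 50 components with a 2% defect rate. The helper name `binomial_pmf` is an assumption for this example, not something defined in the original.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ B(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 50, 0.02
p_none = binomial_pmf(0, n, p)                                 # no defective components
p_at_most_two = sum(binomial_pmf(k, n, p) for k in range(3))   # P(X <= 2)
mean = sum(k * binomial_pmf(k, n, p) for k in range(n + 1))    # should equal n * p

print(f"P(X = 0)  = {p_none:.4f}")
print(f"P(X <= 2) = {p_at_most_two:.4f}")
print(f"E(X)      = {mean:.4f}")
```

The computed mean agrees with $np = 50 \times 0.02 = 1$, a useful sanity check that the parameters have been identified correctly.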
(c) The binomial distribution is NOT appropriate. The condition violated is fixed number of trials: the number of tosses is not fixed in advance; it depends on when the third head occurs.
The correct distribution is the negative binomial distribution (or Pascal distribution). If we define $X$ as the number of trials until the $r$-th success, then $X \sim \text{NB}(r, p)$; here $r = 3$ and $p = 0.5$.
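For reference, the negative binomial probabilities can be computed directly. The sketch below assumes the "number of trials until the $r$-th success" convention with $r = 3$ and $p = 0.5$ for the fair coin; the function name `neg_binomial_pmf` is hypothetical.

```python
from math import comb

def neg_binomial_pmf(n, r, p):
    """P(X = n), where X is the number of trials needed to obtain the r-th success.
    The last trial must be a success, and the first n - 1 trials contain r - 1 successes."""
    if n < r:
        return 0.0
    return comb(n - 1, r - 1) * p**r * (1 - p) ** (n - r)

r, p = 3, 0.5
for n in range(3, 8):
    print(n, neg_binomial_pmf(n, r, p))

# The support is unbounded, so truncate the sums at a large n for a numerical check.
total = sum(neg_binomial_pmf(n, r, p) for n in range(r, 200))
mean = sum(n * neg_binomial_pmf(n, r, p) for n in range(r, 200))
```

Unlike the binomial case, $X$ has no upper limit, which reflects the violated "fixed number of trials" condition. The truncated mean is close to $r/p = 6$.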
(d) The binomial distribution IS appropriate.
- Fixed number of trials: $n = 50$ (the first 50 rolls).
- Two outcomes: 6 or not 6.
- Constant probability: the probability $p$ of rolling a 6 is the same for each roll.
- Independent trials: die rolls are independent.
So $X \sim B(50, p)$, where $p$ is the given probability of rolling a 6.
The fact that there are 100 total rolls is irrelevant: we are only considering the first 50.
(e) The binomial distribution is NOT appropriate. The condition violated is constant probability of success: for the first 5 questions, $p = 1$, but for the last 5 questions, $p = \frac{1}{4}$. The probability of success changes partway through.
The correct approach is to split $X = X_1 + X_2$, where $X_1 = 5$ (deterministic: always 5) and $X_2 \sim B(5, 0.25)$. Then $E(X) = 5 + 5(0.25) = 6.25$ and $\text{Var}(X) = 5(0.25)(0.75) = 0.9375$.
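The split into a known part and a guessed part can be checked numerically. This is a small sketch under the assumption that each guess is correct with probability $\frac{1}{4}$ (one of 4 options); `total_correct_pmf` is a hypothetical helper name.

```python
from math import comb

def total_correct_pmf(k):
    """P(X = k) for X = 5 + X2, where X2 ~ B(5, 0.25) counts the correct guesses."""
    guessed = k - 5  # correct answers that must come from the 5 guessed questions
    if not 0 <= guessed <= 5:
        return 0.0
    return comb(5, guessed) * 0.25**guessed * 0.75 ** (5 - guessed)

mean = sum(k * total_correct_pmf(k) for k in range(11))
variance = sum((k - mean) ** 2 * total_correct_pmf(k) for k in range(11))
print(mean, variance)
```

Scores below 5 have probability zero, which no single binomial distribution on 10 trials with $0 < p < 1$ could reproduce.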
UT-2: Normal Distribution — Continuity Correction and Sign Errors
Question:
The random variable follows a normal distribution . (Note: the second parameter is the variance, so .)
(a) Find .
(b) The random variable $Y$ follows a binomial distribution $B(80, 0.6)$. Using a normal approximation with continuity correction, approximate .
(c) A student uses the normal approximation to approximate and writes . Identify the two errors in this working and provide the correct calculation.
(d) Verify that the normal approximation is appropriate for this binomial distribution by checking the criteria $np > 5$ and $nq > 5$, where $q = 1 - p$.
[Difficulty: hard. Tests the most error-prone aspects of normal approximation: continuity correction and sign errors.]
Solution:
(a) , so , .
Standardising:
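(b) $Y \sim B(80, 0.6)$, so $\mu = np = 48$ and $\sigma^2 = npq = 19.2$, giving the approximation $Y \approx N(48, 19.2)$.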
Continuity correction: Since is discrete and we want , we use for the normal approximation.
(c) The student has made two errors:
- No continuity correction: The student used 50 instead of 50.5. For a discrete distribution approximated by a continuous one, the boundary value should be 50.5, not 50.
- Missing the subtraction in the numerator: The student wrote , which is actually correct for the standardisation formula (though without the continuity correction). However, if the student had meant to compute and wrote , that would be a sign error. The standardisation is $z = \frac{x - \mu}{\sigma}$, so the numerator must be $50 - 48$ (or with continuity correction, $50.5 - 48$). The student's version gives , which underestimates the correct $z$-value of 0.5705.
The correct calculation is:
(d) Checking the criteria:
$$np = 80 \times 0.6 = 48 > 5, \qquad nq = 80 \times 0.4 = 32 > 5$$
Both criteria are satisfied, so the normal approximation is appropriate. The approximation will be good because both $np$ and $nq$ are well above 5 (they are 48 and 32 respectively).
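The criteria check and the effect of the continuity correction can be reproduced with a short sketch. It assumes $n = 80$ and $p = 0.6$, the values implied by $np = 48$ and $nq = 32$, and a boundary of 50 as in part (c); the variable names are illustrative only.

```python
from math import sqrt

n, p = 80, 0.6              # assumed: implied by np = 48 and nq = 32
q = 1 - p
mu = n * p                  # mean of the approximating normal
sigma = sqrt(n * p * q)     # standard deviation, sqrt(19.2)

criteria_ok = n * p > 5 and n * q > 5

z_uncorrected = (50 - mu) / sigma    # boundary used without continuity correction
z_corrected = (50.5 - mu) / sigma    # boundary shifted by 0.5

print(criteria_ok)
print(f"z without correction: {z_uncorrected:.4f}")
print(f"z with correction:    {z_corrected:.4f}")
```

Omitting the correction moves the $z$-value by $0.5/\sigma \approx 0.11$, which is enough to change a tabulated probability in the second decimal place.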
UT-3: Combining Independent Normal Variables and Linear Transformations
Question:
The time (in minutes) that employee takes to complete a task is , and the time employee takes is . The times are independent.
(a) Find the probability that employee completes the task in less than 22 minutes.
(b) Both employees work on the same task sequentially (employee goes first, then employee ). Find the probability that the total time for the task is less than 60 minutes.
(c) The manager defines efficiency as . Find the mean and variance of .
(d) A student calculates and concludes that . Identify the error and calculate the correct value.
[Difficulty: hard. Tests the rules for combining normal variables and the common error of adding standard deviations.]
Solution:
(a) , so , .
(b) The total time is .
Since and are independent normal variables, is also normally distributed:
So , .
The key result used here is: if $X$ and $Y$ are independent, then $\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)$. The variances add, not the standard deviations.
(c) .
For the variance, we use $\text{Var}(aX + bY) = a^2\,\text{Var}(X) + b^2\,\text{Var}(Y)$ when $X$ and $Y$ are independent:
Note: the sign of the coefficient does not affect the variance because $(-b)^2 = b^2$.
(d) The student's error is adding standard deviations. The correct rule for independent variables is:
$$\sigma_{X+Y} = \sqrt{\sigma_X^2 + \sigma_Y^2}$$
The student's answer of is incorrect. The correct answer is 5.
The formula $\sigma_{X+Y} = \sigma_X + \sigma_Y$ is only valid when $X$ and $Y$ are perfectly positively correlated ($\rho = 1$), which is not the case here (they are independent). For independent variables, the variances add, so the standard deviation is the square root of the sum of variances, which is always less than or equal to the sum of standard deviations (by the triangle inequality for norms).
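The difference between adding variances and adding standard deviations is easy to see numerically. In the sketch below the standard deviations 3 and 4 are hypothetical values, chosen only because they are consistent with the stated correct answer of 5; the function names are also illustrative.

```python
from math import sqrt

def var_linear(a, var_x, b, var_y):
    """Var(aX + bY) for independent X and Y."""
    return a**2 * var_x + b**2 * var_y

def sd_of_independent_sum(sd_x, sd_y):
    """Standard deviation of X + Y for independent X and Y."""
    return sqrt(var_linear(1, sd_x**2, 1, sd_y**2))

sd_a, sd_b = 3.0, 4.0   # hypothetical standard deviations, not taken from the question
correct = sd_of_independent_sum(sd_a, sd_b)   # square root of the summed variances
naive = sd_a + sd_b                           # the student's (incorrect) method

print(correct, naive)
print(var_linear(1, 9, -1, 16))   # a negative coefficient still adds variance
```

With these values the correct standard deviation is 5 while the naive sum is 7, and the variance of a difference equals the variance of the sum.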
Integration Tests
Tests synthesis of statistical distributions with other topics. Requires combining concepts from multiple units.
IT-1: Hypothesis Test Using a Binomial Distribution (with Hypothesis Testing)
Question:
A supermarket claims that exactly 30% of its customers use reusable bags. An environmental group believes the true proportion is higher and surveys 20 randomly selected customers. They find that 9 of the 20 customers use reusable bags.
(a) Stating your hypotheses clearly, carry out a hypothesis test at the 5% significance level to determine whether there is evidence that the proportion of customers using reusable bags is greater than 30%.
(b) Calculate the actual significance level (the probability of a Type I error) for this test. Explain why it differs from the stated 5%.
(c) The environmental group increases their sample size to 50 customers and finds that 21 use reusable bags. Carry out the test again at the 5% level and compare your conclusion with part (a).
(d) Explain what is meant by the power of this test, and describe how increasing the sample size affects the power.
[Difficulty: hard. Combines binomial distribution with hypothesis testing, including actual significance level and power.]
Solution:
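(a) Let $p$ be the proportion of customers using reusable bags.
$H_0: p = 0.3$ (the proportion is 30%)
$H_1: p > 0.3$ (the proportion is greater than 30%)
Under $H_0$: $X \sim B(20, 0.3)$, where $X$ is the number of surveyed customers who use reusable bags.
We need the critical region. The test is one-tailed (upper tail).
We find the smallest value $c$ such that $P(X \ge c) \le 0.05$:
Using the binomial cumulative distribution with $n = 20$, $p = 0.3$:
$$P(X \ge 9) = 1 - P(X \le 8) = 1 - 0.8867 = 0.1133$$
$$P(X \ge 10) = 1 - P(X \le 9) = 1 - 0.9520 = 0.0480$$
Since $P(X \ge 9) = 0.1133 > 0.05$, 9 is not in the critical region.
Since $P(X \ge 10) = 0.0480 \le 0.05$, the critical region is $X \ge 10$.
The observed value is $X = 9$, which does not fall in the critical region.
Conclusion: There is insufficient evidence to reject $H_0$. The data does not provide sufficient evidence that the proportion of customers using reusable bags is greater than 30%.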
(b) The actual significance level is the probability of rejecting $H_0$ when $H_0$ is true, which equals the probability of $X$ falling in the critical region under $H_0$:
$$P(X \ge 10 \mid p = 0.3) = 0.0480$$
This differs from the stated 5% because the binomial distribution is discrete. There is no critical value that gives exactly 5%. The closest we can get is 4.80% (with critical region $X \ge 10$) or 11.33% (with critical region $X \ge 9$). We choose the critical region that does not exceed the stated significance level.
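The search for the critical value and the actual significance level can be sketched as follows. This is an illustrative implementation rather than the method used with statistical tables; `upper_tail` and `critical_value` are hypothetical helper names.

```python
from math import comb

def upper_tail(c, n, p):
    """P(X >= c) for X ~ B(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(c, n + 1))

def critical_value(n, p, alpha):
    """Smallest c with P(X >= c) <= alpha (upper-tail test), or None if no such c exists."""
    for c in range(n + 1):
        # the upper tail shrinks as c grows, so the first hit is the smallest c
        if upper_tail(c, n, p) <= alpha:
            return c
    return None

c = critical_value(n=20, p=0.3, alpha=0.05)
actual_level = upper_tail(c, 20, 0.3)
next_larger_region = upper_tail(c - 1, 20, 0.3)

print(c, round(actual_level, 4), round(next_larger_region, 4))
```

The jump from about 0.048 to about 0.113 between neighbouring critical values shows why a discrete test statistic cannot hit 5% exactly.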
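(c) Under $H_0$: $X \sim B(50, 0.3)$, so $np = 15$, $npq = 10.5$.
Using the normal approximation: $X \approx N(15, 10.5)$, $\sigma = \sqrt{10.5} \approx 3.240$.
With continuity correction for $P(X \ge 21)$:
$$P(X \ge 21) \approx P\left(Z > \frac{20.5 - 15}{3.240}\right) = P(Z > 1.697) \approx 0.045$$
Since $0.045 < 0.05$ (while the same calculation for $P(X \ge 20)$ gives approximately $0.082 > 0.05$), the critical region is approximately $X \ge 21$.
Alternatively, using exact binomial probabilities: $P(X \ge 21) \approx 0.048 < 0.05$. The observed value is $X = 21$, which is in the critical region.
Conclusion: With the larger sample, there is sufficient evidence to reject $H_0$ at the 5% level. This demonstrates that a larger sample size provides more statistical power, even when the observed proportion ($\frac{21}{50} = 0.42$) is similar to the smaller sample ($\frac{9}{20} = 0.45$).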
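The comparison between the normal approximation and the exact binomial tail for the larger sample can be sketched with the standard library. The variable names are illustrative, and the sketch assumes the upper-tail test of $p = 0.3$ with 21 successes out of 50.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p0, observed = 50, 0.3, 21
mu = n * p0
sigma = sqrt(n * p0 * (1 - p0))

# Normal approximation with continuity correction: P(X >= 21) ~ P(N > 20.5)
approx_p_value = 1 - NormalDist(mu, sigma).cdf(observed - 0.5)

# Exact binomial upper tail under the null hypothesis
exact_p_value = sum(
    comb(n, k) * p0**k * (1 - p0) ** (n - k) for k in range(observed, n + 1)
)

reject_h0 = exact_p_value <= 0.05
print(f"approximate: {approx_p_value:.4f}, exact: {exact_p_value:.4f}, reject: {reject_h0}")
```

Both values fall below 0.05, so the approximate and exact methods lead to the same decision here, although they differ slightly in the third decimal place.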
(d) The power of a test is the probability of correctly rejecting $H_0$ when $H_1$ is true. It equals $1 - P(\text{Type II error})$.
For this test, the power depends on the true value of $p$. If the true proportion were, say, $p = 0.5$, the power (for $n = 20$, with critical region $X \ge 10$) would be:
$$P(X \ge 10 \mid p = 0.5) = 1 - P(X \le 9) = 1 - 0.4119 = 0.5881$$
So the test has about 58.8% power to detect a true proportion of 0.5.
Increasing the sample size increases the power of the test because:
- The distribution under $H_0$ becomes more concentrated (smaller standard deviation), so the critical region starts at a proportionally lower value.
- The distribution under $H_1$ also becomes more concentrated, but the separation between the $H_0$ and $H_1$ distributions increases relative to their spread.
- This makes it easier to distinguish between $H_0$ and $H_1$, reducing the probability of a Type II error and increasing the power.
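The effect of sample size on power can be illustrated by evaluating the rejection probability under a true proportion of 0.5. The sketch assumes the critical regions $X \ge 10$ for $n = 20$ and $X \ge 21$ for $n = 50$; `upper_tail` is a hypothetical helper name.

```python
from math import comb

def upper_tail(c, n, p):
    """P(X >= c) for X ~ B(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(c, n + 1))

true_p = 0.5

# Power = probability of landing in the critical region when the true proportion is 0.5
power_n20 = upper_tail(10, 20, true_p)   # critical region X >= 10
power_n50 = upper_tail(21, 50, true_p)   # critical region X >= 21

print(f"n = 20: power = {power_n20:.4f}")
print(f"n = 50: power = {power_n50:.4f}")
```

The power for $n = 20$ reproduces the 58.8% figure, and the larger sample gives a noticeably higher value at a similar significance level.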
IT-2: Poisson Approximation and Continuous Uniform Distribution (with Integration)
Question:
A call centre receives an average of 2.4 calls per minute.
(a) Using the Poisson distribution, find the probability that the call centre receives exactly 3 calls in a one-minute period.
(b) Using a suitable approximation, find the probability that the call centre receives more than 60 calls in a 25-minute period. Justify your choice of approximation.
(c) The time between consecutive calls, $T$ minutes, follows an exponential distribution with mean $\frac{1}{2.4} \approx 0.417$ minutes. However, for simplicity, a student models $T$ using a continuous uniform distribution on $[0, 0.5]$. Using this uniform model, find:
(i) The probability that the time between calls is less than 10 seconds.
(ii) The median time between calls.
(iii) The 90th percentile of the time between calls.
(d) Compare the mean of the student's uniform model with the true mean of the exponential distribution. Comment on the suitability of the uniform model.
[Difficulty: hard. Combines Poisson distribution with continuous uniform distribution and integration.]
Solution:
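(a) Let $X$ = number of calls in one minute. Then $X \sim \text{Po}(2.4)$.
$$P(X = 3) = \frac{e^{-2.4} \times 2.4^3}{3!} \approx 0.2090$$
(b) Let $Y$ = number of calls in 25 minutes. Then $Y \sim \text{Po}(60)$, since $\lambda = 2.4 \times 25 = 60$.
Since $\lambda = 60$ is large, we can use the normal approximation $Y \approx N(60, 60)$.
With continuity correction:
$$P(Y > 60) = P(Y \ge 61) \approx P\left(Z > \frac{60.5 - 60}{\sqrt{60}}\right) = P(Z > 0.0645) \approx 0.474$$
The normal approximation is justified because $\lambda = 60$ is large, which is the standard criterion for approximating a Poisson distribution with a normal distribution.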
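Both parts can be checked numerically, and the normal approximation can be compared with the exact Poisson tail. This is a sketch using only the standard library; `poisson_pmf` is a hypothetical helper name.

```python
from math import exp, factorial, sqrt
from statistics import NormalDist

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Po(lam)."""
    return exp(-lam) * lam**k / factorial(k)

# Part (a): exactly 3 calls in one minute, rate 2.4 per minute
p_three = poisson_pmf(3, 2.4)

# Part (b): more than 60 calls in 25 minutes
lam = 2.4 * 25
approx = 1 - NormalDist(lam, sqrt(lam)).cdf(60.5)          # continuity-corrected normal
exact = 1 - sum(poisson_pmf(k, lam) for k in range(61))    # 1 - P(Y <= 60)

print(f"P(X = 3)        = {p_three:.4f}")
print(f"P(Y > 60) approx = {approx:.4f}")
print(f"P(Y > 60) exact  = {exact:.4f}")
```

The two values for part (b) agree to within about 0.01, which supports using the normal approximation for a mean this large.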
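(c) The student models $T \sim U[0, 0.5]$.
The probability density function is:
$$f(t) = \begin{cases} 2, & 0 \le t \le 0.5 \\ 0, & \text{otherwise} \end{cases}$$
(i) 10 seconds $= \frac{1}{6}$ minutes.
$$P\left(T < \tfrac{1}{6}\right) = \int_0^{1/6} 2 \, dt = \tfrac{1}{3}$$
(ii) The median $m$ satisfies $P(T \le m) = 0.5$:
$$2m = 0.5 \implies m = 0.25 \text{ minutes (15 seconds)}$$
For a uniform distribution, the median always equals the midpoint of the interval.
(iii) The 90th percentile $t_{90}$ satisfies $P(T \le t_{90}) = 0.9$:
$$2t_{90} = 0.9 \implies t_{90} = 0.45 \text{ minutes (27 seconds)}$$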
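The three uniform-model calculations follow from the cumulative distribution function and its inverse. The sketch below assumes the interval $[0, 0.5]$ minutes; `uniform_cdf` and `uniform_quantile` are hypothetical helper names.

```python
def uniform_cdf(t, a, b):
    """P(T <= t) for T uniform on [a, b]."""
    if t <= a:
        return 0.0
    if t >= b:
        return 1.0
    return (t - a) / (b - a)

def uniform_quantile(q, a, b):
    """Value t with P(T <= t) = q, for 0 <= q <= 1."""
    return a + q * (b - a)

a, b = 0.0, 0.5   # interval in minutes
p_under_10s = uniform_cdf(10 / 60, a, b)    # (i)   10 seconds = 1/6 minute
median = uniform_quantile(0.5, a, b)        # (ii)
p90 = uniform_quantile(0.9, a, b)           # (iii)

print(p_under_10s, median, p90)
```

Because the density is constant, every percentile is a linear function of the probability, so no integration beyond the area of a rectangle is needed.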
(d) Uniform model mean: $\frac{0 + 0.5}{2} = 0.25$ minutes.
True exponential mean: $\frac{1}{2.4} \approx 0.417$ minutes.
The uniform model mean (0.25 minutes = 15 seconds) is significantly lower than the true mean (0.417 minutes = 25 seconds). The uniform model assigns equal probability density to all values in $[0, 0.5]$, which means it underestimates the likelihood of longer waiting times. The exponential distribution has a peak near 0 and a long right tail, which is more realistic for inter-arrival times.
The uniform model is unsuitable because:
- It gives a maximum possible waiting time of 0.5 minutes (30 seconds), whereas the exponential distribution has no upper bound.
- It assigns the same density to all waiting times, not reflecting the reality that short waiting times are more common than long ones.
- The mean is off by approximately 40%.
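The size of the mismatch can be quantified with a short calculation. The sketch assumes an exponential rate of 2.4 per minute and the uniform model on $[0, 0.5]$; the variable names are illustrative.

```python
from math import exp

rate = 2.4                        # calls per minute
exp_mean = 1 / rate               # true mean waiting time in minutes
uniform_mean = (0 + 0.5) / 2      # mean of the student's uniform model

relative_error = (exp_mean - uniform_mean) / exp_mean

# Probability of waiting longer than 0.5 minutes under each model
p_long_wait_exponential = exp(-rate * 0.5)
p_long_wait_uniform = 0.0         # impossible under the uniform model

print(f"relative error in mean: {relative_error:.0%}")
print(f"P(T > 0.5) exponential: {p_long_wait_exponential:.4f}")
```

Under the exponential model roughly 30% of gaps exceed 30 seconds, an event the uniform model treats as impossible.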
IT-3: Expected Trials Until First Success — Geometric Distribution (with Sequences)
Question:
In a game, a player rolls a fair six-sided die. The player wins if they roll a 6. They keep rolling until they get a 6.
(a) Find the probability that the player wins on the 3rd roll.
(b) Find the expected number of rolls required to get the first 6.
(c) Find the probability that the player needs more than 4 rolls to get a 6. Express your answer as a single fraction.
(d) The player modifies the game: they roll two fair six-sided dice simultaneously and win if the sum of the two dice is 8. Let $Y$ be the number of rolls (pairs of dice) until the first win.
(i) Find the probability of winning on a single roll.
(ii) $Y$ follows a geometric distribution. State its parameter and find $E(Y)$ and $\text{Var}(Y)$.
(iii) The casino charges per roll and pays when the player wins. Find the value of for which the game is fair (expected net gain is zero).
[Difficulty: hard. Combines geometric distribution with sequences, expectation, and a fairness calculation.]
Solution:
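(a) Let $X$ = number of rolls until the first 6. Then $X \sim \text{Geo}(p)$ where $p = \frac{1}{6}$.
$$P(X = 3) = \left(\tfrac{5}{6}\right)^2 \times \tfrac{1}{6} = \tfrac{25}{216}$$
(b) For $X \sim \text{Geo}(p)$: $E(X) = \frac{1}{p} = 6$.
On average, the player needs 6 rolls to get the first 6. This is intuitive: with probability $\frac{1}{6}$ per roll, you expect to need $\frac{1}{1/6} = 6$ rolls.
(c)
$$P(X > 4) = \left(\tfrac{5}{6}\right)^4 = \tfrac{625}{1296}$$
This uses the tail formula for the geometric distribution, $P(X > n) = (1 - p)^n$: needing more than $n$ rolls means that the first $n$ rolls all fail. (The memoryless property, $P(X > m + n \mid X > m) = P(X > n)$, follows from this formula.)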
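Parts (a) to (c) can be verified with exact fractions. This is a minimal sketch; `geometric_pmf` and `geometric_tail` are hypothetical helper names, and $X$ counts the roll on which the first 6 appears.

```python
from fractions import Fraction

def geometric_pmf(k, p):
    """P(X = k): k - 1 failures followed by a success."""
    return (1 - p) ** (k - 1) * p

def geometric_tail(n, p):
    """P(X > n): the first n trials all fail."""
    return (1 - p) ** n

p = Fraction(1, 6)
p_third_roll = geometric_pmf(3, p)      # (a)
expected_rolls = 1 / p                  # (b)
p_more_than_four = geometric_tail(4, p) # (c)

# Memoryless property: P(X > 4 + 2 | X > 4) equals P(X > 2)
conditional = geometric_tail(6, p) / geometric_tail(4, p)

print(p_third_roll, expected_rolls, p_more_than_four, conditional)
```

The conditional probability equals $P(X > 2) = \frac{25}{36}$, showing that four failed rolls do not make a 6 any more likely on later rolls.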
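(d) (i) The possible sums when rolling two dice are 2 through 12. We count the outcomes giving sum 8:
(2,6), (3,5), (4,4), (5,3), (6,2), that is 5 outcomes out of 36.
$$P(\text{sum} = 8) = \tfrac{5}{36}$$
(ii) $Y \sim \text{Geo}\left(\tfrac{5}{36}\right)$.
$$E(Y) = \frac{1}{p} = \frac{36}{5} = 7.2, \qquad \text{Var}(Y) = \frac{1 - p}{p^2} = \frac{31/36}{25/1296} = \frac{1116}{25} = 44.64$$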
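The count of winning outcomes and the resulting mean and variance can be confirmed by enumerating all 36 equally likely pairs. The sketch uses exact fractions; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

# All ordered pairs (first die, second die) whose sum is 8
wins = [(a, b) for a, b in product(range(1, 7), repeat=2) if a + b == 8]

p_win = Fraction(len(wins), 36)
expected_rolls = 1 / p_win                    # E(Y) = 1 / p
variance_rolls = (1 - p_win) / p_win**2       # Var(Y) = (1 - p) / p^2

print(wins)
print(p_win, expected_rolls, variance_rolls)
```

Enumerating ordered pairs avoids the common mistake of counting (2,6) and (6,2) as a single outcome.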
(iii) Let the net gain be . The player pays per roll and receives upon winning. The number of rolls is .
Total cost (in pounds), winnings (upon winning, which always happens eventually since ).
Net gain .
For a fair game: .
The casino should pay for the game to be fair.
Alternatively, thinking of it per roll: the expected gain per roll is (lose with probability , gain with probability ). Setting this to zero:
Both approaches give the same answer, confirming the result.