
Data Representation — Diagnostic Tests

Unit Tests

Tests edge cases, boundary conditions, and common misconceptions for data representation.

UT-1: Outlier Effect on Measures of Central Tendency and Spread

Question:

A botanist records the heights (in cm) of 12 sunflower plants from a controlled growth experiment:

42, 45, 47, 48, 49, 50, 51, 52, 53, 54, 55, 180

(a) Calculate the mean, median, and mode of the full dataset.

(b) The value 180 cm is identified as a measurement error (the actual height was 58 cm). Recalculate the mean, median, and mode after correcting this value.

(c) The interquartile range and standard deviation are both measures of spread. Without calculating the standard deviation of the original (uncorrected) dataset, determine which measure of spread is more affected by the outlier. Justify your answer using the properties of each measure.

(d) A student argues: "Since the median barely changed, we should always use the median instead of the mean." Construct a counterexample with a small dataset where the median gives a misleading measure of central tendency.

[Difficulty: hard. Tests understanding of when each measure is preferred and requires constructing a counterexample.]

Solution:

(a) There are 12 data values, so $n = 12$.

Mean: \bar{x} = \frac{42 + 45 + 47 + 48 + 49 + 50 + 51 + 52 + 53 + 54 + 55 + 180}{12} = \frac{726}{12} = 60.5 \text{ cm}

Median: Since $n = 12$ is even, the median is the average of the 6th and 7th values. The data is already given in ascending order.

\text{Median} = \frac{50 + 51}{2} = 50.5 \text{ cm}

Mode: Every value appears exactly once, so there is no mode.

(b) Replacing 180 with 58 gives the dataset:

42, 45, 47, 48, 49, 50, 51, 52, 53, 54, 55, 58

Mean: \bar{x} = \frac{604}{12} = 50.33 \text{ cm (2 d.p.)}

Median: Average of the 6th and 7th values:

\text{Median} = \frac{50 + 51}{2} = 50.5 \text{ cm}

Mode: Still no mode.
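As a quick check, parts (a) and (b) can be reproduced with Python's statistics module (a sketch, not part of the required working):

```python
import statistics

original = [42, 45, 47, 48, 49, 50, 51, 52, 53, 54, 55, 180]
corrected = original[:-1] + [58]   # replace the erroneous 180 cm with 58 cm

print(statistics.mean(original))              # 60.5
print(statistics.median(original))            # 50.5
print(round(statistics.mean(corrected), 2))   # 50.33
print(statistics.median(corrected))           # 50.5
print(statistics.multimode(original))         # every value appears once: no meaningful mode
```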

(c) The standard deviation is more affected by the outlier. This is because the standard deviation involves squaring the deviations from the mean:

s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}

The outlier 180 is very far from the mean (about 120 cm away), so $(180 - 60.5)^2 \approx 14280$ contributes enormously to the sum of squared deviations. The interquartile range (IQR), by contrast, uses only $Q_1$ and $Q_3$, which depend on the middle 50% of the data. The outlier at 180 does not affect $Q_1$ or $Q_3$ at all, so the IQR is completely unchanged.

This is the fundamental advantage of the IQR over the standard deviation for skewed data or data with outliers: it is resistant (robust) to extreme values.
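A short sketch makes the robustness concrete. The `iqr` helper below uses the median-of-halves quartile convention (an assumption, consistent with how quartiles are found later in this section):

```python
import statistics

def iqr(data):
    """Interquartile range via the median-of-halves quartile convention."""
    s = sorted(data)
    half = len(s) // 2
    return statistics.median(s[-half:]) - statistics.median(s[:half])

original = [42, 45, 47, 48, 49, 50, 51, 52, 53, 54, 55, 180]
corrected = original[:-1] + [58]

print(iqr(original), iqr(corrected))   # identical: the outlier never touches Q1 or Q3
print(round(statistics.stdev(original), 1),
      round(statistics.stdev(corrected), 1))   # 37.8 vs 4.5: SD collapses once 180 is fixed
```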

(d) Consider the dataset $\{1, 1, 1, 1, 100\}$.

  • Mean $= \frac{104}{5} = 20.8$
  • Median $= 1$

The median of 1 suggests the "typical" value is 1, which is true for 4 out of 5 observations. However, the value 100 is a genuine part of the data (not an error), and the mean of 20.8 better reflects the overall level of the data. The median is misleading here because it completely ignores the magnitude of the upper tail.

This shows that the choice between mean and median depends on the context: the median is robust to outliers (good for error detection), but it discards information about the tails of the distribution.


UT-2: Misleading Histograms with Unequal Class Widths

Question:

The frequency distribution below shows the daily commuting times (in minutes) for 200 employees at a large company:

| Commuting time $t$ (min) | Frequency |
| --- | --- |
| $0 < t \leq 10$ | 12 |
| $10 < t \leq 20$ | 38 |
| $20 < t \leq 35$ | 56 |
| $35 < t \leq 60$ | 65 |
| $60 < t \leq 90$ | 29 |

(a) A student draws a histogram using the frequency on the vertical axis and the commuting time on the horizontal axis, making all bars the same width. Explain why this histogram is misleading, and state the correct quantity to plot on the vertical axis.

(b) Calculate the frequency density for each class and estimate the mean commuting time.

(c) An employee claims "the most common commuting time is between 35 and 60 minutes." Determine whether this claim is supported by the data, carefully distinguishing between the class with the highest frequency and the class with the highest frequency density.

(d) Estimate the proportion of employees who commute for more than 50 minutes, using the assumption that values are uniformly distributed within each class.

[Difficulty: hard. Tests the critical distinction between frequency and frequency density with unequal class widths.]

Solution:

(a) The class widths are not equal: 10, 10, 15, 25, and 30 minutes respectively. In a histogram, the area of each bar should represent the frequency, so with unequal class widths the bar heights must be frequency densities. By plotting raw frequency as the bar height with equal-width bars, the student makes the visual impression depend on frequency alone, which over-represents the wide classes (they accumulate large frequencies simply by covering more time) and under-represents the narrow ones.

The correct quantity for the vertical axis is the frequency density, defined as:

\text{Frequency density} = \frac{\text{Frequency}}{\text{Class width}}

(b) Frequency densities:

| Class | Width | Frequency | Frequency density |
| --- | --- | --- | --- |
| $0 < t \leq 10$ | 10 | 12 | 1.2 |
| $10 < t \leq 20$ | 10 | 38 | 3.8 |
| $20 < t \leq 35$ | 15 | 56 | 3.73 |
| $35 < t \leq 60$ | 25 | 65 | 2.6 |
| $60 < t \leq 90$ | 30 | 29 | 0.97 |

To estimate the mean, we use the midpoint of each class:

| Class | Midpoint $x$ | Frequency $f$ | $fx$ |
| --- | --- | --- | --- |
| $0 < t \leq 10$ | 5 | 12 | 60 |
| $10 < t \leq 20$ | 15 | 38 | 570 |
| $20 < t \leq 35$ | 27.5 | 56 | 1540 |
| $35 < t \leq 60$ | 47.5 | 65 | 3087.5 |
| $60 < t \leq 90$ | 75 | 29 | 2175 |

\sum f = 200, \quad \sum fx = 7432.5

\bar{x} = \frac{7432.5}{200} = 37.2 \text{ minutes (3 s.f.)}

(c) The class $35 < t \leq 60$ has the highest frequency (65), so more employees fall in this class than any other. However, the class with the highest frequency density is $10 < t \leq 20$ (density 3.8), meaning the data is most concentrated (per unit time interval) in the 10-20 minute range.

The employee's claim is supported in the sense that the largest number of employees commute for 35-60 minutes. But the claim could be misleading if interpreted as "the most common single commuting time is in this range", since the density is highest in the 10-20 minute range. The wide class width of the 35-60 minute class inflates its frequency.

(d) We need the proportion with $t > 50$.

  • In the class $35 < t \leq 60$ (width 25, frequency 65): the fraction of the class beyond 50 minutes is $\frac{60 - 50}{25} = 0.4$, giving an estimated frequency of $65 \times 0.4 = 26$.

  • In the class $60 < t \leq 90$: all 29 employees commute for more than 50 minutes.

Total estimated frequency with $t > 50$: $26 + 29 = 55$.

\text{Proportion} = \frac{55}{200} = 0.275

So approximately 27.5% of employees commute for more than 50 minutes.
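Parts (b) and (d) can be sketched in a few lines; the `classes` list and the `densities`/`above` names are illustrative, not part of the original working:

```python
classes = [(0, 10, 12), (10, 20, 38), (20, 35, 56), (35, 60, 65), (60, 90, 29)]

# Frequency density = frequency / class width, rounded as in the table above.
densities = {f"{lo}-{hi}": round(f / (hi - lo), 2) for lo, hi, f in classes}

n = sum(f for _, _, f in classes)                        # 200 employees
mean = sum(f * (lo + hi) / 2 for lo, hi, f in classes) / n

# Part (d): assume values are uniform within each class, so a class
# contributes the fraction of its width lying above the cut-off.
cut = 50
above = sum(f * (hi - max(lo, cut)) / (hi - lo)
            for lo, hi, f in classes if hi > cut)

print(densities)
print(mean, above / n)    # 37.1625 0.275
```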


UT-3: Data Coding and Its Effect on Summary Statistics

Question:

The temperatures (in degrees Celsius) at a weather station at noon on 15 consecutive days are recorded. The summary statistics for the raw data are:

\sum x = 285, \quad \sum x^2 = 5785

The data is coded using the formula $y = \frac{x - 10}{5}$.

(a) Find the mean and standard deviation of the coded data $y$.

(b) A student claims that the standard deviation of $y$ equals the standard deviation of $x$ divided by 5. Another student claims it equals the standard deviation of $x$ divided by $|5| = 5$, so they agree. A third student says the standard deviation of $y$ equals the standard deviation of $x$ divided by $b$ where $y = \frac{x - a}{b}$, and asks: "Does it matter whether $b$ is positive or negative?" Resolve this dispute with a clear explanation.

(c) A second weather station uses the coding $z = 3 - 2x$. Without recalculating from the raw data, find the mean and variance of $z$. Show that the variance of $z$ is the same as the variance of $w$ where $w = 2x - 3$, and explain why this is the case.

[Difficulty: hard. Tests the precise effect of coding on variance, particularly the role of $b^2$ vs $b$.]

Solution:

(a) For the raw data:

xˉ=LBxRB◆◆LBnRB=28515=19\bar{x} = \frac◆LB◆\sum x◆RB◆◆LB◆n◆RB◆ = \frac{285}{15} = 19

Sxx=x2LB(x)2RB◆◆LBnRB=5785285215=57858122515=57855415=370S_{xx} = \sum x^2 - \frac◆LB◆(\sum x)^2◆RB◆◆LB◆n◆RB◆ = 5785 - \frac{285^2}{15} = 5785 - \frac{81225}{15} = 5785 - 5415 = 370

Variance of x=Sxxn1=37014=1857\text{Variance of } x = \frac{S_{xx}}{n-1} = \frac{370}{14} = \frac{185}{7}

SD of x=LB1857RB\text{SD of } x = \sqrt◆LB◆\frac{185}{7}◆RB◆

For the coded data y=x105=15x2y = \frac{x - 10}{5} = \frac{1}{5}x - 2:

yˉ=15xˉ2=1952=3.82=1.8\bar{y} = \frac{1}{5}\bar{x} - 2 = \frac{19}{5} - 2 = 3.8 - 2 = 1.8

For the standard deviation: if y=xaby = \frac{x - a}{b}, then SD(y)=LB◆SD(x)RB◆◆LBbRB\text{SD}(y) = \frac◆LB◆\text{SD}(x)◆RB◆◆LB◆|b|◆RB◆.

SD(y)=LB1RB◆◆LB5RB×LB1857RB=15LB1857RB=LB185175RB=LB3735RB\text{SD}(y) = \frac◆LB◆1◆RB◆◆LB◆|5|◆RB◆ \times \sqrt◆LB◆\frac{185}{7}◆RB◆ = \frac{1}{5}\sqrt◆LB◆\frac{185}{7}◆RB◆ = \sqrt◆LB◆\frac{185}{175}◆RB◆ = \sqrt◆LB◆\frac{37}{35}◆RB◆

Alternatively:

Variance of y=152×Variance of x=125×1857=185175=3735\text{Variance of } y = \frac{1}{5^2} \times \text{Variance of } x = \frac{1}{25} \times \frac{185}{7} = \frac{185}{175} = \frac{37}{35}
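The whole of part (a) runs directly from the summary statistics; a sketch with assumed variable names (`n`, `sum_x`, `sum_x2` for $n$, $\sum x$, $\sum x^2$):

```python
n, sum_x, sum_x2 = 15, 285, 5785

mean_x = sum_x / n                              # 19.0
var_x = (sum_x2 - sum_x ** 2 / n) / (n - 1)     # S_xx / (n - 1) = 370/14 = 185/7

# Coding y = (x - 10)/5: the shift moves only the mean;
# the scale factor enters the variance squared.
mean_y = (mean_x - 10) / 5
var_y = var_x / 5 ** 2

print(mean_y)            # 1.8
print(round(var_y, 4))   # 1.0571 (= 37/35)
```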

(b) The key fact is:

\text{Var}(aX + b) = a^2\,\text{Var}(X)

The additive constant $b$ has no effect on the variance; the multiplicative factor $a$ scales the variance by $a^2$. (Note the notational clash: in the coding $y = \frac{x - a}{b}$, it is $a$ that is the shift and $b$ the scale.)

For $y = \frac{x - 10}{5} = \frac{1}{5}x - 2$:

\text{Var}(y) = \left(\frac{1}{5}\right)^2 \text{Var}(x) = \frac{1}{25}\text{Var}(x)

So $\text{SD}(y) = \frac{1}{5}\text{SD}(x)$.

Does the sign of $b$ matter? No. The variance scales by $\left(\frac{1}{b}\right)^2 = \frac{1}{b^2}$, and $b^2 = (-b)^2$, so the sign of $b$ is irrelevant. If the coding had been $y = \frac{x - 10}{-5}$, the variance would be the same; the mean would flip sign ($\bar{y} = -1.8$ instead of $1.8$), but the spread is identical.

The first two students are correct that $\text{SD}(y) = \text{SD}(x)/5$. The answer to the third student's question is that the sign of $b$ does not matter, because the variance depends on $b^2$, not $b$.

(c) For $z = 3 - 2x = -2x + 3$:

\bar{z} = -2\bar{x} + 3 = -2(19) + 3 = -38 + 3 = -35

\text{Var}(z) = (-2)^2\,\text{Var}(x) = 4 \times \frac{185}{7} = \frac{740}{7}

For $w = 2x - 3$:

\bar{w} = 2(19) - 3 = 38 - 3 = 35

\text{Var}(w) = 2^2\,\text{Var}(x) = 4 \times \frac{185}{7} = \frac{740}{7}

The variances are equal: $\text{Var}(z) = \text{Var}(w)$.

This is because variance depends on the square of the scaling factor. Both $z$ and $w$ scale $x$ by a factor of magnitude 2, and $(-2)^2 = 2^2 = 4$, so the variances are the same. The additive constant ($3$ or $-3$) and the sign of the multiplier affect only the mean, not the spread.
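A tiny numerical demonstration of part (c). The data `x` below is made up purely for illustration, not the station's temperatures:

```python
import statistics

# Illustrative data: variance sees only the squared scale factor.
x = [12, 15, 19, 21, 23]
z = [3 - 2 * v for v in x]    # z = 3 - 2x
w = [2 * v - 3 for v in x]    # w = 2x - 3

print(statistics.mean(z), statistics.mean(w))           # means differ (sign flip plus shift)
print(statistics.variance(z), statistics.variance(w))   # variances agree
assert statistics.variance(z) == 4 * statistics.variance(x)
```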


Integration Tests

Tests synthesis of data representation with other topics. Requires combining concepts from multiple units.

IT-1: Probability Distribution from Grouped Data (with Probability)

Question:

A factory produces bolts, and the length $X$ mm of each bolt is measured. The grouped frequency distribution for 500 bolts is:

| Length $x$ (mm) | Frequency |
| --- | --- |
| $24.0 \leq x < 24.5$ | 20 |
| $24.5 \leq x < 25.0$ | 85 |
| $25.0 \leq x < 25.5$ | 160 |
| $25.5 \leq x < 26.0$ | 145 |
| $26.0 \leq x < 26.5$ | 75 |
| $26.5 \leq x < 27.0$ | 15 |

A bolt is classified as defective if its length is less than 24.5 mm or greater than 26.5 mm.

(a) Estimate the probability that a randomly selected bolt is defective.

(b) Bolts are packed in boxes of 10. Assuming the probability of a bolt being defective is independent between bolts and equal to your estimate from part (a), find the probability that a randomly selected box contains at least one defective bolt.

(c) The factory claims that the mean length of bolts is 25.5 mm. Using the midpoints of the classes, test this claim at the 5% significance level. You may assume the distribution of sample means is approximately normal. [The standard deviation of bolt lengths is estimated to be 0.60 mm.]

[Difficulty: hard. Combines grouped data estimation, binomial probability, and hypothesis testing.]

Solution:

(a) Defective bolts are those in the classes $24.0 \leq x < 24.5$ and $26.5 \leq x < 27.0$.

P(\text{defective}) = \frac{20 + 15}{500} = \frac{35}{500} = 0.07

(b) Let $D$ be the number of defective bolts in a box of 10. Then $D \sim B(10, 0.07)$.

Using the complement:

P(D \geq 1) = 1 - P(D = 0) = 1 - (0.93)^{10}

1 - (0.93)^{10} = 1 - 0.4839\ldots = 0.516 \text{ (3 s.f.)}

There is approximately a 51.6% chance that a box contains at least one defective bolt.
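Part (b) as a sketch, using `math.comb` for the binomial coefficient (trivial here, since $\binom{10}{0} = 1$):

```python
from math import comb

n, p = 10, 0.07
p_zero = comb(n, 0) * p ** 0 * (1 - p) ** n   # binomial pmf at 0, i.e. P(D = 0)
print(round(1 - p_zero, 3))                   # 0.516
```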

(c) We test $H_0: \mu = 25.5$ against $H_1: \mu \neq 25.5$.

Estimate the sample mean using class midpoints:

| Class | Midpoint $x$ | Frequency $f$ | $fx$ |
| --- | --- | --- | --- |
| $24.0 \leq x < 24.5$ | 24.25 | 20 | 485 |
| $24.5 \leq x < 25.0$ | 24.75 | 85 | 2103.75 |
| $25.0 \leq x < 25.5$ | 25.25 | 160 | 4040 |
| $25.5 \leq x < 26.0$ | 25.75 | 145 | 3733.75 |
| $26.0 \leq x < 26.5$ | 26.25 | 75 | 1968.75 |
| $26.5 \leq x < 27.0$ | 26.75 | 15 | 401.25 |

\bar{x} = \frac{12732.5}{500} = 25.465 \text{ mm}

The test statistic under $H_0$:

z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} = \frac{25.465 - 25.5}{0.60 / \sqrt{500}} = \frac{-0.035}{0.02683} = -1.304

For a two-tailed test at the 5% level, the critical values are $z = \pm 1.96$.

Since $-1.96 < -1.304 < 1.96$, the test statistic does not fall in the critical region.

Conclusion: There is insufficient evidence to reject $H_0$. The data is consistent with the factory's claim that the mean bolt length is 25.5 mm.
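The grouped mean and test statistic can be checked as follows (a sketch; the normal critical value $\pm 1.96$ is taken as given, as in the solution):

```python
from math import sqrt

mids  = [24.25, 24.75, 25.25, 25.75, 26.25, 26.75]
freqs = [20, 85, 160, 145, 75, 15]

n = sum(freqs)                                        # 500 bolts
xbar = sum(m * f for m, f in zip(mids, freqs)) / n    # 25.465 mm

z = (xbar - 25.5) / (0.60 / sqrt(n))
print(xbar, round(z, 3))    # 25.465 -1.304
print(abs(z) > 1.96)        # False: do not reject H0
```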


IT-2: Regression Residuals and Box Plot Analysis (with Correlation and Regression)

Question:

A scientist investigates the relationship between the dosage dd (in mg) of a drug and the response time rr (in seconds) for 8 patients. The data is:

| $d$ | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| $r$ | 12.1 | 9.8 | 8.2 | 7.5 | 6.1 | 5.3 | 5.0 | 4.9 |

The regression line of $r$ on $d$ is $r = 12.08 - 0.953d$.

(a) Calculate the residuals for each patient and display them as a stem-and-leaf diagram.

(b) Construct a box plot of the residuals. Use the box plot to assess whether the linear model is appropriate. In your answer, identify any potential outlier and discuss whether removing it would change the slope of the regression line.

(c) The scientist fits a second regression line after removing the data point with the largest absolute residual. Without recalculating, predict whether the new regression line will have a steeper or less steep slope. Justify your answer by considering how the least squares criterion works.

[Difficulty: hard. Combines residual analysis, box plot construction, and understanding of leverage in regression.]

Solution:

(a) Residuals $=$ observed $-$ predicted:

| $d$ | Observed $r$ | Predicted $\hat{r} = 12.08 - 0.953d$ | Residual $r - \hat{r}$ |
| --- | --- | --- | --- |
| 1 | 12.1 | 11.127 | 0.973 |
| 2 | 9.8 | 10.174 | $-0.374$ |
| 3 | 8.2 | 9.221 | $-1.021$ |
| 4 | 7.5 | 8.268 | $-0.768$ |
| 5 | 6.1 | 7.315 | $-1.215$ |
| 6 | 5.3 | 6.362 | $-1.062$ |
| 7 | 5.0 | 5.409 | $-0.409$ |
| 8 | 4.9 | 4.456 | 0.444 |

Rounded to 2 d.p.: $0.97$, $-0.37$, $-1.02$, $-0.77$, $-1.22$, $-1.06$, $-0.41$, $0.44$.

Stem-and-leaf of the residuals rounded to 1 d.p. (key: $-1 \mid 2$ means $-1.2$):

-1 | 2 1 0
-0 | 8 4 4
 0 | 4
 1 | 0

(b) Ordered residuals: $-1.22$, $-1.06$, $-1.02$, $-0.77$, $-0.41$, $-0.37$, $0.44$, $0.97$ ($n = 8$).

$Q_1$ = median of the lower half $= \frac{-1.06 + (-1.02)}{2} = -1.04$

$Q_3$ = median of the upper half $= \frac{-0.37 + 0.44}{2} = 0.035$

Median $= \frac{-0.77 + (-0.41)}{2} = -0.59$

\text{IQR} = 0.035 - (-1.04) = 1.075

Lower fence $= Q_1 - 1.5 \times \text{IQR} = -1.04 - 1.6125 = -2.6525$

Upper fence $= Q_3 + 1.5 \times \text{IQR} = 0.035 + 1.6125 = 1.6475$

No residuals fall outside the fences, so there are no outliers. The box plot would show:

  • Whiskers from $-1.22$ to $0.97$
  • Box from $-1.04$ to $0.035$
  • Median line at $-0.59$

The residuals show a pattern: negative residuals are clustered for middle dosages ($d = 3, 4, 5, 6$) while positive residuals appear at the extremes ($d = 1, 8$). This U-shaped pattern in the residuals suggests the relationship may be slightly curved rather than perfectly linear, though the deviation is small.
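The residuals, quartiles, and fences can be reproduced in a short sketch; note it works with the exact residuals rather than the 2 d.p. roundings, so $Q_1$ comes out as $-1.0415$ rather than $-1.04$:

```python
import statistics

d = [1, 2, 3, 4, 5, 6, 7, 8]
r = [12.1, 9.8, 8.2, 7.5, 6.1, 5.3, 5.0, 4.9]

# Residuals from the stated line r = 12.08 - 0.953d, sorted for the quartiles.
resid = sorted(obs - (12.08 - 0.953 * di) for di, obs in zip(d, r))

half = len(resid) // 2
q1 = statistics.median(resid[:half])     # median of the lower four residuals
q3 = statistics.median(resid[-half:])    # median of the upper four residuals
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [e for e in resid if e < lower_fence or e > upper_fence]

print([round(e, 2) for e in resid])
print(outliers)    # []: nothing beyond the 1.5 * IQR fences
```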

(c) The largest absolute residual is at $d = 5$ (residual $-1.22$). The observed value $r = 6.1$ lies below the predicted value of 7.32, so the model over-predicts at this point.

Removing this point would be expected to make the slope slightly less steep (less negative). The least squares line minimises the sum of squared residuals, so the point at $d = 5$, lying well below the line and just to the right of the mean dosage $\bar{d} = 4.5$, exerts a downward pull on the slope. Removing it releases that pull: the line sits slightly higher in the middle of the range and its slope becomes slightly less negative.

The effect is small, however, because a point this close to the centre of the $d$-range has little leverage; points near the extremes ($d = 1$ and $d = 8$) influence the slope far more, and those are close to the line.


IT-3: Continuous Probability Density Function and the Median (with Integration)

Question:

A continuous random variable $X$ has probability density function:

f(x) = \begin{cases} \frac{3}{64}x^2 & 0 \leq x \leq 4 \\ 0 & \text{otherwise} \end{cases}

(a) Verify that $f(x)$ is a valid probability density function.

(b) Find the median of $X$, giving your answer to 3 significant figures.

(c) Find the interquartile range of $X$.

(d) The values of $X$ are recorded as a grouped frequency distribution using the classes $0 \leq x < 1$, $1 \leq x < 2$, $2 \leq x < 3$, $3 \leq x < 4$. Estimate the mean and standard deviation from this grouped data, and compare your answers with the true values $\mathrm{E}(X) = 3$ and $\mathrm{SD}(X) = \sqrt{3/5} \approx 0.775$. Comment on the accuracy of the grouped estimates.

[Difficulty: hard. Combines integration of a PDF, quartile calculation, and comparison of grouped vs exact statistics.]

Solution:

(a) A valid PDF must satisfy $f(x) \geq 0$ for all $x$ and $\int_{-\infty}^{\infty} f(x)\,dx = 1$.

Since $x^2 \geq 0$ and $\frac{3}{64} > 0$, we have $f(x) \geq 0$ on $[0, 4]$ and $f(x) = 0$ elsewhere.

\int_{0}^{4} \frac{3}{64}x^2\,dx = \frac{3}{64}\left[\frac{x^3}{3}\right]_0^4 = \frac{3}{64} \cdot \frac{64}{3} = 1 \quad \checkmark

(b) The median $m$ satisfies $\int_{0}^{m} f(x)\,dx = 0.5$:

\int_{0}^{m} \frac{3}{64}x^2\,dx = \frac{3}{64} \cdot \frac{m^3}{3} = \frac{m^3}{64} = 0.5

m^3 = 32

m = 32^{1/3} = 3.1748\ldots \approx 3.17 \text{ (3 s.f.)}

(c) $Q_1$ satisfies $\int_{0}^{Q_1} f(x)\,dx = 0.25$:

\frac{Q_1^3}{64} = 0.25 \implies Q_1^3 = 16 \implies Q_1 = 16^{1/3} = 2.520\ldots

$Q_3$ satisfies $\int_{0}^{Q_3} f(x)\,dx = 0.75$:

\frac{Q_3^3}{64} = 0.75 \implies Q_3^3 = 48 \implies Q_3 = 48^{1/3} = 3.634\ldots

\text{IQR} = Q_3 - Q_1 = 3.634 - 2.520 = 1.114 \approx 1.11 \text{ (3 s.f.)}
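Since the CDF is $F(x) = x^3/64$ on $[0, 4]$, every quantile comes from solving $x^3/64 = p$; a sketch with an assumed `quantile` helper:

```python
# Solve x**3 / 64 = p for the p-quantile of X.
def quantile(p):
    return (64 * p) ** (1 / 3)

median = quantile(0.5)                  # cube root of 32
q1, q3 = quantile(0.25), quantile(0.75)

print(round(median, 3))                 # 3.175
print(round(q1, 2), round(q3, 3))       # 2.52 3.634
print(round(q3 - q1, 3))                # 1.114
```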

(d) First, the true expected value:

\mathrm{E}(X) = \int_{0}^{4} x \cdot \frac{3}{64}x^2\,dx = \frac{3}{64}\int_{0}^{4} x^3\,dx = \frac{3}{64}\left[\frac{x^4}{4}\right]_0^4 = \frac{3}{64} \cdot 64 = 3

\mathrm{E}(X^2) = \int_{0}^{4} x^2 \cdot \frac{3}{64}x^2\,dx = \frac{3}{64}\int_{0}^{4} x^4\,dx = \frac{3}{64}\left[\frac{x^5}{5}\right]_0^4 = \frac{3}{64} \cdot \frac{1024}{5} = \frac{48}{5} = 9.6

\mathrm{Var}(X) = 9.6 - 9 = 0.6, \quad \mathrm{SD}(X) = \sqrt{0.6} = \sqrt{\tfrac{3}{5}} \approx 0.7746

Now the grouped estimates. We need the class frequencies. Since we are modelling from the PDF, the expected frequency in each class (out of a large sample) is proportional to the class probability:

| Class | Midpoint $x$ | $P(\text{class})$ | $x \cdot P$ | $x^2 \cdot P$ |
| --- | --- | --- | --- | --- |
| $0 \leq x < 1$ | 0.5 | $\frac{1}{64} = 0.01563$ | 0.00781 | 0.00391 |
| $1 \leq x < 2$ | 1.5 | $\frac{7}{64} = 0.10938$ | 0.16406 | 0.24609 |
| $2 \leq x < 3$ | 2.5 | $\frac{19}{64} = 0.29688$ | 0.74219 | 1.85547 |
| $3 \leq x < 4$ | 3.5 | $\frac{37}{64} = 0.57813$ | 2.02344 | 7.08203 |

Estimated mean: $\bar{x} \approx 0.00781 + 0.16406 + 0.74219 + 2.02344 = 2.9375 \approx 2.94$

The true mean is 3. The grouped estimate underestimates by about 0.06, or roughly 2%. This error arises because the midpoint approximation assumes uniform distribution within each class, but the PDF is increasing, so the midpoint systematically underestimates the class mean for each class.

Estimated variance:

\mathrm{E}(X^2) \approx 0.00391 + 0.24609 + 1.85547 + 7.08203 = 9.1875

\text{Estimated Var} \approx 9.1875 - 2.9375^2 = 9.1875 - 8.6289 = 0.5586

\text{Estimated SD} \approx \sqrt{0.5586} \approx 0.747

The true SD is 0.775, so the grouped estimate underestimates by about 3.6%. The grouped frequency approach loses precision because it replaces the continuous distribution with a discrete approximation within each class.
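The grouped-versus-exact comparison in part (d) can be verified numerically (a sketch; the class probabilities come straight from the CDF $F(x) = x^3/64$):

```python
# Class probabilities P(k <= X < k+1) = F(k+1) - F(k), then midpoint estimates.
probs = [((k + 1) ** 3 - k ** 3) / 64 for k in range(4)]   # 1/64, 7/64, 19/64, 37/64
mids  = [k + 0.5 for k in range(4)]                        # 0.5, 1.5, 2.5, 3.5

mean_est = sum(m * p for m, p in zip(mids, probs))
var_est  = sum(m * m * p for m, p in zip(mids, probs)) - mean_est ** 2

print(mean_est)    # 2.9375 (exact mean: 3)
print(var_est)     # 0.55859375 (exact variance: 0.6)
```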