Stat Labs: June 2006

Reading Chapter 1 of "Stat Labs" by Nolan and Speed.

All work done in R.

Start by loading the data.

----

The Histogram

Histogram follows naturally from a tabular description of data.

> table(bab2$height[bab2$height!=99]) # 99 means unknown

gives

53 54 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
1 1 1 1 10 26 55 105 131 166 183 182 153 105 54 20 13 6 1

Mother's height

How does one force the hist function to display Percent per inch instead of the usual density or frequency? I could not find a nice way to do it. Instead I had to call hist first and store its results into a list called h. Then I changed "counts" to 100*density and plotted the new list "h" using plot, which plotted as desired. Here is the code.

Cigarettes smoked per day

The histogram in Figure 1.4 looked hard. Took help from pp.61-62 of Dalgaard in writing code.

Infant birth weight

Easy. Modified the image within OpenOffice.org impress.

----

Five-number Summary

# remove the unknowns first
> isUnknown = (bab2$weight == 999)
> fivenum(bab2$weight[!isUnknown])

[1] 87.0 114.5 125.0 139.0 250.0

----

Box-and-Whisker Plot

Still do not understand it completely. For example:

> bstat = boxplot(bab2$weight[!isUnknown], plot=F)$stats
> bstat[2]
[1] 114.5

which is same as

> fivenum(bab2$weight[!isUnknown])[2]
[1] 114.5

but is different from

> summary(bab2$weight[!isUnknown])[2]
1st Qu.
114.8

> quantile(bab2$weight[!isUnknown])[2]
25%
114.75

Following Dalgaard's advice (Introductory Statistics in R, p. 65) I searched in ?boxplot.stats() and there it is:

The two "hinges" are versions of the first and third quartile, i.e., close to 'quantile(x, c(1,3)/4)'. The hinges equal the quartiles for odd n (where 'n <- length(x)') and differ for even n. Where the quartiles only equal observations for 'n %% 4 == 1' (n = 1 mod 4), the hinges do so _additionally_ for 'n %% 4 == 2' (n = 2 mod 4), and are in the middle of two observations otherwise.

There is more complex stuff in that help doc.

Any way, I drew Fig 1.6 and added some description of my own.

----

The Normal Curve

The 68-95-99.7 rule. And more.

So, I draw a curve using the formula and color the area (1SD from each side of mean).

But those skewness and kurtosis formulas are even more difficult!

In any case, simulated normal distribution 1000 times (with 484 samples), calculated kurtosis values for each simulation, and drew a histogram from these values. Overlaied the current Kurtosis value 2.975698 (red line) on the histogram.

----

Understanding the quantiles

I have a set of measurements (I call this set A):
175 110 1 341 312 294 420 202 130 166 199 424 392 353 300 264
143 152 498 355 227 218 299 193 326 379 540 344 296 278 265 161
223 434 353 58 322 198 267 76 271 407 386 304 342 439 334 261
417 126 446 426 106 157 414 236 309 370 410 389 332 416 68 68
241 290 275 135 321 434 343 155 463 267 320 180 150 185 313 407
247 249 252 159 247 383 225 311 252 156 334 384 165 275 333 189
289 167 271 201

Sorting the data gives:
1 58 68 68 76 106 110 126 130 135 143 150 152 155 156 157
159 161 165 166 167 175 180 185 189 193 198 199 201 202 218 223
225 227 236 241 247 247 249 252 252 261 264 265 267 267 271 271
275 275 278 289 290 294 296 299 300 304 309 311 312 313 320 321
322 326 332 333 334 334 341 342 343 344 353 353 355 370 379 383
384 386 389 392 407 407 410 414 416 417 420 424 426 434 434 439
446 463 498 540

Median of the data: 276.5

Which means half of the data lie below this cut-off number.

Five-number summary gives us:

Minimum 1st Quartile Median 3rd Quartile Maximum
1.0 192.0 276.5 353.0 540.0

A quarter of the data lie below 192 (lower quartile)
Three-quarters of the data lie below 353 (upper quartiles)

From p.15 of Nolan and Speed (2000) “median, lower and upper quartiles are examples of quantiles”.

More quantiles:

0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
1.0 142.2 166.8 213.2 252.0 276.5 311.4 336.1 383.2 417.3 540.0

These numbers give an idea of the distribution of the data. How to compare the distribution of two sets of data. Let us take another set of measurements (say, B)

291 284 363 364 252 227 325 459 445 344 400 294 334 117 424 325
220 376 413 340 287 222 1 264 316 192 334 404 331 334 173 454
243 251 236 420 194 299 330 275 156 226 442 94 269 372 392 317
247 226 299 476 223 416 338 160 315 276 324 261 219 290 263 320
121 448 330 165 368 316 158 317 133 406 343 330 351 312 370 297
289 226 298 413 434 354 455 388 370 368 206 254 161 296 71 245
436 168 288 500

And its quantiles

0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
1.0 164.6 225.4 253.4 288.6 313.5 330.0 346.1 378.4 425.0 500.0

We could then compare the two distributions in a quantile-quantile scatter plot (qqplot).

But this only has 10 points. A real qqplot has many more points (much finer resolution of the quantiles, see below).

----

This line produces Figure 1.9.

> qqnorm(bab2$weight[bab2$weight!=999], xlim=c(-4,4), ylim = c(50,250), pch=".", xlab="Quantiles of standard normal", ylab="Weight (pounds)")

Output looks similar to that from:

> wt = bab2$weight[bab2$weight!=999]
> plot(x=sort(rnorm(length(wt))), y=sort(wt))

For Fig. 1.11

> qqplot(x=runif(length(wt)), y=wt, ...)

----

On p.17 it reads:

[quantile-quantile plots] compare two sets of data to each other by pairing their respective sample quantiles. Again, a departure from a straight line indicates a difference in the shapes of the two distributions. When the two distributions are identical, the plot should be linear with slope 1 and intercept 0 ...

Let us start with two datasets of 1000 numbers each (zero mean and unit variance).

> x = rnorm(1000, mean=0, sd=1)
> y = rnorm(1000, mean=0, sd=1)

There is a function called qqplot which accepts the two data sets, calculates the quantiles and plots one against the other. By having the histograms on the margins of the quantile-quantile plot, we understand something about qqplot. I wrote a function qq.hist (derived from the help doc for layout) to do this.

> qq.hist(x,y)

Because the both X and Y have the same distribution, the quantile-quantile plot is (approximately) straightline passing through zero and with slope 1.

Suppose we have a third data set z (with higher variance).

> z = rnorm(1000, mean=0, sd=3)
> qq.hist(x,z, ylab="Z")

For small increments of X, there are larger increments of Z (non-unit slope).

Because both have the same mean, the line passes through origin.

Let us now have a distribution with long right-tail.

> z=rnorm(500)
> for(i in 1:10) {
+ tmp = rnorm(100,mean=i-1, sd=i)
+ z=c(z, tmp[tmp>0])
+ }
> qq.hist(x,z, ylab="Z")

----
To produce Figure 1.12:

> isSmoker = (bab2$smoke == 1)
> isNonSmoker = (bab2$smoke == 0)
> unknownWt = (bab2$weight == 999)
> qqplot(bab2$weight[isNonSmoker & !unknownWt], bab2$weight[isSmoker & !unknownWt], xlab="Nonsmoker's weight (pounds)", ylab="Smoker's weight (pounds)", pch=".", xlim=c(80,260), ylim=c(80,220), bty="n", xaxp=c(80,260,9))
> abline(0,1)

----

Stat Labs

About Me

Thursday, June 22, 2006

Maternal Smoking and Infant Health I