"If a man stood with one foot in an oven and the other foot in a freezer, statisticians would say that, on the average, he was comfortable."

Quote Magazine, June 29, 1975

1.2 DESCRIBING DISTRIBUTIONS WITH NUMBERS (Pages 30-46)

OVERVIEW: A numerical summary of a data distribution should somehow indicate its center and its spread. The concept of spread is important. As a simple example, sets A and B both have a mean of 50. However, the sets are very different in terms of spread.

A = {50,50,50,50,50,50,50,50,50,50} and B = {0,0,0,0,0,100,100,100,100,100}

Measures of center.

Mean (important to understand sigma notation)
Median (the 50th percentile)
Mode (most frequent score. A data set can be multi-modal).

Measures of spread.

Quartiles [Q1 (25th percentile), Median (50th percentile), Q3 (75th percentile)]
Range (Max. value - min. value)
Interquartile range (Q3 - Q1)

The Five-Number Summary (Min., Q1, Median, Q3, Max.)
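It is very important to note that there are definite conventions for establishing the "big five" for a numerical data set. For instance, the median for a data set is unique. In the convention used here, Q1 is the median of the values below the median's position and Q3 is the median of the values above it; when the number of values is odd, the median itself belongs to neither half. You should understand how the values are determined for the following data sets.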

Set#  Data Set                   Min.  Q1  Median  Q3  Max.  Range  IQ Range
#1    1,3,5,20,22                1     2   5       21  22    21     19
#2    1,3,5,20,22,40             1     3   12.5    22  40    39     19
#3    1,3,5,20,22,40,40,50       1     4   21      40  50    49     36
#4    1,3,5,20,22,40,40,50,140   1     4   22      45  140   139    41
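The values in the table can be checked with a short Python sketch. It assumes the median-of-halves convention (the median is left out of both halves when the number of values is odd), which reproduces every row above; the function name five_number_summary is a hypothetical helper, and other software may use a different quartile rule.

```python
from statistics import median

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) using the median-of-halves convention."""
    values = sorted(data)
    n = len(values)
    half = n // 2
    lower = values[:half]           # values below the median position
    upper = values[half + n % 2:]   # values above it (middle value skipped when n is odd)
    return (values[0], median(lower), median(values), median(upper), values[-1])

data_sets = {
    1: [1, 3, 5, 20, 22],
    2: [1, 3, 5, 20, 22, 40],
    3: [1, 3, 5, 20, 22, 40, 40, 50],
    4: [1, 3, 5, 20, 22, 40, 40, 50, 140],
}

for number, data in data_sets.items():
    low, q1, med, q3, high = five_number_summary(data)
    print(f"#{number}: {low} {q1} {med} {q3} {high}  range={high - low}  IQ range={q3 - q1}")
```

Because conventions differ, a calculator or library that interpolates percentiles can report slightly different quartiles for the same data.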

The "big five" are all that is needed to construct a boxplot (sometimes called a box-whisker plot) for a data set. Boxplots are useful when you have lots of data to summarize, where displays like dotplots and stemplots become impractical. A boxplot does not identify every individual piece of data, but rather summarizes the data by quartiles. A modified boxplot is frequently used to identify outliers. An outlier is defined to be a number that is more than 1.5 IQ ranges above Q3, or more than 1.5 IQ ranges below Q1.

For example, in Data Set #4 above, the IQ Range is 41.
1.5 x 41 = 61.5.
Q3 + 1.5(IQ Range) = 45 + 61.5 = 106.5. Since 140 > 106.5, the number 140 is an outlier, and it would be so identified in a modified boxplot. (The TI-83 graphics calculator produces both types of boxplots.)
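The same check can be sketched in Python for both ends of the data. The quartiles Q1 = 4 and Q3 = 45 are taken from Data Set #4; the function name outlier_fences is a hypothetical helper introduced only for this illustration.

```python
def outlier_fences(q1, q3):
    """Return the (lower, upper) cutoffs of the 1.5 x IQ range rule."""
    iq_range = q3 - q1
    return q1 - 1.5 * iq_range, q3 + 1.5 * iq_range

data = [1, 3, 5, 20, 22, 40, 40, 50, 140]        # Data Set #4
lower_fence, upper_fence = outlier_fences(4, 45)  # Q1 and Q3 for Data Set #4

# Any value outside the two fences is flagged as an outlier.
outliers = [x for x in data if x < lower_fence or x > upper_fence]
print("fences:", lower_fence, upper_fence)
print("outliers:", outliers)
```

The lower cutoff is negative here, so no value in this set can be an outlier on the low side.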

Important to note:

• The mean is greatly influenced by an outlier; the median is not.
• The range is greatly influenced by an outlier; the IQ range is not.
• Q1 and Q3 are not greatly influenced by an outlier.
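These points can be seen by comparing Data Set #4 with the same data after the value 140 is removed (which is Data Set #3). The following is only an illustrative sketch using Python's standard statistics module.

```python
from statistics import mean, median

with_outlier = [1, 3, 5, 20, 22, 40, 40, 50, 140]   # Data Set #4
without_outlier = with_outlier[:-1]                  # Data Set #3 (140 removed)

for label, data in (("with 140", with_outlier), ("without 140", without_outlier)):
    data_range = max(data) - min(data)
    print(f"{label}: mean={mean(data):.3f}  median={median(data)}  range={data_range}")
```

The mean and the range move a long way when 140 is dropped, while the median moves by only one unit.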

A statistic is a number that is computed from a sample. A parameter is a number that is computed from a population. Means, medians, IQ ranges, etc. could be statistics or parameters.

In statistics, the standard deviation is frequently a very important measure of spread. The variance is the square of the standard deviation. There are two different standard deviations, depending on whether it is being computed from a population or from a sample.

A population standard deviation is designated by σ, and it is a parameter.
A sample standard deviation is designated by s, and it is a statistic.

σ and s are calculated slightly differently, as demonstrated below for the small data set W = {10,20,30}. The mean of W is 20. (If a data set is large, the difference between σ and s is very small.)

Data (X)   (X-20)   (X-20)^2
10         -10      100
20         0        0
30         10       100
TOTALS     0        200

If W is considered to be a population, then the standard deviation and variance are, respectively,
σ = sqrt(200/3) = 8.16496581 and σ^2 = 66.6666667.

If W is considered to be a sample, then the standard deviation and variance are, respectively,
s = sqrt(200/2) = 10 and s^2 = 100.
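The two calculations differ only in the divisor: the number of values for a population, and one less than that for a sample. A minimal Python sketch of both, checked against the standard library functions pstdev (population) and stdev (sample), follows; the variable names are chosen for this illustration.

```python
import math
from statistics import pstdev, stdev

W = [10, 20, 30]
n = len(W)
mean_w = sum(W) / n
sum_of_squares = sum((x - mean_w) ** 2 for x in W)   # total of the (X-20)^2 column

population_sd = math.sqrt(sum_of_squares / n)        # divide by n
sample_sd = math.sqrt(sum_of_squares / (n - 1))      # divide by n - 1

print("population:", population_sd, "variance:", sum_of_squares / n)
print("sample:    ", sample_sd, "variance:", sum_of_squares / (n - 1))
print("library:   ", pstdev(W), stdev(W))
```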

It's significant to note that units attached to a variance are square units, whereas the standard deviation has the same unit as the data itself.

For the present time we will be mostly concerned with the standard deviation, s. Things to note:

• s measures spread about the mean. (The median is not used.)
• If s = 0, there is no spread. (In this case, all observations are identical.)
• s is influenced by outliers.
• s is most meaningful with data that has a symmetrical shape.
• If data is heavily skewed, s is not a particularly useful statistic.

Remember the sets A and B described in the OVERVIEW. Set A has mean = 50, standard deviation = s = 0, and variance = s^2 = 0 (square units). Set B has mean = 50, standard deviation = s = 52.705, and variance = s^2 = 2777.778 (square units).
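As a quick check, the sample statistics for sets A and B can be computed with Python's statistics module. This is a sketch that treats both sets as samples, so it divides by n - 1.

```python
from statistics import mean, stdev, variance

A = [50] * 10
B = [0] * 5 + [100] * 5

for name, data in (("A", A), ("B", B)):
    print(f"Set {name}: mean={mean(data)}  s={stdev(data):.3f}  s^2={variance(data):.3f}")
```

The variance of B is computed from the unrounded sum of squares (25000/9). Squaring the rounded value 52.705 gives a slightly larger number.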

The phrase degrees of freedom is mentioned in this section. This concept will be important in future studies, but for now, an intuitive feeling for the phrase will be provided. Consider a set of five numbers with a definite sum (say 100), and hence, a definite mean (100/5 = 20). Note the table:

Set of five numbers      Sum   Mean
12   18   5   43   x     100   20

Note that one could replace the four numbers 12, 18, 5, and 43 with any other set of four numbers, and then "adjust" the value of x so that the sum of the five numbers is 100. (With the four numbers shown, x must be 100 - 78 = 22.) That is, four of the numbers can vary freely and, for each set of four numbers, the value x can be "adjusted" to preserve the sum of 100. In this situation, we say that there are 5-1 = 4 degrees of freedom. If we had a set of N numbers with a definite sum, then the degrees of freedom would be N-1.
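The idea can be sketched in a few lines of Python: once the sum is fixed, the last value is forced by the other four. The function name forced_value and the second set of four numbers are hypothetical, chosen only for illustration.

```python
def forced_value(free_values, required_sum=100):
    """Return the one remaining value that makes the total equal required_sum."""
    return required_sum - sum(free_values)

# Four numbers chosen freely; the fifth is then determined.
print(forced_value([12, 18, 5, 43]))   # the four numbers from the table
print(forced_value([1, 2, 3, 4]))      # any other four numbers work as well
```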

 Note: Problem #1 in Section II of the 2001 Advanced Placement Statistics Examination involves the concept of an outlier and other statistical concepts introduced in Chapter 1. You can see a detailed solution to this actual AP problem by going to the home page of Herkimer's Hideaway, taking the link to WRITINGS AND REFLECTIONS, and then the link to item #67. Please note that this link does not contain a statement of the problem, just a detailed solution. The location does provide a link to the College Board, where you can get the problem statement if you don't have it available. (The College Board does not allow for the copying of an actual problem statement on a site such as this.)

Extremely important for success on the Advanced Placement Statistics Examination:
If you are given a numerical data set, always (I repeat, always) display the shape of the distribution.
Using the TI-83, this can be done very easily with a histogram or a boxplot.

This photo shows a non-random sample of sixteen members of the Cate School community. The tall gentleman is Mr. Jim Masker, a distinguished Cate history teacher. (He's not standing on a stool. He is actually as tall as he appears.) Assume that we record the heights of the sixteen individuals and calculate the following statistics:

Mean
Median
Standard Deviation
Variance
Range
Interquartile Range
25th Percentile
75th Percentile

How does Mr. Masker's height affect each statistic?