Sanderson M. Smith

Careless Use of Statistics: An Example

The El Batidor is the Cate School student newspaper. In the April 6, 2000 issue, the following article appeared.

Cate Primary

In light of the recent Presidential Primary El Bat conducted its own schoolwide primary. For the special El Batidor Poll, 128 students and faculty voted voluntarily at lunch. Margin of error plus/minus 2 percentage points. Here are the results:

Bill Bradley: 32%
Al Gore: 29%
John McCain: 16%
George W. Bush: 16%
Alan Keyes: 7%

What follows is a copy of an e-mail note I sent to my Advanced Placement Statistics students.

The recent issue of El Bat provides a wonderful statistical learning opportunity. I refer to page 4, and the article called "Cate Primary." I'm very serious... this presents an outstanding opportunity to learn about something that we have discussed... and that will be important for the AP Examination.

Please take the criticism to be constructive. This is a fabulous learning opportunity. (Have I said that enough?) And, you are now sophisticated enough to understand what I am going to write. I don't know when the article was written or reviewed, but I am sure it was before we studied the fairly recent topic of survey sampling.

Please take the time to look at the article.... and then see if the following makes some sense to you. If it does, I will be the happiest person in the world. If you come by my office and tell me (honestly) that you understand what I write below, I'll give you a piece of candy. That's how excited and happy I will be!

OK, here we go....

The article reports a margin of error of plus/minus 2%. As polished statisticians, you know that would require a sample size of approximately $1/(0.02)^2 = 2500$. The article states that the sample size is 128.
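For anyone who wants to check the arithmetic, here is a minimal Python sketch of that rule of thumb. It assumes the usual conservative shortcut for 95% confidence (taking 1.96 as roughly 2 and the proportion as 0.5, so the margin of error is about 1/√n). The function names are made up for illustration.

```python
import math

def approx_sample_size(margin):
    """Rule-of-thumb sample size for a 95% margin of error: n ≈ 1 / margin^2."""
    return round(1 / margin ** 2)

def approx_margin(n):
    """The same rule of thumb run backwards: margin ≈ 1 / sqrt(n)."""
    return 1 / math.sqrt(n)

print(approx_sample_size(0.02))      # sample size needed for plus/minus 2%
print(round(approx_margin(128), 3))  # conservative margin for a sample of 128
```

Under these assumptions a 2% margin needs about 2500 respondents, while 128 respondents support a margin of roughly 9% at best.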

If you have a random sample of 128 (and I am pretty sure that the sample discussed in the article is not random), the margin of error calculated from the sample results would be $1.96\sqrt{(0.32)(0.68)/128} \approx 0.0808$, or about 8%. This is considerably different from the reported 2%.
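The same calculation can be sketched in a few lines of Python. This assumes the standard 95% margin-of-error formula for a sample proportion, with the Bradley figure of 32% and the sample size of 128 taken from the article. The function name is hypothetical.

```python
import math

def margin_of_error(p_hat, n, z=1.96):
    """Margin of error for a sample proportion: z * sqrt(p_hat * (1 - p_hat) / n)."""
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

moe_128 = margin_of_error(0.32, 128)
print(round(moe_128, 4))  # about 0.0808, i.e. roughly 8%
```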

The computation of a margin of error is meaningless in this survey since the sample is not random. The sample was, I am sure, a convenience sample. That is, I am assuming that the voting was done by people who happened to be available at the time the poll was taken. (If the ballots were put in mailboxes, and if those who chose to do so turned them in, the sample would be a voluntary response sample.)

Even if the sample of 128 was random, the computation of margin of error would be relatively meaningless. Recall that for meaningful results, the population must be at least ten times the sample size. (Let me quickly state that there is nothing magic about the number ten. It is pretty much a figure that professional statisticians say provides a good dividing point between useful and non-useful. It is a rule of thumb.)

Let me take an extreme example to demonstrate why the computations produce relatively meaningless results if the sample size is too large relative to the population. Suppose, in the Cate Primary, all Cate community members voted, and let's assume that there are 300 community members. ( In this situation, our sample is the entire population.) Let's also assume that 32% (the El Bat figure) of the 300 said they would vote for Bill Bradley. Can we do a margin of error computation? Sure we can. Let's do it. The computation yields

$1.96\sqrt{(0.32)(0.68)/300} \approx 0.0528$

If we round out, we have a margin of error of 6%. AH, as statisticians we then envision a 95% confidence interval (.26, .38). And, we know that the interpretation on this is that there is a 95% chance that we have an interval that contains the proportion of the population who will vote for Bradley.
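As a quick check on this interval, here is a small Python sketch using the same formula with the assumed 300 community members. The helper name is invented for illustration. It prints the interval before rounding, then the interval you get after rounding the margin up to 6%.

```python
import math

def confidence_interval(p_hat, n, z=1.96):
    """95% interval for a proportion: p_hat plus/minus z * sqrt(p_hat * (1 - p_hat) / n)."""
    moe = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - moe, p_hat + moe

low, high = confidence_interval(0.32, 300)
print(f"unrounded interval: ({low:.3f}, {high:.3f})")

# Rounding the margin up to 6%, as in the note, widens the interval slightly.
rounded_moe = 0.06
print(f"interval with 6% margin: ({0.32 - rounded_moe:.2f}, {0.32 + rounded_moe:.2f})")
```

The unrounded interval is a little narrower than (.26, .38), but the point of the example does not depend on the exact endpoints.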

NOW, LET'S THINK!!!

How useful is the confidence interval (.26,.38)?

Answer: In this case, it is absolutely useless.

WHY?

Simply because we know the population parameter. It is 32%. Remember, in this extreme case, our sample is the entire population. So, why in the world would we use inference techniques to try to gain some meaningful information about a parameter when we already know the exact value of the parameter?

Not the greatest of analogies, but think of this. If you ask me for your test score on an AP Statistics Test, I might tell you it is 86. OK, now you know something. If, after you know your score, I now tell you that your score is between 80 and 90, I have given you worthless information.

Now, back to the Cate Primary. If the sample of 128 was randomly chosen, this represents about 43% of a population of 300. Sure you can calculate a margin of error, and for a sample of 128 the formula gives the 8% calculated earlier. But, since your sample size is large relative to the population, the 32% (a statistic) would be a reasonably good estimate of the population parameter and chances are great that the population parameter would be relatively close to this figure. Sure, you can produce the 8%, but the 8% spread in either direction from 32% is probably too large given the size of the sample relative to the population.

What's the point here? It is simply that you don't consider calculating a margin of error using the formula we have studied for a random sample that is relatively large compared to the population. (Review: Rule of thumb says population should be at least ten times larger than the random sample.) If the sample size is large relative to the population, a wise thing to do is to simply use the sample statistic as an estimate of the population parameter, realizing that you would have to allow for a bit of margin on either side of the statistic. (A margin of error calculation would probably produce an unrealistically large confidence interval.)
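One way to put a number on "unrealistically large" is the finite population correction. This is a standard adjustment that the note itself does not use, so treat the sketch below as an illustration only. It assumes a simple random sample drawn without replacement from the assumed population of 300. The function name is hypothetical.

```python
import math

def margin_of_error_fpc(p_hat, n, population, z=1.96):
    """Margin of error scaled by the finite population correction
    sqrt((N - n) / (N - 1)), for sampling without replacement."""
    base = z * math.sqrt(p_hat * (1 - p_hat) / n)
    fpc = math.sqrt((population - n) / (population - 1))
    return base * fpc

# A sample of 128 out of an assumed 300 community members.
print(round(margin_of_error_fpc(0.32, 128, 300), 3))

# The extreme case from the note: the "sample" is the whole population.
print(margin_of_error_fpc(0.32, 300, 300))
```

With 128 out of 300, the corrected margin comes out near 6% rather than the 8% from the uncorrected formula. When everyone votes it drops to zero, which matches the argument above that there is nothing left to estimate.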

OK, AP STATers... If you come by my office and tell me (honestly) that you understand all that is written above, you get a piece of candy.

MATH POWER TO ALL.