
Sanderson M. Smith

The analysis below was prepared by Diann C. Resnick, Bellaire High School, Bellaire, Texas. It appears in Herkimer's Hideaway with permission from Diann.

COMMON MISTAKES ON THE 2002 AP STATISTICS EXAM

As in past years, the 2002 statistics exam reading was quite an educational experience. The best part of the reading was seeing how well written some of the students' papers were. These papers clearly showed that good teaching and learning were happening in the classroom. With the increase in the number of exams (about 50,000 this year), there were wide variations in the responses and in the students' level of preparedness. Listed below are general comments on the exam as a whole, followed by comments on each question.

General Comments:

Many Students:

o seemed to have difficulty recognizing the difference between a population and a sample.

o were very sloppy with statistical notation and definitions. It was not uncommon to see students use creative notation (p(hat) to represent a population proportion), use the word "mean" to represent proportions, or use p/(pi) to represent a mean.

o anticipated test questions and often answered not the question asked, but what they thought the question was or wanted it to be. Many students did not read the questions carefully.

o seemed to think that the questions were meant to be tricky and therefore, tried to be creative when a straightforward answer was best.

o wrote more than was necessary to answer a question. It often appeared that some students were not sure of their answer, so they added extraneous material. In doing so, they often wrote incorrect statements and were either penalized for the extraneous incorrect statements, or the statements were considered parallel solutions. In the case of parallel solutions, the worse of the two answers is graded, and many students lost credit for a problem.

o did not proofread their answers. They often left out a critical word in their answer or wrote contradictory statements. If students had taken the time to reread their work, they might have caught these careless mistakes.

o still had difficulty in budgeting their time on the test. They failed to leave about 30 minutes for Question #6 - the question that counts 25% of the free response section.

o had difficulty in interpreting graphs. They seemed to think that all graphs need to be discussed in terms of "center, shape, and spread," and did not look at the graphs in the context of the problem or at what the question was asking.

Question 1: (Einstein's and Newton's Theory of Gamma)

A great deal of interpretation and communication was needed to successfully answer question 1. The students generally did a good job of communicating, and there were many good, short, concise responses.

Some students:

o confused statistical terminology in critical places. This often resulted in their responses being either incorrect or ambiguous.

o used the term "margin of error" incorrectly. They did not seem to understand that the margin of error is a number and used the term to represent an interval. This is an incorrect interpretation and was counted against them in the grading of their answer.

o were unsuccessful in distinguishing between estimate and margin of error and the idea of an interval as a set of likely values of an estimate.

o in part (a) used the terms "experimental values," "observations," "data," and "estimates" interchangeably. It was not clear from the context of their response whether the experimental values were estimates, margins of error, or something else.

o used the terms "observations" and "data" in a generic sense and did not seem to think of these words in statistical terms. This might be due to the lack of practice in using precise language in the classrooms or a lack of understanding that the graphic information was referring to sets of estimates rather than sets of data.

o looked at the graph and interpreted the interval to be some type of boxplot. These students thought the estimate and margin of error represented data and the variability of the data.
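The vocabulary at issue in Question 1 can be illustrated with a minimal Python sketch. All of the numbers are hypothetical and are not taken from the exam; the helper name interval_from is likewise just for illustration. The point is that the estimate and the margin of error are each a single number, while the interval is the set of values built from them.

```python
def interval_from(estimate, margin_of_error):
    """Build the interval (low, high) from an estimate and its margin of error."""
    return estimate - margin_of_error, estimate + margin_of_error

# Hypothetical numbers, used only to illustrate the vocabulary.
estimate = 1.00         # point estimate: a single number
margin_of_error = 0.25  # margin of error: also a single number, not an interval
low, high = interval_from(estimate, margin_of_error)

theory_value = 0.5      # hypothetical value predicted by some theory
is_plausible = low <= theory_value <= high
print(f"interval = ({low}, {high}); theory value inside interval: {is_plausible}")
```

Judging a theory by whether its predicted value falls inside the interval, rather than by how close it is to the point estimate alone, is the kind of reasoning parts (b) and (c) called for.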

In parts (b) and (c) many students:

o focused on the point estimates as carrying more information than the interval estimates, and in some cases, ignored the intervals completely in their assessment of the evidence for or against a particular theory.

o seemed to feel that the converging behavior of the estimates was enough to justify one or the other theory. These students did not consider the necessity of evaluating the uncertainties in those estimates.

o used the statistical terms "error," "range," and "variability" incorrectly. Their use was frequently ambiguous in the context of the problem and it was unclear whether the student was referring to data or to a set of estimates.

o used the terms "impossible," "certain," and "proved" incorrectly. In scientific and statistical arenas such levels of certainty are generally unacceptable. Those terms should not be used in data analysis and should generally be avoided unless discussing theorems of mathematical statistics. In the grading, students were penalized for making a statement like "the graph proves...".

o looked at the graphical display and interpreted the question to be about regression and/or the law of large numbers. It was not unusual for a student to think that the graphical display was a residual plot.

Question 2: Design and Randomization (The Boot Problem)

In part (a), many students:

o provided a diagram with no explanation, even though the problem specifically stated, "Include a few sentences on how it (the design) would be implemented."

o used a design that was not incorrect but was not as good as the paired or crossover design.

o described a Completely Randomized Design with two treatment groups as their design method. Although this is a correct design, it is not as good as the paired (both treatments on one subject) or crossover designs.

o suggested a paired design in which pairs of "similar" subjects would be grouped. Students did not receive full credit for this approach.

o identified potential blocking variables such as gender, occupation, climate, etc. Then they randomly assigned treatments to subjects within the blocks. While this indicates a high level of statistical thinking, it is not quite as good as a paired or crossover design.

o described one of the two designs that would constitute a complete answer (paired design or crossover design) but failed to discuss randomization at all.

o used the language of sampling in their descriptions (e.g., stratified samples, SRS) and did not understand the difference between selecting a random sample and randomly allocating subjects to treatments.

o incorrectly used the terminology or vocabulary of experiments; e.g., "allocate volunteers into two blocks..."

Implementation Issues:

Many students:

o described a random assignment into two groups but either did not identify the treatments or incorrectly identified the treatments that were to be compared.

o failed to understand that the design was to use the 100 volunteers given, but rather concentrated on randomly selecting volunteers from the population.

o failed to describe an appropriate randomized experiment to compare current and new treatments.

o used a "coin tossing" randomization scheme to assign subjects to treatments. This was accepted, but students did not recognize that this scheme is not as good as randomization schemes that assign an equal number of subjects to each treatment group.

o alphabetized the list of volunteers, numbered the names on this list from 1 to 100, and then assigned the even-numbered names to Group 1 and the odd-numbered names to Group 2. These students did not recognize that this is not a method of randomization.

o described incomplete randomization schemes. For example, "randomly allocate volunteers into two groups"; or "randomly assign volunteers into two groups using an SRS" without any description of the randomization process.

o assigned numbers to boots or subjects and mentioned a random digit table but failed to explain or describe the random formation of treatment groups.
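The difference between the randomization schemes mentioned above can be sketched in Python. This is only an illustration of the mechanics of random allocation for the 100 volunteers; the function names and the seed are assumptions made for the example, and the sketch does not implement the paired or crossover design that the comments above describe as the better choice.

```python
import random

def assign_equal_groups(subject_ids, seed=None):
    """Shuffle the subjects and split them into two equal-sized groups."""
    rng = random.Random(seed)
    shuffled = list(subject_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

def assign_by_coin_toss(subject_ids, seed=None):
    """Toss a fair coin for each subject; group sizes are left to chance."""
    rng = random.Random(seed)
    group_1, group_2 = [], []
    for subject in subject_ids:
        (group_1 if rng.random() < 0.5 else group_2).append(subject)
    return group_1, group_2

volunteers = range(1, 101)  # the 100 volunteers, numbered 1 to 100
current, new = assign_equal_groups(volunteers, seed=2002)
print(len(current), len(new))  # always 50 and 50

toss_1, toss_2 = assign_by_coin_toss(volunteers, seed=2002)
print(len(toss_1), len(toss_2))  # sizes sum to 100 but need not be equal
```

A written answer would still need to say, in words, how the numbers are assigned and which group receives which treatment; the code only stands in for the random digit table.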

In part (b): Double Blinding - Many students:

o indicated that they understood that double blinding involved having two parties unaware of the treatment assignments; however:

(a) identified the volunteers as one party and someone other than the evaluator as the second party. Many students used words like "administrator", "conductor", "manufacturer" as the second party. These words did not adequately convey the idea that the evaluator was the required second party.

(b) identified that the second party should be the evaluator but stated that this was not possible when, in reality, it was possible.

(c) identified that the second party should be the "distributor" of the boots. These students did not understand that "distributor" and "evaluator" were not the same.

o failed to identify the volunteers (subjects) as needing to be kept unaware of treatment assignment.

o stated that there was no need for blinding since the subjects were randomly assigned into treatment groups.

Question 3: Probability (New High School Runners)

In part (a), students often:

o used a 2-sided analysis for the "2 standard deviation" argument (and so claimed <5% rather than <2.5%).

o claimed the event was "unlikely" based on more than 2 standard deviations from the mean but failed to invoke normality.

o claimed that random variables cannot be more than one standard deviation below the mean.

o tried to turn this problem into an inference problem. Most often, they believed (at problem's end) that they had done a test. Occasionally, they believed that they had constructed a confidence interval for an unknown mean.

In part (b), students often:

o did not know how to compute the standard deviation, s, for the team.

o confused the team time with the average runner's time, i.e. divided by 4 to get 4.725.

o failed to correctly carry results from part (b) into part (c).

o calculated the probability from the wrong tail of the distribution in parts (a) and/or (c).

For part (c), students often interpreted the team time <18.4 to mean "equal to or less than" 18.3 or "equal to or less than" 18.39.
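The team-time calculation can be sketched as follows, under stated assumptions. The team mean of 18.9 is implied by the article (4 times 4.725), and 18.4 is the cutoff quoted above; the four individual standard deviations are illustrative placeholders, and independence and normality of the runners' times are assumed.

```python
from math import sqrt
from statistics import NormalDist

# Assumed, illustrative standard deviations for four independent runners
# (these values are not given in this article).
runner_sds = [0.15, 0.16, 0.14, 0.15]

team_mean = 18.9  # 4 * 4.725, implied by the article; do NOT divide by 4
# Variances add for independent times; standard deviations do not.
team_sd = sqrt(sum(sd ** 2 for sd in runner_sds))

team_time = NormalDist(mu=team_mean, sigma=team_sd)
# Lower tail only: P(team time < 18.4).
p_below = team_time.cdf(18.4)
print(f"team sd = {team_sd:.3f}, P(T < 18.4) = {p_below:.4f}")
```

Because the normal model is continuous, P(T < 18.4) and P(T <= 18.4) are the same, so there is no reason to replace 18.4 with 18.3 or 18.39.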

Question 4: Regression (Airplane Operating Costs and Passenger Seats)

Overall, students demonstrated satisfactory understanding of scatterplots, correlation, and computer output for linear regression. They were able to write the equation of the least squares regression line and to determine the correlation coefficient from the information provided in the computer output. Many students seemed unsure about how to interpret correlation. Some tried to explain correlation using the coefficient of determination, r². Few did so successfully. In part (c), most students observed that the given regression line would be a poor fit for the restricted data. The vast majority of them referenced the negative association among these points as their justification. A few commented on the pattern in the residuals over the 250 to 350 passenger seat range.

Part (a): Determining the equation of the least squares regression line from computer output.

Some students:

o could not interpret the computer output. Often the student misinterpreted the value s (standard deviation for the line) to represent the value for the slope of the line.

o did not define their variables carefully. For example, some used x = # of passengers or y = operating cost per plane.

o did not include y(hat) in the regression equation. Of those who did, most did not define it correctly as the predicted operating cost per hour.

o treated the slope and the y-intercept as variables.

o wrote the equation of the least squares regression line as y = a + bx and did not recognize that the question was asking them to write the equation for the given data.

In Part (b): Calculating and interpreting the correlation coefficient, r.

Many students:

o thought that r² was the correlation coefficient.

o attempted to use adjusted r² from the computer output instead of r².

o included all four components of the correlation interpretation (strength, direction, form, and context) in their responses.

o described r = 0.755 as "weak" or "fairly weak" or "extremely weak". This suggests that students have not encountered enough real data sets to recognize that this is a moderately strong value of r.

o wrote the value of r in terms of a percent.

o wrote numbers for r such as 7.55 or 4.02 and did not seem to recognize that the value of r must be a number between -1 and 1.

o attempted to explain r² but did not do so correctly. Incorrect interpretations, such as "r² is the percent of data explained by the line," were common.

o often correctly explained the meaning of r but then gave an incorrect interpretation of r². This was treated as a parallel solution and counted as incorrect.

o were careless in writing answers and made transcription errors, such as writing the correct value of r, 0.755, as 0.0755.
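The step from computer output to r can be sketched in Python. The sketch assumes the output reports R-squared as 57.0%, which is what r = 0.755 implies; the helper name and the slope value are placeholders, and only the sign of the slope matters.

```python
from math import copysign, sqrt

def correlation_from_r_squared(r_squared, slope):
    """Return r given r-squared (as a proportion) and the fitted slope."""
    if not 0.0 <= r_squared <= 1.0:
        raise ValueError("r-squared must be a proportion between 0 and 1")
    # r is the square root of r-squared, carrying the sign of the slope.
    return copysign(sqrt(r_squared), slope)

# 57.0% written as a proportion; the slope value is a positive placeholder.
r = correlation_from_r_squared(0.570, slope=1.0)
print(round(r, 3))  # 0.755
```

The range check reflects the point made above: values such as 7.55 or 4.02 cannot be correlations, and r is not reported as a percent.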

In part (c): Evaluating the quality of the given linear regression line over a restricted range.

Many students:

o made generic comments like, "Anytime you remove points, you will have to calculate a new regression line" rather than focusing on the specific context of the scatterplot provided.

o mistook the question to be asking them about the difference between predicting and extrapolating.

o did a very nice job at constructing a residual plot of the restricted data and then indicated that a negative correlation existed.

o often talked about influential points being removed from the graph, but did not describe what would happen to the relationship among data in the restricted domain.

o removed only the three points in the upper right-hand corner and not the lower two points.

In general many students wrote rambling explanations and misused statistical terminology.

Question 5: (Inference: Early Birds and Night Owls)

Many Students:

o failed to provide conditions or gave an incomplete set of conditions for using the selected statistical test.

o listed the conditions for using the selected statistical test, but did not check them.

o did not provide linkage between their computation and conclusion.

o failed to interpret their conclusion in context of the problem.

o did not read the question in part (b) carefully and tested their new hypotheses rather than the ones listed in the statement of part (a).

o did not seem to understand the question in part (a). They often gave two distinct sets of hypotheses either by repeating the hypotheses listed in the original statement or by giving the same set of hypotheses twice, just with a different arrangement of parameters.

o defined their hypotheses using improper notation. It was not unusual to see students use p(hat) for the notation of a parameter without clearly indicating that it was intended as a population measure.

o failed to identify the parameter (e.g., mean, median, or proportion) used in part (a). They gave statements such as "E is the early birds who recall no dreams".

o incorrectly described their conclusion using phrases such as "at the 95% confidence level, we reject the null hypothesis". These students did not seem to understand the difference between a confidence level and an alpha level.

o reversed the direction of the inequality in the alternative hypothesis, or wrote their alternative hypothesis as a two-tailed test. When students reversed the direction of the inequality, they did not seem to be able to recognize this error, even with a large p-value.
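The effect of reversing the alternative hypothesis can be sketched in Python, assuming a pooled two-proportion z test. The counts are hypothetical and are not the exam data; the function name is an assumption made for the example.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z(x1, n1, x2, n2):
    """Pooled two-proportion z statistic for H0: p1 = p2."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# Hypothetical counts, not the exam data.
z = two_proportion_z(x1=60, n1=100, x2=45, n2=100)
std_normal = NormalDist()

p_greater = 1 - std_normal.cdf(z)         # Ha: p1 > p2
p_less = std_normal.cdf(z)                # Ha: p1 < p2 (direction reversed)
p_two_sided = 2 * min(p_greater, p_less)  # Ha: p1 != p2

print(f"z = {z:.3f}")
print(f"p-value for Ha: p1 > p2  -> {p_greater:.4f}")
print(f"p-value for Ha: p1 < p2  -> {p_less:.4f}")
print(f"two-sided p-value        -> {p_two_sided:.4f}")
```

The two one-sided p-values add to 1, so a p-value near 1 when the sample difference looks substantial is a signal that the inequality may have been written in the wrong direction.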

Question 6: (Investigative Task: Comedy Shows: S or F?)

Part (a) asked students to create and interpret a 95% confidence interval for a proportion.

Many students:

o failed to check the appropriate assumptions for this confidence interval.

o did not appear to understand that the interpretation of a confidence interval is meaningless unless the appropriate conditions have been satisfied.

o omitted the interpretation of the confidence interval even though the question specifically asked for it in part (a).

o gave the interpretation of the confidence interval in part (b). Frequently students incorrectly interpreted the interval as "95% of the population is between (0.517, 0.625)" or "95% of the time the proportion is in the interval (0.517, 0.625)." Students struggled with the interpretation of the confidence interval.

o did not write the meaning of the confidence interval in context of the problem.

Part (b) asked students to interpret the level of confidence.

Many students:

o gave the interpretation of the interval, (0.517, 0.625) requested in part (a) rather than an interpretation of the level of confidence, 95%.

o interpreted the level of confidence incorrectly in terms of the specific interval from part (a), (0.517, 0.625). This often took a form such as, "95% of confidence intervals from repeated sampling would have a proportion in the interval (0.517, 0.625)." They did not seem to understand that repeated sampling produces different intervals.
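The meaning of the confidence level can be illustrated with a small simulation in Python. The true proportion, the sample size, the number of repetitions, and the function names are all assumptions chosen for the sketch. Each repetition draws a new sample and builds a new interval, and the confidence level describes how often such intervals capture the true proportion.

```python
import random
from math import sqrt

def proportion_interval(successes, n, z=1.96):
    """Large-sample 95% interval for a proportion: p_hat +/- z * SE."""
    p_hat = successes / n
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

def coverage(true_p, n, repetitions, seed=0):
    """Fraction of repeated-sample intervals that contain true_p."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(repetitions):
        # Each repetition is a fresh sample, so the interval changes.
        successes = sum(rng.random() < true_p for _ in range(n))
        low, high = proportion_interval(successes, n)
        hits += low <= true_p <= high
    return hits / repetitions

# Hypothetical population proportion and sample size.
rate = coverage(true_p=0.57, n=300, repetitions=2000)
print(f"fraction of intervals containing the true proportion: {rate:.3f}")
```

The fraction printed should be close to 0.95, which is a statement about the method over many samples and not about any one interval such as (0.517, 0.625).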

Part (c) asked students to perform a hypothesis test to compare two proportions.

Many students:

o had difficulty with notation. They often stated their hypotheses in terms of sample statistics, rather than in terms of the population parameters.

o forgot to check all appropriate conditions.

o did a good job with computations and interpretation in context.

Part (d)

Many students:

o failed to recognize that the difference in sample sizes created an imbalance in the pooled estimate. Students who recognized the need to balance the sample size were generally successful.
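The imbalance described in part (d) can be sketched in Python with hypothetical counts (not the exam data). When one sample is much larger than the other, the pooled proportion is pulled toward the larger sample, whereas an equally weighted average treats the two groups the same.

```python
def pooled_proportion(x1, n1, x2, n2):
    """Pooled estimate: total successes over total sample size."""
    return (x1 + x2) / (n1 + n2)

# Hypothetical counts: the first sample is much larger than the second.
x1, n1 = 540, 900   # sample proportion 0.60
x2, n2 = 40, 100    # sample proportion 0.40

pooled = pooled_proportion(x1, n1, x2, n2)  # dominated by the larger sample
equal_weight = (x1 / n1 + x2 / n2) / 2      # each group counts equally

print(f"pooled estimate       = {pooled:.2f}")
print(f"equal-weight estimate = {equal_weight:.2f}")
```

Which weighting is appropriate depends on how the groups are represented in the population of interest; the sketch only shows that the two estimates can differ noticeably when the sample sizes are unequal.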

==============================

Suggestions for Teachers

o Tests of inference involve more than just finding a number. That is only one part of a complete answer.

o Written communication is more important than numbers.

o Have students say what they need to say and then stop. Writing more is not necessarily better.

o Students should practice in class with tests that are similar in format to the AP Exam.

o Students should be given data in many different forms and formats. This might best be accomplished by giving students problems and examples from different books.

o Students need practice in reading computer output.

o Often students only see examples of bivariate data that has an r value greater than .9 or smaller than -.9. They should be exposed to data that has weaker correlation.

o It would be helpful for teachers to write a list of statistical words (range, mean, data, variance, etc.) that are often used casually by students. When used in context of a statistics problem, they should be used correctly.

o If a paper has two (parallel) solutions to a problem and one is not correct, the incorrect one is the one that is scored.

o It is helpful if students do statistics (design experiments, collect data, work on computer labs, etc.) throughout the year, not just learn about statistics.

o Write tests with questions taken from many different sources. It forces students to think.

o When students design experiments, it would be helpful if the design made sense and was practical.

o It is important that teachers and students have practice in grading papers holistically. One way this can be accomplished is by using old AP exams.

o In probability problems, pictures often help the student.

o Students often did not use their calculator in the most efficient manner. They did not seem to know how to use the calculator to perform statistical tests and made errors in the computation of a confidence interval or a Z value.
