Some Thoughts Relating to the AP Statistics Exam

SOME THOUGHTS RELATING TO THE
ADVANCED PLACEMENT STATISTICS EXAMINATION

This portion is directed at students:

o Relax, and think. Remember that everyone else taking the exam is in a situation identical to yours. Realize that the problems will probably look detailed compared to problems you have encountered in other math courses.

o Make sure your calculator is functioning properly. Insert new batteries a day or so before the exam, and make sure all systems are "go." Bring a backup calculator, if possible.

o Read problems carefully. Bring a colored pencil to highlight key words and phrases as you read the questions.

o Don't confuse median and mean. They are both measures of center, but, for a given data set, they may differ by a considerable amount.

* mean > median usually indicates a distribution skewed right

* mean < median usually indicates a distribution skewed left

o Don't confuse the coefficient of correlation and the slope of the least-squares regression line.

* A slope close to 1 or -1 doesn't mean strong correlation.

* An r value close to 1 or -1 doesn't mean the slope of the least-squares regression line is close to 1 or -1.

* The relation between b (slope of regression line) and r (coefficient of correlation) is b = r(S_y/S_x), where S_y and S_x are the standard deviations of the y-values and x-values. This is on the formula sheet provided for the exam.

* Remember that r² > 0 doesn't mean r > 0. For instance, if r² = 0.81, then r = 0.9 or r = -0.9.
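The distinction between slope and correlation can be checked numerically. The following is a minimal sketch in Python; both small data sets are hypothetical and were made up only for illustration. One set has a slope of 100 with r = 1, and the other has a slope of exactly 1 with r of only about 0.53. In both cases b = r(S_y/S_x) holds.

```python
from statistics import mean, pstdev

def slope_and_r(xs, ys):
    """Return (b, r): least-squares slope and correlation coefficient."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sxx, sxy / (sxx * syy) ** 0.5

xs = [1, 2, 3, 4, 5]             # hypothetical x-values
steep = [100 * x for x in xs]    # perfect line: slope 100, r = 1
noisy = [1, 5, 0, 7, 5]          # hypothetical scattered y-values

b1, r1 = slope_and_r(xs, steep)
b2, r2 = slope_and_r(xs, noisy)

print(b1, r1)                    # slope far from 1, perfect correlation
print(b2, round(r2, 2))          # slope of 1, only moderate correlation
# b = r * (S_y / S_x); the ratio is the same for sample or population st. dev.
print(r2 * pstdev(noisy) / pstdev(xs))
```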

o Remember that the least-squares regression line contains the point (mean x, mean y), where mean x is the mean of the x-values, and mean y is the mean of the y-values.

o A coefficient of correlation near 0 doesn't necessarily mean there are no meaningful relationships to be observed between the two data sets.

x	2	3	4	5	6	7	8	9	10	11	12
y	6	30	8	50	10	70	12	90	14	110	16

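In this case, r = .38, but a scatterplot reveals two perfectly linear patterns: the points with even x lie on the line y = x + 4, and the points with odd x lie on the line y = 10x. Moral of story: Whenever possible, look at the "shape" of the data.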
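As a rough check of the r value quoted for this table, here is a short Python sketch. It computes r for all eleven points and then separately for the even-x and odd-x points. The helper name corr is just an illustrative choice.

```python
from statistics import mean

xs = list(range(2, 13))
ys = [6, 30, 8, 50, 10, 70, 12, 90, 14, 110, 16]

def corr(xs, ys):
    """Correlation coefficient r of paired values."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

r_all = corr(xs, ys)

# Split the points by whether x is even or odd.
even = [(x, y) for x, y in zip(xs, ys) if x % 2 == 0]
odd = [(x, y) for x, y in zip(xs, ys) if x % 2 == 1]
r_even = corr(*zip(*even))
r_odd = corr(*zip(*odd))

print(round(r_all, 2), r_even, r_odd)
```

The weak overall r hides the fact that each subgroup is perfectly linear, which is why a scatterplot is worth a look.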

o Be careful with the concept of simple random sample (SRS). For instance, if each individual in a group has an equal probability of being chosen in a sample, it doesn't follow that the sample is an SRS. Consider a class of 6 boys and 6 girls. I want to randomly pick a committee of two students from this group. I decide to flip a coin. If "heads," I will choose two girls by a random process. If "tails," I will choose two boys by a random process. Now, each student has an equal probability (1/6) of being chosen for the committee. However, the chosen two students do not represent an SRS of size two picked from members of the class, for the selection process does not allow for a committee consisting of one boy and one girl. To have an SRS of size two from this class of 6 boys and 6 girls, each committee of two students would have to have an equal probability of being chosen.
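The committee example can be verified by brute-force enumeration. This Python sketch, with student labels G1..G6 and B1..B6 invented for illustration, lists every committee the coin-flip scheme can produce and compares it with a true SRS of size two.

```python
from fractions import Fraction
from itertools import combinations

girls = [f"G{i}" for i in range(1, 7)]   # hypothetical labels
boys = [f"B{i}" for i in range(1, 7)]

# Coin-flip scheme: heads -> two girls at random, tails -> two boys at random.
scheme = {}
for group in (girls, boys):
    pairs = list(combinations(group, 2))              # 15 same-sex pairs
    for pair in pairs:
        scheme[pair] = Fraction(1, 2) / len(pairs)    # (1/2)(1/15) = 1/30

def inclusion_probability(student):
    """Probability that a given student ends up on the committee."""
    return sum(p for pair, p in scheme.items() if student in pair)

all_pairs = list(combinations(girls + boys, 2))       # 66 possible committees
srs_probability = Fraction(1, len(all_pairs))         # what an SRS requires
mixed_probability = scheme.get(("G1", "B1"), Fraction(0))

print(inclusion_probability("G1"), inclusion_probability("B4"))
print(srs_probability, mixed_probability)
```

Every student has inclusion probability 1/6, yet a boy-girl committee has probability 0 instead of the 1/66 that an SRS would give it.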

o Look at graphs and displays carefully. For graphs, note carefully what is represented on the axes, and be aware of number scale. Some questions that provide tables of numbers and graphs relating to the numbers can be answered simply by "reading" the graphs.

o Don't confuse standard deviation and variance. Remember that standard deviation units are the same as the data units, while variance is measured in square units.

o Be aware of numerical statistical changes when transformations are made on a data set, W.

* Adding the same number to each number in W simply shifts the data. This doesn't change standard deviation and variance.

* Multiplying all numbers in W by a constant does change standard deviation and variance. For instance, if all members of W are multiplied by 4, then the new set has a standard deviation that is 4 times the standard deviation of W, and a variance that is 16 times the variance of W.

Simple examples (standard deviations and variances here are the population versions, computed by dividing by n):

Set S_x	Mean	St. Dev.	Variance	Range
{1,2,3,4,5}	3	1.414	2	4

Add 7 to each element of S_x, creating set S_x+7.

Set S_x+7	Mean	St. Dev.	Variance	Range
{8,9,10,11,12}	10	1.414	2	4

Multiply elements of S_x by 4, creating the set S_4x.

Set S_4x	Mean	St. Dev.	Variance	Range
{4,8,12,16,20}	12	5.6569	32	16

Multiply elements of S_x by 4, then add 7, creating the set S_4x+7.

Set S_4x+7	Mean	St. Dev.	Variance	Range
{11,15,19,23,27}	19	5.6569	32	16
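The four sets above can be reproduced with a few lines of Python. This is only a sketch; it assumes the population standard deviation and variance (dividing by n), which is what matches the tabulated values.

```python
from statistics import mean, pstdev, pvariance

def summary(data):
    """Return (mean, population st. dev., population variance, range)."""
    return mean(data), pstdev(data), pvariance(data), max(data) - min(data)

s_x = [1, 2, 3, 4, 5]
shifted = [x + 7 for x in s_x]       # add 7 to each element
scaled = [4 * x for x in s_x]        # multiply each element by 4
both = [4 * x + 7 for x in s_x]      # multiply by 4, then add 7

for name, data in [("S_x", s_x), ("S_x+7", shifted),
                   ("S_4x", scaled), ("S_4x+7", both)]:
    m, sd, var, rng = summary(data)
    print(name, m, round(sd, 4), var, rng)
```

Adding 7 changes only the mean. Multiplying by 4 multiplies the mean, standard deviation, and range by 4 and the variance by 16.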

o Be aware of statements (a) and (b), but be careful with them, since they are simplified versions of sophisticated concepts.

(a) When combining two independent sets by addition,

-means add;

-standard deviations do not add;

-variances add.

(b) When combining two independent sets by subtraction,

-means subtract;

-standard deviations do not subtract;

-variances add.

Simple examples:

Let S = {5, 9} and T = {1,3}.

Then set S+T = {5+1,5+3,9+1,9+3} = {6,8,10,12}, and

set S-T = {5-1,5-3,9-1,9-3} = {2,4,6,8}.

	Set S	Set T	Set (S+T)	Set (S-T)
Mean	7	2	9	5
St. Dev.	2	1	2.2361	2.2361
Variance	4	1	5	5

Note that:

mean(S+T) = mean(S) + mean(T)

mean(S-T) = mean(S) - mean(T)

variance(S+T) = variance(S-T) = variance(S) + variance(T)
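The S and T example can be checked directly. The Python sketch below forms every sum and every difference of one element of S with one element of T, which is how the sets S+T and S-T were built above, and uses population variances.

```python
from itertools import product
from statistics import mean, pstdev, pvariance

S = [5, 9]
T = [1, 3]

sums = [s + t for s, t in product(S, T)]     # [6, 8, 10, 12]
diffs = [s - t for s, t in product(S, T)]    # [4, 2, 8, 6]

print(mean(sums), mean(S) + mean(T))                  # means add
print(mean(diffs), mean(S) - mean(T))                 # means subtract
print(pvariance(sums), pvariance(diffs),
      pvariance(S) + pvariance(T))                    # variances add in both cases
print(round(pstdev(sums), 4), pstdev(S) + pstdev(T))  # st. devs. do not add
```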

o Recognize a binomial distribution situation when it arises. Thinking in terms of slots, if you have a fixed number of slots, each slot is filled independently of the others, and the probability of getting a "success" in each slot is constant, then you have a binomial setting. Consider, for instance, rolling a die ten times. There are ten slots to be filled, and the probability of filling any slot with the outcome "6" is 1/6.

Using the TI-83, the probability of getting exactly three 6's is

(₁₀C₃)*(1/6)³*(5/6)⁷

= binompdf(10,1/6,3) = 0.155045, or about 15.5%.

The probability of getting fewer than four 6's (that is, at most three) is

binomcdf(10,1/6,3) = 0.93027, or about 93%.

The probability of getting four or more 6's in 10 rolls of a single die is therefore 1 - 0.93027 = 0.06973, or about 7%.

If x is the number of 6's obtained when ten dice are rolled, then

mean(x) = 10(1/6) = 1.6667, and

st.dev(x) = sqrt[10(1/6)(5/6)] = 1.1785

Another example:

Assume a large population is 32% Hispanic. If a random sample of 15 people is chosen, this can be represented by a binomial model with 15 slots. The probability of "success" for each slot is 0.32.

The probability that this sample would contain at least 5 Hispanics is

1 - binomcdf(15,.32,4) = 1 - 0.4477 = 0.5523, or about 55%.

If x represents the number of Hispanics in a random sample of size 15, then

mean(x) = 15(.32) = 4.8, and

st.dev(x) = sqrt[15(.32)(.68)] = 1.8067
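For readers without a TI-83 at hand, the binomial values above can be reproduced in Python. The functions binom_pmf and binom_cdf below are hypothetical stand-ins written for this sketch; they are meant to mirror what binompdf and binomcdf compute, not to be the calculator's own implementation.

```python
from math import comb, sqrt

def binom_pmf(n, p, k):
    """P(exactly k successes in n independent slots)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def binom_cdf(n, p, k):
    """P(at most k successes in n independent slots)."""
    return sum(binom_pmf(n, p, i) for i in range(k + 1))

# Ten rolls of a die, success = rolling a 6.
print(round(binom_pmf(10, 1/6, 3), 6))        # exactly three 6's
print(round(binom_cdf(10, 1/6, 3), 5))        # fewer than four 6's
print(round(1 - binom_cdf(10, 1/6, 3), 5))    # four or more 6's

# Sample of 15 from a population that is 32% Hispanic.
at_least_5 = 1 - binom_cdf(15, 0.32, 4)
print(round(at_least_5, 2))
print(15 * 0.32, round(sqrt(15 * 0.32 * 0.68), 4))   # mean and st. dev.
```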

o Binomial distribution ---> normal distribution as number of trials increases. If N is the number of trials in a binomial setting, and if p represents the probability of "success" in each trial, then a general rule of thumb states that a normal distribution can be used to approximate the binomial distribution if Np is at least 5 and N(1-p) is at least 5.

o Recognize a discrete random variable situation when it arises (and don't confuse it with a binomial situation).

Let x = the number of heads obtained when five coins are tossed.

Value of x	0	1	2	3	4	5
Probability	1/32=.03125	5/32=.15625	10/32=.3125	10/32=.3125	5/32=.15625	1/32=.03125

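mean(x) = 0(.03125) + 1(.15625) + 2(.3125) + 3(.3125) + 4(.15625) + 5(.03125) = 2.5.

var(x) = .03125(0-2.5)² + .15625(1-2.5)² + .3125(2-2.5)² + .3125(3-2.5)² + .15625(4-2.5)² + .03125(5-2.5)² = 1.25.

st.dev(x) = sqrt[var(x)] = sqrt(1.25) = 1.118.

Note that this particular x also happens to be binomial (5 slots, probability of heads 1/2), so the results agree with 5(1/2) = 2.5 and sqrt[5(1/2)(1/2)] = 1.118. The table method, however, works for any discrete random variable.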
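The same mean and variance can be computed from the probability table in a few lines. This Python sketch uses exact fractions so that no rounding enters until the final square root.

```python
from fractions import Fraction

values = range(6)                                    # x = 0, 1, ..., 5 heads
probs = [Fraction(c, 32) for c in (1, 5, 10, 10, 5, 1)]

# mean = sum of x * P(x); variance = sum of P(x) * (x - mean)^2
mean_x = sum(x * p for x, p in zip(values, probs))
var_x = sum(p * (x - mean_x) ** 2 for x, p in zip(values, probs))
sd_x = float(var_x) ** 0.5

print(mean_x, var_x, round(sd_x, 3))
```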

o Simpson's Paradox:

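A comparison that holds within each of several groups can reverse when the groups are combined. This usually involves percentages computed from groups of unequal size.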

Example:

	WIN	TOTAL	% WIN		WIN	TOTAL	% WIN
A: First Half	80	100	80%	B: First Half	78	100	78%
A: Second Half	20	40	50%	B: Second Half	2	5	40%

	WIN	TOTAL	% WIN
A: Both Halves	100	140	71.4%
B: Both Halves	80	105	76.2%

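In this example, A's winning percentage exceeds B's for both of two periods, but B has a better overall winning percentage. The reversal occurs because nearly all of B's games (100 of 105) fall in the first half, when both A and B won far more often.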
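The reversal in this example can be confirmed by recomputing the percentages from the raw win and total counts. The Python sketch below stores the counts in a dictionary; the layout and names are only illustrative choices.

```python
# (wins, total games) for each competitor in each half
records = {
    "A": {"first": (80, 100), "second": (20, 40)},
    "B": {"first": (78, 100), "second": (2, 5)},
}

def pct(wins, total):
    return 100 * wins / total

by_half = {name: {half: pct(*wt) for half, wt in halves.items()}
           for name, halves in records.items()}

overall = {name: pct(sum(w for w, _ in halves.values()),
                     sum(t for _, t in halves.values()))
           for name, halves in records.items()}

print(by_half)
print({name: round(p, 1) for name, p in overall.items()})
```

A leads in each half, yet the combined percentage favors B, because the combined figure is a weighted average and the weights differ sharply between A and B.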

--------------------------

o Realize that logarithmic transformations can be practical and useful. Among other things, taking logs cuts down the magnitude of numbers. Also, if {(x,y)} has an exponential pattern, then {(x,log y)} has a linear pattern.

Example:

x	y	log y
1	24	1.3802
2	192	2.2833
3	1,536	3.1864
4	12,188	4.0859
7	6,290,000	6.7987
8	49,900,000	7.6981

An exponential fit to (x,y) on the TI-83 yields y = 3(8^x), with r = .9999. If we attempt to extrapolate and predict a value for y when x = 9, we get y = 3(8⁹) = 402,653,184.

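A linear fit to (x,log y) on the TI-83 yields log y = .9027286x + 0.477395, with r = .9999. If x = 9, then log y = .9027286(9)+0.477395 = 8.6019524. Hence y = 10^8.6019524 = 399,900,917. The two predictions differ by less than 1%, which is consistent with the coefficients of the exponential model having been rounded to 3 and 8 (10^0.477395 is about 3.002, and 10^.9027286 is about 7.993).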
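The linear fit to (x, log y) can be reproduced without a calculator. The following Python sketch applies the ordinary least-squares formulas to the six tabulated points, using base-10 logs, and then back-transforms the prediction at x = 9. It is an illustration of the idea, not a description of how the TI-83 carries out its regression internally.

```python
from math import log10

xs = [1, 2, 3, 4, 7, 8]
ys = [24, 192, 1536, 12188, 6290000, 49900000]
logs = [log10(y) for y in ys]

n = len(xs)
mean_x = sum(xs) / n
mean_log = sum(logs) / n

# Least-squares slope and intercept for log y versus x.
slope = (sum((x - mean_x) * (l - mean_log) for x, l in zip(xs, logs))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_log - slope * mean_x

# Predict log y at x = 9, then undo the log.
predicted = 10 ** (intercept + slope * 9)

print(round(slope, 4), round(intercept, 4))
print(round(predicted))
```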

o Types of errors:

Type I error: Rejecting a null hypothesis when it is true.

Type II error: Failing to reject (loosely, "accepting") a null hypothesis when it is false.

Power of a test: Probability of correctly rejecting a false null hypothesis = 1 - Probability (Type II error).

Simple example:

Population #1: A A A A A A A A B B

Population #2: A B B B B B B B B B

Without knowing which of the populations is represented, an element is randomly chosen. After viewing the element, the chooser must guess the population from which it came.

Null hypothesis (H_o): The element came from population #1.

Alternate hypothesis (H_a): The element came from population #2.

Test decision: Accept H_o if the element is A; otherwise reject H_o and accept H_a.

Here is a probability chart:

	Ho TRUE	Ho FALSE
ACCEPT Ho	80%	10% (Type II error)
REJECT Ho	20% (Type I error)	90% (Power of the test)
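The entries in this chart follow from counting A's and B's in each population. As a small Python sketch (the function name p_reject is hypothetical), the decision rule "reject H_o if the element is B" gives:

```python
from fractions import Fraction

population_1 = "AAAAAAAABB"   # H_o true
population_2 = "ABBBBBBBBB"   # H_o false

def p_reject(population):
    """Probability the drawn element is B, i.e. that H_o is rejected."""
    return Fraction(population.count("B"), len(population))

type_i = p_reject(population_1)    # rejecting H_o when it is true
power = p_reject(population_2)     # rejecting H_o when it is false
type_ii = 1 - power                # accepting H_o when it is false

print(type_i, type_ii, power)
```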

o In hypothesis testing, the level of significance is the probability of making a Type I error.

o Thoughts on multiple choice statistics questions.

* Relate to the question. What topic is being referenced?

* Read carefully. Bring a colored pencil to highlight key words and phrases. After deciding on an answer choice, glance at the highlighted words and phrases to make sure you haven't made a careless mistake or an incorrect assumption.

* Realize scoring is (Number Right) - (1/4)(Number Wrong). Careless mistakes hurt.

* You don't have to answer all of the questions to get a good overall score.

* If an answer is "obvious," think about it. If it's so obvious to you, it's probably obvious to others... and the chances are good that it is not the correct response. For example, suppose one set of test scores has a mean of 80, and another set of scores on the same test has a mean of 90. If the two sets are combined, what is the mean of the combined scores? The "obvious" answer is 85 (and will certainly appear among the answer choices), but you, as an intelligent statistics student, realize that 85 is not necessarily the correct response, because the two sets may contain different numbers of scores.

* If a question and/or answer choice set appears to be detailed and you need to do a lot of reading to reach a conclusion, most of the answer choices will probably be obviously incorrect. Don't be frightened off by questions and/or answer set choices that seem to be wordy. Just read carefully, and use the highlighting technique previously mentioned.

* If you can eliminate one or more of the answer choices, you should respond, even if you have to guess from the remaining choices.
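To see why 85 need not be the combined mean in the test-score example above, consider this small Python sketch. The group sizes (10 and 30 scores) are hypothetical; the original example gives only the two means.

```python
def combined_mean(mean_1, n_1, mean_2, n_2):
    """Mean of two groups pooled together, weighted by group size."""
    return (mean_1 * n_1 + mean_2 * n_2) / (n_1 + n_2)

# Hypothetical sizes: 10 scores with mean 80, 30 scores with mean 90.
unequal = combined_mean(80, 10, 90, 30)

# Only when the groups are the same size is the answer 85.
equal = combined_mean(80, 20, 90, 20)

print(unequal, equal)
```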

o Thoughts on free response questions.

* Read carefully, sentence by sentence, and use a colored pencil to highlight key words or phrases.

* Relate to the problem. Decide what statistical concept/idea is involved. This will allow you to make an intelligent approach to questions asked. If you get started on an intelligent path, you will probably get some points even if you make some mistakes along the way.

* Be neat. Make it clear to the reader what you are attempting to do. However, don't write too much. Overkill can waste valuable time.

* Questions may well look very detailed. You may be given much more information than you actually need. This is likely to be true if you are shown a computer printout. Don't get flustered by the way a problem "looks" when you first glance at it. The 1997 AP Exam provides good examples of problems that look scary, but which are really quite reasonable if you remain level-headed.

* Some questions may give you considerable leeway in choosing an approach to a solution. Consider your options carefully and take the one that requires the least amount of time.

* Don't be calculator-inefficient. It is certainly possible to waste time punching numbers into a calculator. Entering lists of numbers into a calculator can be time-consuming, and certainly doesn't represent a display of statistical intelligence. If, upon reading an AP question, you think you will have to enter many numbers into a calculator, you are probably overlooking something. Reread the problem, and look for a quicker path to a solution.

=======================================================

This portion is directed at both students and teachers:

What follows is an attempt to summarize some of the thoughts, observations, and ideas presented in an excellent set of notes prepared by Diann Resnick (Bellaire High School, Houston, Texas) in an e-mail note (7/18/99) sent to the AP Statistics ListServe.

When AP instructions say "Give appropriate statistical evidence to support your conclusion," or "Justify, using statistical evidence," this means that the student should conduct formal hypothesis testing. This includes:

1. Stating the hypothesis in context of the problem.
2. Naming the test used and why it was used, and checking (not just naming) the conditions or assumptions for the test used. A rough sketch of the "shape" of the data might be helpful here.
3. Carrying out the mechanics of the test and giving a numerical test statistic and a p-value.
4. Writing the conclusion. The test statistic must be linked to the conclusion. Example: "Since the p-value is so small (p < .05), I reject the null hypothesis and conclude that there is an association between hikers' ability and direction traveled when lost."

Students should know how to read generic computer output.

Students should know the difference between a scatter plot and a residual plot.

Students should realize that a model that produces predicted values isn't providing actual data values. If the equation for a least-squares regression line is y = 1.5x + 3.34, then the slope and y-intercept need to be interpreted properly. For instance, one might say that "on the average, a unit change in x results in a change of 1.5 units in y" and that "the predicted value of y is 3.34 when x = 0."

Students should not be sloppy in choice of words. For instance, on a residual graph, the phrase "half are above and half are below" is not equivalent to "randomly scattered."

Students should give answers that make sense in the context of the problem. For instance, it generally makes no sense to talk about "1/3 of an airplane."

Students need to define symbols that they introduce.

Students need to realize that null and alternate hypotheses are stated in terms of population parameters, not sample statistics. Also, students need to be careful not to reverse the null and alternate hypotheses.

Students need to know the distinction between a test for homogeneity of proportions and one for independence.

When given data, students should actually check necessary assumptions instead of just saying something like "it is assumed....". For instance, in a chi-square test where cell counts are known, if all expected counts are greater than or equal to 5, this should be noted, as contrasted to just stating the assumptions for chi-square.

Students should interpret p-values correctly.

Students should understand the difference between a simple random sample and the random assignment of treatment to subjects.

Students should understand that there are two types of replication in experiments: (1) Replication within the experiment quantifies variability within the experiment, and (2) replication of the experiment helps achieve validation.

It is important to understand terms like confounding, lurking variables, etc.

Students need to be careful when using "calculator language." It is important for a reader to understand what is written and feel that the student really knows and understands what he/she wrote as a response to a problem.

===========================================

