Sanderson M. Smith

# AP STAT THOUGHTS (JUST BEFORE AP EXAM)

=================

Experiment vs. observation.... some kind of treatment must be administered in an experiment.

A strong association between two variables does not mean one caused the other. (Association does not imply causation.) If you want to show causation, this would require an experiment wherein you attempt to control variables that might confound the situation.

Least-squares regression line:

• Don't confuse slope of line and coefficient of correlation r. They are related, but they are not the same thing.
• Influential point vs. outlier on scatterplot. Careful here...a point can be an outlier, but not be influential. Know the difference.
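To see how slope and r are related but different, here is a minimal sketch. The data values are hypothetical (made up for illustration), and the variable names are my own. It computes both quantities from scratch and checks the relationship slope = r(s_y / s_x).

```python
from math import sqrt

# Hypothetical data, made up for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 6]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Sums of squares and cross-products about the means.
sxx = sum((xi - mean_x) ** 2 for xi in x)
syy = sum((yi - mean_y) ** 2 for yi in y)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

r = sxy / sqrt(sxx * syy)   # coefficient of correlation
slope = sxy / sxx           # slope of the least-squares line
s_x = sqrt(sxx / (n - 1))   # sample standard deviation of x
s_y = sqrt(syy / (n - 1))   # sample standard deviation of y

print("r             =", round(r, 4))
print("slope         =", round(slope, 4))
print("r * s_y / s_x =", round(r * s_y / s_x, 4))  # agrees with the slope
```

The slope carries the units of y per unit of x, while r has no units and always lies between -1 and 1.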

Simple random sample (SRS)... be careful here, and make sure you understand SRS. To have an SRS of size 4 (for instance), all groups of four must have an equal probability of being chosen. It is not enough to say each individual has an equal probability of being chosen. You can have plenty of random samples that are not SRS's. Remember our class... 5 boys and 5 girls. I can get a random sample of size 2 by flipping a coin... if heads, I'll randomly choose two of the girls. If tails, I'll randomly choose two of the boys. Each student has an equal probability of being selected, and the sample is clearly random... but it is not an SRS of size 2. Understand this!
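The coin-flip example can be checked by brute force. The sketch below uses hypothetical student labels (G1..G5, B1..B5) and exact fractions. It shows that every student has the same chance of selection, yet the possible pairs are not equally likely, so the method is not an SRS.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical class labels: 5 girls and 5 boys.
girls = ["G1", "G2", "G3", "G4", "G5"]
boys = ["B1", "B2", "B3", "B4", "B5"]
students = girls + boys

# Coin-flip method: heads -> 2 random girls, tails -> 2 random boys.
pair_prob = {}
for group in (girls, boys):
    pairs = list(combinations(group, 2))
    for pair in pairs:
        pair_prob[pair] = Fraction(1, 2) * Fraction(1, len(pairs))

# Probability that each individual student ends up in the sample.
student_prob = {
    s: sum(p for pair, p in pair_prob.items() if s in pair) for s in students
}

all_pairs = list(combinations(students, 2))  # every possible pair of students
srs_prob = Fraction(1, len(all_pairs))       # what an SRS would require

print("P(each student chosen):", set(student_prob.values()))
print("P(G1, G2):", pair_prob.get(("G1", "G2"), Fraction(0)))
print("P(G1, B1):", pair_prob.get(("G1", "B1"), Fraction(0)))
print("An SRS would give every pair:", srs_prob)
```

A mixed pair such as (G1, B1) can never be chosen by the coin-flip method, which is exactly why it fails the SRS requirement.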

Don't confuse blocking with stratification. There is some overlap, but they are not the same thing. Blocking is generally done to reduce variation. Stratification is generally used to get representations from various groups. A stratified random sample is not an SRS.

Multi-stage random sampling. Make sure you understand what this is. (Check text, or see me if necessary). A multi-stage random sample is not an SRS.

====================

Key points on sampling (and this often confuses people)...

The larger the sample size, the less variation you will have in your sample statistic. If you have a large population and want to estimate a population parameter, an SRS of size 500 is more desirable than an SRS of size 100 because the sample statistic (possibly a mean or a proportion) from the larger sample will have less variation than the one from a smaller sample... and will most likely (but not always) better represent the population parameter.

However (and here's where folks sometimes get confused)... An SRS of size 200 taken from a population of size 5,000 can be expected to have essentially the same statistical variability as an SRS of size 200 taken from a population of size 500,000. (An analogy might be helpful... consider room M-4 filled with M&M's.... and consider the new Cate gym filled with M&M's. If I take an SRS of size 200 from each place, I would expect the variation of "reds" in each sample to be the same, despite the two greatly-different-size populations from which the samples were taken.)
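A rough numerical check of this point, as a sketch only: the function name `sd_sample_proportion` and the 20% proportion of "reds" are assumptions of mine. The formula is the usual standard deviation of a sample proportion, with the finite population correction for sampling without replacement.

```python
from math import sqrt

def sd_sample_proportion(p, n, population_size):
    """Standard deviation of a sample proportion for an SRS drawn
    without replacement (includes the finite population correction)."""
    fpc = sqrt((population_size - n) / (population_size - 1))
    return sqrt(p * (1 - p) / n) * fpc

p = 0.2   # assumed proportion of "reds" (hypothetical)
n = 200

sd_small = sd_sample_proportion(p, n, 5_000)
sd_large = sd_sample_proportion(p, n, 500_000)
sd_n500 = sd_sample_proportion(p, 500, 500_000)

print("N = 5,000,   n = 200:", round(sd_small, 4))
print("N = 500,000, n = 200:", round(sd_large, 4))
print("N = 500,000, n = 500:", round(sd_n500, 4))
```

The first two results differ only slightly, while raising the sample size to 500 cuts the variability noticeably. Sample size matters far more than population size.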

========================

Don't forget (or misuse, as many do) the Central Limit Theorem, which deals with sample means. (Remember that you saw this amazing theorem demonstrated multiple times on the classroom computer screen.) Understand what the CLT says. In a nutshell, if you have a population with mean m and standard deviation s (and the population itself does not have to be normal), and if you consider the means of all simple random samples of size N, then

• the distribution of the sample means is approximately normal. (Catch the word approximately)
• the mean of the collection of means is equal to m. (The mean is an unbiased statistic.)
• the standard deviation of the sample means is equal to s / sqrt(N)...... Don't blow this part on the AP... many do!!! The standard deviation is not an unbiased statistic.
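The last two bullets can be verified exactly on a tiny example. This is a sketch under stated assumptions: the population values are hypothetical, and samples are drawn with replacement (standing in for a very large population), so every ordered sample is equally likely. It checks the mean and standard deviation facts only, not the approximately normal shape.

```python
from itertools import product
from math import sqrt
from statistics import mean, pstdev

# Hypothetical, deliberately non-normal population.
population = [1, 2, 3, 4, 10]
N = 2  # sample size

m = mean(population)     # population mean
s = pstdev(population)   # population standard deviation

# Every possible ordered sample of size N, drawn with replacement
# (an assumption that stands in for a very large population).
sample_means = [mean(sample) for sample in product(population, repeat=N)]

print("mean of sample means:", mean(sample_means), "vs m =", m)
print("sd of sample means:  ", round(pstdev(sample_means), 4))
print("s / sqrt(N):         ", round(s / sqrt(N), 4))
```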

==============

You've been through a sophisticated college-level course... and the AP exam will be sophisticated and rigorous.

Be 100% (yes, I mean 100%, not 99.9%) sure that you know the terminology relating to

Type I Error

Type II Error

Power of a Test

KNOW THE TERMINOLOGY

Type I Error: Null hypothesis is true, but it is rejected. (You can only make a Type I Error if Ho is true.)

Type II Error: Null hypothesis is false, but it is accepted. (You can only make a Type II Error if Ho is false.)

Power of a test: This is the probability that a false null hypothesis is correctly rejected. It is 1 - Probability (Type II Error)

When you set a level of significance in a quality control test, you are setting the probability of making a Type I Error.

If you want to decrease the probability of a Type I Error, you must accept the reality that you may well increase the probability of a Type II error, and vice-versa. In real-life situations, you must make a decision as to which type of error you want to minimize.
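The trade-off can be seen numerically. The sketch below assumes a one-sided z-test with known standard deviation; the function name `power_one_sided_z` and all of the numbers are hypothetical, chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist

def power_one_sided_z(mu0, mu_true, sigma, n, alpha):
    """Power of a one-sided z-test of Ho: mu = mu0 vs Ha: mu > mu0
    when the true mean is mu_true and sigma is known."""
    std_err = sigma / sqrt(n)
    # Reject Ho when the sample mean exceeds this cutoff.
    cutoff = NormalDist(mu0, std_err).inv_cdf(1 - alpha)
    # Probability the sample mean lands past the cutoff when Ho is false.
    return 1 - NormalDist(mu_true, std_err).cdf(cutoff)

# Hypothetical quality-control numbers, chosen only for illustration.
for alpha in (0.10, 0.05, 0.01):
    power = power_one_sided_z(mu0=100, mu_true=106, sigma=15, n=25, alpha=alpha)
    print(f"alpha = {alpha:.2f}  P(Type II) = {1 - power:.3f}  power = {power:.3f}")
```

As alpha (the probability of a Type I Error) shrinks, the probability of a Type II Error grows and the power drops, with the sample size held fixed.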

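If you calculate a P-value to be 14% (for instance), then, assuming Ho is true, there is a 14% chance of getting a sample result at least as extreme as the one you observed. (Be careful here... the P-value is not the probability that Ho is true, and it is not the probability that you are wrong if you reject Ho. The probability of a Type I Error is the level of significance you set in advance.)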

W.H. Deming tried to get the U.S. automobile industry in Detroit to understand these ideas. Detroit would not listen. The Japanese did! The rest is history.

Have I told you that AP Statistics is sophisticated?

MATH POWER TO ALL

=================

Know the interpretation of r^2. (This is a good possibility for multiple choice questions... make sure you know how to interpret r^2.)

* This is the proportion of the variation in y (response variable) that is explained by the least-squares regression line of y on x (explanatory variable).

* If r^2 = .78 (for instance), then 78% of the variation in y is explained by the least-squares regression line. Note that the value of r could be plus or minus here.

* If r = -.86, then r^2 = (-.86)^2 = .7396, and we say that about 74% of the variation in y is explained by the least-squares regression line.
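To connect the number r^2 to "variation explained," here is a small sketch with hypothetical data (the values and variable names are mine). It computes r directly, then separately computes 1 - (sum of squared residuals) / (total variation in y), and the two agree.

```python
from math import sqrt

# Hypothetical data, made up for illustration only.
x = [1, 2, 3, 4, 5, 6]
y = [11, 9, 8, 8, 5, 4]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n
sxx = sum((xi - mean_x) ** 2 for xi in x)
syy = sum((yi - mean_y) ** 2 for yi in y)   # total variation in y
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

r = sxy / sqrt(sxx * syy)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Variation in y left unexplained by the line (sum of squared residuals).
sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
explained = 1 - sse / syy

print("r   =", round(r, 4))   # negative, since y falls as x rises
print("r^2 =", round(r ** 2, 4))
print("explained proportion =", round(explained, 4))
```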

Shifting data, expanding (or contracting) data.

Given a numerical data set, say {a,b,c,d,e}. Let mean = m, standard dev. = s, variance = s^2.

* Adding a number (say 5) to each value merely shifts the data. The mean would increase by 5, but the standard deviation and variance would not change.

* For the set {3a,3b,3c,3d,3e}, the mean would be 3m, the standard deviation would be 3s, and the variance would be 9s^2.
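A quick check of both rules, as a sketch with a hypothetical five-value data set. It uses the population versions `pstdev` and `pvariance` from Python's `statistics` module; the sample versions scale the same way.

```python
from statistics import mean, pstdev, pvariance

data = [2, 4, 4, 6, 9]   # hypothetical data set {a, b, c, d, e}

shifted = [v + 5 for v in data]   # add 5 to each value
scaled = [3 * v for v in data]    # multiply each value by 3

for label, values in (("original", data), ("add 5", shifted), ("times 3", scaled)):
    print(f"{label:9s} mean = {mean(values):5.2f}  "
          f"sd = {pstdev(values):.4f}  variance = {pvariance(values):.4f}")
```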

t-test

* Used when you have a sample, but don't know the population standard deviation. In this case, you use the s statistic, and degrees of freedom are involved. The t-distribution is not normal, but it approaches normality as sample size gets larger. A general rule of thumb says the normal distribution can be used if a sample size is greater than 30.

* If you run a t-test, remember to check the "shape" of the sample. In general, t-test is OK if sample does not have extreme outliers.

binomial --> normal as sample gets larger.

=====

A population is 30% Jewish. Consider a random sample of size 5. Suppose we are interested in the proportion of Jews in the sample. This is clearly binomial. Small sample, and 5(.3) < 10, 5(.7) < 10.

Probability(2 or fewer Jews in sample) = binomcdf(5,.3,2) = .837.
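If you want to see what the calculator is doing, here is a minimal sketch. The helper name `binom_cdf` is my own; it simply adds up the binomial probabilities for 0, 1, and 2 successes.

```python
from math import comb

def binom_cdf(n, p, k):
    """P(X <= k) for X ~ Binomial(n, p); mirrors the calculator's binomcdf(n, p, k)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

print(round(binom_cdf(5, 0.3, 2), 4))  # 0.8369
```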

=====

A population is 30% Jewish. Consider a random sample of size 200. The distribution of the proportion of Jews is still binomial, but is best approximated by the normal distribution. 200(.3) and 200(.7) are both 10 or greater. OK, make sure you get this straight. In this case, there would be 201 possible sample proportions for Jews. The mean of the distribution of proportions is .3 and the standard deviation is sqrt[(.3)(.7)/200] = .0324. Suppose we want the probability that a sample of 200 has 50 or fewer Jews. OK, 50 Jews represents a sample proportion of .25. The desired probability is

normalcdf(-1E99,.25,.3,.0324) = .06139.

Again, this is definitely binomial, and I point out that your calculator can handle

binomcdf(200,.3,50) = .0695 (Note this is reasonably close to the normal approximation).

However, keep in mind your calculator has limitations on binomial computations. Be fully aware that the normal is generally used for large sample sizes.
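The comparison for the sample of 200 can be reproduced off the calculator. This is a sketch, not the calculator's own algorithm: it uses `math.comb` for the exact binomial sum and `statistics.NormalDist` for the normal approximation, with no continuity correction (matching the normalcdf setup above).

```python
from math import comb, sqrt
from statistics import NormalDist

n, p, k = 200, 0.3, 50

# Exact binomial: P(X <= 50).
exact = sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

# Normal approximation on the sample-proportion scale.
sd = sqrt(p * (1 - p) / n)
approx = NormalDist(p, sd).cdf(k / n)

print("sd of sample proportion:", round(sd, 4))
print("normal approximation:   ", round(approx, 4))
print("exact binomial:         ", round(exact, 4))
```

The two probabilities should land close to the calculator values quoted above, with small differences from rounding the standard deviation.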

As silly as this statement may seem, keep reminding yourself that you have had a course in college-level statistics. Any question, be it multiple choice or essay, is testing something that you supposedly have learned. Try to identify the topic involved in each question, and respond accordingly. There is clearly something statistical behind each question. (If an answer appears too obvious, stop and think! )

==================================================

* Make sure your calculator batteries are OK. Put in fresh batteries if yours have been in use for a long time. (Don't lose time and benefits because your calculator fails.)

* Not a bad idea to have a backup calculator.

* The exam will be very sophisticated. The national distribution of scores will be lower than that of other AP's for a variety of reasons. So... don't rush! Work carefully... and think!!! Remember, you're in the same boat with the anticipated 38,000 students nationwide who will take the test.

On the multiple choice (50% of your total score determined here.)

* Read carefully. Highlight key words and phrases, particularly if there is a lot of reading to do.

* Scan the answers to try to anticipate the subject of the question.

* Don't rush... think carefully. If an answer is too obvious, and it's one you would have chosen before you took a college-level statistics course, it's probably not the right one.

* Answer as many as you can, but don't wildly guess. You don't have to answer every question to get a good score.

On the essays (50% of your total score determined here)

* There are six. Remember that the sixth question is 25% of the essay portion, which is 12.5% of your total exam score. You are told to spend 25 minutes on it. Be aware of time. In past years, the sixth question has had multiple parts. The key here is not to be frightened by the way the question looks (and it often appears to be scary). Read and highlight... the first parts of question #6 shouldn't be bad if you understand the premise.

* Read the questions carefully. Highlight important words and phrases.

* On all essay questions....Don't write too much. A few carefully written words and phrases is enough if you are precise and on the mark. The readers are looking for certain key things, not volumes and volumes of words that may say absolutely nothing in a statistical sense. Sometimes a bulleted approach is good when responding to essay questions.

* Watch your use of words. Sometimes just one or two words make a tremendous difference in the amount of credit given for a response. For instance,

-as sample size increases, the binomial distribution "approaches" the normal distribution. (It doesn't "become" normal.)

-the slope of the least-squares regression line gives you the "approximate" increase in y for a unit increase in x.

=========================================

Make sure you understand what a confidence interval is. This is almost certain to surface somehow. If you have a 95% confidence interval for a mean (for instance), then there is a 95% chance that you have an interval that contains the population mean. It is technically not correct to say that there is a 95% chance that the mean is in the interval you have constructed. When defining a confidence interval, you need to give a statement about an interval, not about a parameter. Another way to correctly explain a 95% confidence interval is to say that 95% of the confidence intervals constructed by the method would contain the mean.
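The "95% of the intervals constructed by the method" idea can be simulated. This is a sketch under assumptions of my own: a normal population with hypothetical mean 50 and standard deviation 10, samples of size 25, and the population standard deviation treated as known (so a z-interval is used).

```python
import random
from math import sqrt
from statistics import NormalDist, mean

random.seed(1)  # fixed seed so the run is repeatable

mu, sigma, n = 50, 10, 25              # hypothetical population and sample size
z_star = NormalDist().inv_cdf(0.975)   # about 1.96 for 95% confidence
margin = z_star * sigma / sqrt(n)      # sigma treated as known (an assumption)

trials = 2000
hits = 0
for _ in range(trials):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    x_bar = mean(sample)
    # Does this particular interval capture the fixed population mean?
    if x_bar - margin <= mu <= x_bar + margin:
        hits += 1

coverage = hits / trials
print("proportion of intervals containing the mean:", coverage)
```

The population mean never moves; it is the intervals that vary from sample to sample, and roughly 95% of them capture the mean.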

Just a review (since you are almost certain to see this somehow)...

Type I Error: Occurs when null hypothesis is true, but you reject it.

Type II Error: Occurs when null hypothesis is false, but you accept it.

Power of a Test = 1 - Probability (Type II error).

If the probability of a Type II error is 11%, then the power of the test is 89%.

========================================

OK... enough rambling.

Take the test calmly. Read carefully. Remember you have had a course in college-level statistics. (Don't start babbling on the essays... It's easy to do if you lose your concentration!)

Just do the best that you can. That's all anybody can ask!

MATH POWER TO ALL.