"To be a statistician is great! Younever have to be 'absolutely sure' of something. Being 'reasonablycertain' is enough."

Pavel E. Guarisma, North Carolina State University


Here are those AP STATers again. Each is holding a copy of The Practice of Statistics, by Yates, Moore, McCabe. Considering the textbooks as points, what characteristic would the least-squares regression line possess?

(A) It would be approximately y = x.

(B) It would be approximately horizontal.

(C) It would be approximately vertical.

3.3 LEAST-SQUARES REGRESSION (Pages 137-160)

OVERVIEW: If a scatterplot shows a linear relationship between two quantitative variables, least-squares regression is a method for finding a line that summarizes the relationship between the two variables, at least within the domain of the explanatory variable, x. The least-squares regression line (LSRL) is a mathematical model for the data.

Regression Line: A straight line that describes how a response variable y changes as an explanatory variable x changes. It can sometimes be used to predict the value of y for a given value of x.

A residual is the difference between an observed y and the predicted y: residual = y - y(hat).
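As a minimal sketch of these two definitions, the Python snippet below fits a least-squares line to three illustrative points, (2, 6), (4, 12), and (6, 15), and lists the residuals. The function name least_squares_line and the variable names are assumptions made for this illustration. The slope and intercept come from the standard least-squares formulas, which this page does not spell out.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the least-squares line y(hat) = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Standard least-squares formulas (assumed here, not quoted from the page).
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx            # slope
    a = y_bar - b * x_bar    # intercept; the line passes through (x_bar, y_bar)
    return a, b


xs = [2, 4, 6]
ys = [6, 12, 15]

a, b = least_squares_line(xs, ys)
predicted = [a + b * x for x in xs]
# residual = observed y - predicted y
residuals = [y - y_hat for y, y_hat in zip(ys, predicted)]

print(a, b)            # 2.0 2.25
print(residuals)       # [-0.5, 1.0, -0.5]
print(sum(residuals))  # 0.0
```

The residuals of a least-squares line always sum to zero (up to rounding), which makes a quick check on hand calculations.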

Important facts about the least squares regression line.

r^2 in regression: The coefficient of determination, r^2, is the fraction of the variation in the values of y that is explained by the least-squares regression of y on x.

Calculation of r^2 for a simple example:

r^2 = (SSM - SSE)/SSM, where

SSM = sum(y - y(bar))^2 (sum of squares about the mean of y)
SSE = sum(y - y(hat))^2 (sum of squares of the residuals)

In this example, y(hat) = 2 + 2.25x, the mean of x is 4, and the mean of y is 11.

x         y      y - 11    (y - 11)^2    y(hat)    residual = y - y(hat)    (residual)^2
2         6        -5          25          6.5             -0.5                 0.25
4        12         1           1         11.0              1.0                 1.00
6        15         4          16         15.5             -0.5                 0.25
TOTALS              0       42 = SSM                        0.0              1.50 = SSE

r^2 = (SSM - SSE)/SSM = (42 - 1.5)/42 = 40.5/42 = 0.9642857143
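The arithmetic in this example can be checked with a short Python sketch. It uses the three data points and the line y(hat) = 2 + 2.25x from the example; the variable names are chosen for this illustration only.

```python
xs = [2, 4, 6]
ys = [6, 12, 15]

y_bar = sum(ys) / len(ys)              # mean of y, 11.0
y_hat = [2 + 2.25 * x for x in xs]     # predictions from the example's line

# SSM: sum of squares about the mean of y
ssm = sum((y - y_bar) ** 2 for y in ys)
# SSE: sum of squares of the residuals
sse = sum((y - yh) ** 2 for y, yh in zip(ys, y_hat))

r_squared = (ssm - sse) / ssm

print(ssm, sse)              # 42.0 1.5
print(round(r_squared, 4))   # 0.9643
```

About 96% of the variation in y is explained by the regression on x, so the line fits these three points closely.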

THINGS TO NOTE:

Outlier: A point that lies outside the overall pattern of the other points in a scatterplot. (It can be an outlier in the x direction, in the y direction, or in both directions.)

Influential point: A point that, if removed, would considerably change the position of the regression line. (Points that are outliers in the x direction are often influential.)
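To illustrate what "influential" means, the sketch below fits a least-squares line with and without a single point that is an outlier in the x direction. The data set is hypothetical, invented only for this illustration, and the helper least_squares_line is an assumed name that uses the standard slope and intercept formulas.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the least-squares line y(hat) = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    return y_bar - b * x_bar, b


# Hypothetical data: five points that lie exactly on y = 2x ...
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]
# ... plus one point far to the right in x that breaks the pattern.
outlier = (15, 5)

a_without, b_without = least_squares_line(xs, ys)
a_with, b_with = least_squares_line(xs + [outlier[0]], ys + [outlier[1]])

print(round(b_without, 3))  # 2.0
print(round(b_with, 3))     # 0.077
```

Removing the one extreme point changes the slope from about 0.08 to 2, so under the definition above that point is influential.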

NOTE: Do not confuse the slope b of the LSRL with the correlation r. The relation between the two is given by the formula b = r(s_y/s_x). If you are working with normalized data, then b does equal r, since s_y = s_x = 1. (When you normalize a data set, the normalized data has mean = 0 and standard deviation = 1.) If you are working with normalized data, the regression line has the simple form y_n = r x_n, where x_n and y_n are normalized x and y values, respectively. Since the regression line contains the mean of x and the mean of y, and since normalized data has a mean of 0, the regression line for normalized x and y values contains (0,0).
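The relation b = r(s_y/s_x) and the normalized form y_n = r x_n can be checked numerically. The sketch below is one way to do so, assuming the three illustrative points (2, 6), (4, 12), (6, 15); the helper names least_squares_line, correlation, and normalize are hypothetical, and the standard deviations are sample standard deviations (divisor n - 1).

```python
import math
import statistics


def least_squares_line(xs, ys):
    """Return (a, b) for the least-squares line y(hat) = a + b*x."""
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    return y_bar - b * x_bar, b


def correlation(xs, ys):
    """Pearson correlation r."""
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def normalize(values):
    """Rescale values to mean 0 and sample standard deviation 1."""
    m, s = statistics.mean(values), statistics.stdev(values)
    return [(v - m) / s for v in values]


xs = [2, 4, 6]
ys = [6, 12, 15]

a, b = least_squares_line(xs, ys)
r = correlation(xs, ys)
s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)

# Slope of the raw data versus r * (s_y / s_x): these should agree.
print(round(b, 4), round(r * s_y / s_x, 4))

# After normalizing both variables, the slope should equal r
# and the intercept should be 0, so the line passes through (0, 0).
a_n, b_n = least_squares_line(normalize(xs), normalize(ys))
print(round(b_n, 4), round(r, 4), round(abs(a_n), 4))
```

For the raw data the slope (2.25) and the correlation (about 0.98) are clearly different numbers; they coincide only after both variables are normalized.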

PHACS (Procedure, Hypothesis, Assumptions, Calculations, Summarize)
