"The primary question is not 'What dowe know?', but 'How do we know it?' "

OVERVIEW: We can see relations between two or more categorical variables by setting up tables. Up to this point, we have studied relations in which at least the response variable was quantitative.

A __two-way table__ of counts describes the relationship between two categorical variables: the row variable and the column variable. The row totals and column totals give the marginal distributions of the two variables separately, but do not give any information about the relationship between the variables. Probabilities, including conditional probabilities, can be calculated from two-way tables.

Simple example:

Notation:

...Prob(X) is the probability that X is true.

...Prob(X|Y) is the probability that X is true, given that Y is true.

Two hundred employees of a company are classified according to the following 2-by-3 table, where A, B, and C are mutually exclusive properties.

| | Have A | Have B | Have C | ROW TOTALS |
|---|---|---|---|---|
| FEMALE | 20 | 40 | 60 | 120 |
| MALE | 30 | 10 | 40 | 80 |
| COLUMN TOTALS | 50 | 50 | 100 | 200 |

o What is the probability that a randomly chosen person is female?

Ans. Prob(F) = 120/200 = 60%.

o What is the probability that a randomly chosen person has property A?

Ans. Prob(A) = 50/200 = 25%.

o If a randomly chosen person is female, what is the probability that she has property B?

Ans. Prob(B|F) = 40/120 = 33 1/3% [= Prob(B and F)/Prob(F).]

o If a randomly chosen person has property C, what is the probability that the individual is a male?

Ans. Prob(M|C) = 40/100 = 40% [= Prob(C and M)/Prob(C).]

o If a randomly chosen person has B or C, what is the probability that the person is a male?

Ans. Prob(M|B or C) = 50/150 = 33 1/3%.
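The five answers above can be checked mechanically. What follows is a minimal Python sketch, assuming the table is stored as a nested dictionary of counts. The names `counts`, `total`, and `prob` are illustrative choices and do not come from the original notes. Exact fractions are used so that 1/3 is not rounded.

```python
from fractions import Fraction

# Counts from the 2-by-3 table: rows are sex, columns are properties.
counts = {
    "F": {"A": 20, "B": 40, "C": 60},
    "M": {"A": 30, "B": 10, "C": 40},
}
ALL_ROWS = ("F", "M")
ALL_COLS = ("A", "B", "C")


def total(rows, cols):
    """Sum the counts over the given row labels and column labels."""
    return sum(counts[r][c] for r in rows for c in cols)


def prob(rows=ALL_ROWS, cols=ALL_COLS, given_rows=ALL_ROWS, given_cols=ALL_COLS):
    """Prob(event | condition), where both are sets of rows and columns.

    The numerator counts cells that satisfy the event AND the condition;
    the denominator counts cells that satisfy the condition.
    """
    joint_rows = [r for r in rows if r in given_rows]
    joint_cols = [c for c in cols if c in given_cols]
    return Fraction(total(joint_rows, joint_cols), total(given_rows, given_cols))


p_F = prob(rows=("F",))                                   # 120/200
p_A = prob(cols=("A",))                                   # 50/200
p_B_given_F = prob(cols=("B",), given_rows=("F",))        # 40/120
p_M_given_C = prob(rows=("M",), given_cols=("C",))        # 40/100
p_M_given_B_or_C = prob(rows=("M",), given_cols=("B", "C"))  # 50/150

for name, value in [
    ("Prob(F)", p_F),
    ("Prob(A)", p_A),
    ("Prob(B|F)", p_B_given_F),
    ("Prob(M|C)", p_M_given_C),
    ("Prob(M|B or C)", p_M_given_B_or_C),
]:
    print(f"{name} = {value} = {float(value):.1%}")
```

With no condition supplied, `prob` divides by the grand total of 200, which gives the marginal probabilities. Supplying `given_rows` or `given_cols` restricts the denominator to the conditioning row or columns, which is the definition Prob(X|Y) = Prob(X and Y)/Prob(Y).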

===================================

An example of **Simpson's paradox:**

Here are the batting averages of two baseball players for both halves of a season.

[Batting average is simply the ratio of number of hits to times at bat.]

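FIRST HALF-SEASON

| | Hits | Times at bat | Batting average |
|---|---|---|---|
| Caldwell | 60 | 200 | .300 |
| Wilson | 29 | 100 | .290 |

SECOND HALF-SEASON

| | Hits | Times at bat | Batting average |
|---|---|---|---|
| Caldwell | 50 | 200 | .250 |
| Wilson | 1 | 5 | .200 |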

Here are the batting averages for the entire season.

Caldwell: 110/400 = .275

Wilson: 30/105 = .286

Caldwell, despite having a better average than Wilson for both halves of the season, ends up with an overall average that is less than that of Wilson. Using percentages, one can construct numerous examples of Simpson's paradox.
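The reversal can be confirmed directly from the hits and at-bats. The sketch below is one way to do so in Python. The names `records` and `season_average` are illustrative and not part of the original notes. The season average pools hits and at-bats before dividing, rather than averaging the two half-season averages.

```python
from fractions import Fraction

# (hits, times at bat) for the first and second half-season.
records = {
    "Caldwell": [(60, 200), (50, 200)],
    "Wilson": [(29, 100), (1, 5)],
}


def season_average(halves):
    """Pool hits and at-bats over all halves, then divide."""
    hits = sum(h for h, _ in halves)
    at_bats = sum(ab for _, ab in halves)
    return Fraction(hits, at_bats)


# Per-half averages, in the same order as the records.
half_averages = {
    name: [Fraction(h, ab) for h, ab in halves]
    for name, halves in records.items()
}
season_averages = {name: season_average(halves) for name, halves in records.items()}

# Caldwell is ahead in each half taken separately...
caldwell_wins_each_half = all(
    c > w for c, w in zip(half_averages["Caldwell"], half_averages["Wilson"])
)
# ...yet Wilson is ahead once the halves are pooled.
wilson_wins_season = season_averages["Wilson"] > season_averages["Caldwell"]

print(caldwell_wins_each_half, wilson_wins_season)
for name, avg in season_averages.items():
    print(f"{name}: {float(avg):.3f}")
```

The reversal depends on the unequal at-bats. Of Wilson's 105 at-bats, 100 fall in the first half, so his season average sits close to his .290, while Caldwell's 400 at-bats are split evenly between .300 and .250.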

From an algebraic standpoint:

If a/b > c/d and p/q > r/s, then

...it is true that a/b + p/q > c/d + r/s.

...it is not necessarily true that (a+p)/(b+q) > (c+r)/(d+s).
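One way to see why the second inequality can fail is to write each pooled ratio as a weighted average of its two parts. The identities below are a sketch that assumes all denominators b, q, d, s are positive, as they are for times at bat. The substitution of the batting numbers (a/b and p/q for Caldwell, c/d and r/s for Wilson) is added here for illustration.

```latex
\frac{a+p}{b+q} = \frac{b}{b+q}\cdot\frac{a}{b} + \frac{q}{b+q}\cdot\frac{p}{q},
\qquad
\frac{c+r}{d+s} = \frac{d}{d+s}\cdot\frac{c}{d} + \frac{s}{d+s}\cdot\frac{r}{s}.

% Caldwell: weights 200/400 and 200/400
\frac{110}{400} = \tfrac{1}{2}(0.300) + \tfrac{1}{2}(0.250) = 0.275

% Wilson: weights 100/105 and 5/105
\frac{30}{105} = \tfrac{100}{105}(0.290) + \tfrac{5}{105}(0.200) \approx 0.286
```

The weights on the left side (b and q) need not match the weights on the right side (d and s). When they differ enough, the side with the smaller individual ratios can still have the larger pooled ratio.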
