"The primary question is not 'What dowe know?', but 'How do we know it?' "

Aristotle, to Thales


4.3 RELATIONS IN CATEGORICAL DATA (Pages 215-226)

OVERVIEW: We can see relations between two or more categorical variables by setting up tables. Up to this point, we have studied relations in which at least the response variable was quantitative.

A two-way table of counts describes the relationship between two categorical variables: the row variable and the column variable. The row totals and column totals give the marginal distributions of the two variables separately, but do not give any information about the relationship between the variables. Probabilities, including conditional probabilities, can be calculated from two-way tables.

Simple example:
Notation:
...Prob(X) is the probability that X is true.
...Prob(X|Y) is the probability that X is true, given that Y is true.

Two hundred employees of a company are classified according to the following 2-by-3 table, where A, B, and C are mutually exclusive properties.

                 Have A    Have B    Have C    ROW TOTALS
FEMALE             20        40        60         120
MALE               30        10        40          80
COLUMN TOTALS      50        50       100         200

o What is the probability that a randomly chosen person is female?

Ans. Prob(F) = 120/200 = 60%.

o What is the probability that a randomly chosen person has property A?

Ans. Prob(A) = 50/200 = 25%.

o If a randomly chosen person is female, what is the probability that she has property B?

Ans. Prob(B|F) = 40/120 = 33 1/3% [= Prob(B and F)/Prob(F)].

o If a randomly chosen person has property C, what is the probability that the individual is a male?

Ans. Prob(M|C) = 40/100 = 40% [= Prob(C and M)/Prob(C)].

o If a randomly chosen person has B or C, what is the probability that the person is a male?

Ans. Prob(M|B or C) = (10 + 40)/(50 + 100) = 50/150 = 33 1/3%.
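
The answers above can be checked mechanically. The following is a minimal Python sketch, not part of the textbook: the names `cells`, `count`, and `prob` are made up for illustration, and an event is represented as a function of (sex, property) that returns True or False. Exact fractions are used so that 1/3 does not get rounded.

```python
from fractions import Fraction

# Counts from the 2-by-3 table above, keyed by (sex, property).
cells = {
    ("F", "A"): 20, ("F", "B"): 40, ("F", "C"): 60,
    ("M", "A"): 30, ("M", "B"): 10, ("M", "C"): 40,
}

def count(event):
    """Number of employees in the cells where the event is true."""
    return sum(n for (sex, prop), n in cells.items() if event(sex, prop))

def prob(event, given=lambda sex, prop: True):
    """Prob(event | given) = count(event and given) / count(given).

    With the default 'given' (always true) this is the plain Prob(event).
    """
    both = count(lambda sex, prop: event(sex, prop) and given(sex, prop))
    return Fraction(both, count(given))

def is_female(sex, prop):
    return sex == "F"

def is_male(sex, prop):
    return sex == "M"

print("Prob(F)        =", prob(is_female))
print("Prob(A)        =", prob(lambda s, p: p == "A"))
print("Prob(B|F)      =", prob(lambda s, p: p == "B", given=is_female))
print("Prob(M|C)      =", prob(is_male, given=lambda s, p: p == "C"))
print("Prob(M|B or C) =", prob(is_male, given=lambda s, p: p in ("B", "C")))
```

Each conditional probability simply restricts the count to the cells where the condition holds, which is the same as dividing Prob(X and Y) by Prob(Y).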

===================================

An example of Simpson's paradox:

Here are the batting averages of two baseball players for both halves of a season.
[Batting average is simply the ratio of number of hits to number of times at bat.]

                      FIRST HALF-SEASON                        SECOND HALF-SEASON
            Hits   Times at bat   Batting average     Hits   Times at bat   Batting average
Caldwell     60        200             .300            50        200             .250
Wilson       29        100             .290             1          5             .200

Here are the batting averages for the entire season.

Caldwell: 110/400 = .275

Wilson: 30/105 = .286

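Caldwell, despite having a better average than Wilson for both halves of the season, ends up with an overall average that is less than that of Wilson. The reversal is possible because the two players' at-bats are distributed differently: Wilson had 100 of his 105 at-bats in the first half, when both players hit better, while Caldwell's at-bats were split evenly. Using percentages, one can construct numerous examples of Simpson's paradox.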

From an algebraic standpoint:
If a/b > c/d and p/q > r/s, then
...it is true that a/b + p/q > c/d + r/s.
...it is not necessarily true that (a+p)/(b+q) > (c+r)/(d+s).
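
The batting example is one instance of these inequalities, with a/b and p/q as Caldwell's half-season averages and c/d and r/s as Wilson's. The short Python sketch below, offered only as an illustration (the names `records` and `season_average` are made up here), checks both claims with exact fractions.

```python
from fractions import Fraction

# (hits, times at bat) for each half-season, taken from the table above.
records = {
    "Caldwell": [(60, 200), (50, 200)],
    "Wilson": [(29, 100), (1, 5)],
}

def half_averages(halves):
    """Batting average for each half-season: a/b and p/q."""
    return [Fraction(hits, at_bats) for hits, at_bats in halves]

def season_average(halves):
    """Pooled average (a + p)/(b + q), which is NOT the same as a/b + p/q."""
    total_hits = sum(hits for hits, _ in halves)
    total_at_bats = sum(at_bats for _, at_bats in halves)
    return Fraction(total_hits, total_at_bats)

caldwell = half_averages(records["Caldwell"])
wilson = half_averages(records["Wilson"])

# Caldwell is ahead in each half, so the sum of his averages is larger too.
print("Ahead in each half:", all(c > w for c, w in zip(caldwell, wilson)))
print("Larger sum of averages:", sum(caldwell) > sum(wilson))

# The pooled season averages go the other way.
for player, halves in records.items():
    print(player, "season average:", round(float(season_average(halves)), 3))
print("Caldwell ahead for the season:",
      season_average(records["Caldwell"]) > season_average(records["Wilson"]))
```

Pooling weights each half by its number of at-bats, so the comparison can flip when the two players' weights differ sharply.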
