
In this chapter we compare a nominal outcome among several groups. The study outcome is stored as a text or numerically encoded dependent variable. The study predictor is stored in a separate text or numerically encoded independent variable.

**Illustrative example.** Techniques in this chapter will be illustrated with the case-control data set

| Record | CASE | ALC |
|---|---|---|
| 1 | 2 | 1 |
| 2 | 2 | 1 |
| 3 | 2 | 1 |
| . | . | . |
| 975 | 1 | 4 |

Suggestion: Download the data set, unzip it and open the file. View its content and take note of its structure.

Before data are tested, they are cross-tabulated to form an **r-by-c contingency table**, where *r* is the number of rows and *c* is the number of columns. The illustrative data form a 4-by-2 table:

| Alcohol consumption (grams/day) | Cases (n = 200) | Controls (n = 775) |
|---|---|---|
| 0-39 | 29 | 386 |
| 40-79 | 75 | 280 |
| 80-119 | 51 | 87 |
| 120+ | 45 | 22 |

Notice that these data might just as easily have been set up as a 2-by-4 contingency table, with case status represented along rows and
alcohol consumption level represented along columns. This would **not** materially change the data or the conclusions to follow.

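**For the sake of consistency in this chapter,** let us arrange our tables with the dependent variable listed down table rows and the
independent variable (groups) listed across table columns. Let $n_i$ denote the total for column (group) $i$, $m_j$ the total for row $j$, $x_{j,i}$ the count in row $j$ and column $i$, and $N$ the grand total: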

| Variable R (dependent var) \ Variable C (independent var) | 1 | 2 | ... | c | Total |
|---|---|---|---|---|---|
| 1 | $x_{1,1}$ | $x_{1,2}$ | ... | $x_{1,c}$ | $m_1$ |
| 2 | $x_{2,1}$ | $x_{2,2}$ | ... | $x_{2,c}$ | $m_2$ |
| . | ... | ... | ... | ... | . |
| r | $x_{r,1}$ | $x_{r,2}$ | ... | $x_{r,c}$ | $m_r$ |
| Total | $n_1$ | $n_2$ | ... | $n_c$ | $N$ |

To cross-tabulate data in Epi Info, issue the commands:

`EPI6> READ <x:\path\dataset.rec>
EPI6> TABLES <rowvar> <columnvar>`

For example, to cross-tabulate the illustrative example, issue the commands:

`EPI6> READ A:\BD1
EPI6> TABLES ALC CASE`

Output is:

`                 CASE
ALC        |      1      2 | Total
-----------+---------------+------
         1 |     29    386 |   415
         2 |     75    280 |   355
         3 |     51     87 |   138
         4 |     45     22 |    67
-----------+---------------+------
     Total |    200    775 |   975`

Suggestion: Cross-tabulate the data in Epi Info.
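For readers without Epi Info, the same cross-tabulation can be sketched in Python. This is only an illustration under stated assumptions: the raw data file is not read here, so the records are rebuilt from the cell counts shown in the output above, and the function name `crosstab` and all variable names are hypothetical.

```python
from collections import Counter

# Cell counts taken from the cross-tabulation above: (ALC, CASE) -> count.
# ALC levels 1-4 correspond to 0-39, 40-79, 80-119 and 120+ grams/day.
cell_counts = {
    (1, 1): 29, (1, 2): 386,
    (2, 1): 75, (2, 2): 280,
    (3, 1): 51, (3, 2): 87,
    (4, 1): 45, (4, 2): 22,
}

# Assumption: expand the counts into one (ALC, CASE) pair per subject as a
# stand-in for the raw data file. The real file's record order is not reproduced.
records = [pair for pair, n in cell_counts.items() for _ in range(n)]


def crosstab(pairs):
    """Tabulate (row, column) pairs into a nested dict plus marginal totals."""
    counts = Counter(pairs)
    rows = sorted({r for r, _ in counts})
    cols = sorted({c for _, c in counts})
    table = {r: {c: counts[(r, c)] for c in cols} for r in rows}
    row_totals = {r: sum(table[r].values()) for r in rows}
    col_totals = {c: sum(table[r][c] for r in rows) for c in cols}
    return table, row_totals, col_totals


table, row_totals, col_totals = crosstab(records)
for r in table:
    print(r, [table[r][c] for c in col_totals], row_totals[r])
print("Total", list(col_totals.values()), len(records))
```

The marginal totals printed by this sketch should agree with the `Total` row and column of the Epi Info output.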

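We want to compute **relative frequencies (percentages) within groups**. With groups arranged across columns, the percentage of people in group *i* with characteristic *j* is the cell count divided by its column total: $x_{j,i} / n_i \times 100\%$, where $x_{j,i}$ is the count in row $j$ of column $i$ and $n_i$ is the total for column $i$.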

To have Epi Info calculate cell percentages, issue the command:

`EPI6> SET PERCENTS = ON`

Then reissue the TABLES command:

`EPI6> TABLES ALC CASE`

**Output** is:

`                 CASE
ALC        |      1      2 | Total
-----------+---------------+------
         1 |     29    386 |   415
           >   7.0%  93.0% > 42.6%
           |  14.5%  49.8% |
         2 |     75    280 |   355
           >  21.1%  78.9% > 36.4%
           |  37.5%  36.1% |
         3 |     51     87 |   138
           >  37.0%  63.0% > 14.2%
           |  25.5%  11.2% |
         4 |     45     22 |    67
           >  67.2%  32.8% >  6.9%
           |  22.5%   2.8% |
-----------+---------------+------
     Total |    200    775 |   975
           |  20.5%  79.5% |`

Observe that each cell now displays a count and two percentages. The **row percent** (indicated with a ">") shows the cell count as a
percentage of the row total. The **column percent** shows the cell count as a percentage of the column total. *Since we are currently
interested in percentages within groups, we should focus on column percentages.* Notice that cases are more likely than controls to fall
into high `ALC` levels: 22.5% of cases but only 2.8% of controls are in level 4.
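The row and column percentages can be reproduced with a few lines of Python. The sketch below is an illustration, not part of Epi Info; it assumes the observed counts are typed in by hand as a list of rows (ALC levels 1 to 4) with columns ordered as cases, then controls.

```python
# Observed counts: one row per ALC level, columns are (cases, controls).
observed = [
    [29, 386],
    [75, 280],
    [51, 87],
    [45, 22],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]

# Row percent: cell count as a percentage of its row total.
row_pct = [[100 * x / m for x in row] for row, m in zip(observed, row_totals)]

# Column percent: cell count as a percentage of its column total.
col_pct = [[100 * x / n for x, n in zip(row, col_totals)] for row in observed]

for level, (rp, cp) in enumerate(zip(row_pct, col_pct), start=1):
    print(level,
          "row %:", [f"{p:.1f}" for p in rp],
          "column %:", [f"{p:.1f}" for p in cp])
```

Within each column the column percentages sum to 100, which is a quick check that the within-group distribution has been computed over the right totals.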

Inferential methods in this chapter are based on the **chi-square probability function**, as introduced by Karl Pearson (circa 1900) with
later development by R. A. Fisher. Chi-square distributions are asymmetrical probability functions with long right tails.
The area under the curve to the right of an observed statistic gives the probability of a value at least that large arising by chance. Although chi-square distributions have many
uses, this chapter focuses on their use in testing whether two categorical variables are independent. This test is called
the **chi-square test of independence**.

The **null hypothesis** is that the row and column variables are independent. The **alternative hypothesis** is that the row and column
variables are dependent. This is equivalent to:

*H*_{0}: no association between the row and column variables

*H*_{1}: association between the row and column variables

The **alpha** level is set before the data are tested. In this instance, let alpha = .01 (just for a change).

The **chi-square test** is used to perform the test. This test is based on a comparison of observed cell counts to expected counts.
**Expected counts** represent hypothetical values that would occur if there were no association between the variables being tested. The
expected count in each table cell is calculated:

expected count = (row total × column total) / *N*

For the illustrative example, expected counts are:

| Alcohol consumption (grams/day) | Esophageal cancer: Yes | Esophageal cancer: No | Total |
|---|---|---|---|
| 0-39 | (415 × 200) / 975 = 85.13 | (415 × 775) / 975 = 329.87 | 415 |
| 40-79 | (355 × 200) / 975 = 72.82 | (355 × 775) / 975 = 282.18 | 355 |
| 80-119 | (138 × 200) / 975 = 28.31 | (138 × 775) / 975 = 109.69 | 138 |
| 120+ | (67 × 200) / 975 = 13.74 | (67 × 775) / 975 = 53.26 | 67 |
| Total | 200 | 775 | 975 |
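The expected counts can be checked with a short Python sketch. Python is not used elsewhere in this chapter, and the variable names below are illustrative only; the observed counts are entered by hand from the cross-tabulation.

```python
# Observed counts: one row per alcohol level, columns are (cases, controls).
observed = [
    [29, 386],
    [75, 280],
    [51, 87],
    [45, 22],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n_total = sum(row_totals)

# expected count = (row total * column total) / N, for every cell.
expected = [[m * n / n_total for n in col_totals] for m in row_totals]

for row in expected:
    print([f"{e:.2f}" for e in row])
```

Because expected counts are built from the margins, each row and column of the expected table sums to the same total as the observed table.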

The differences between observed and expected counts are called **residuals**.

residual = observed - expected

The residuals for the illustrative example are:

| Alcohol consumption | Case | Control |
|---|---|---|
| 0-39 gm/day | 29 - 85.1 = -56.1 | 386 - 329.9 = 56.1 |
| 40-79 gm/day | 75 - 72.8 = 2.2 | 280 - 282.2 = -2.2 |
| 80-119 gm/day | 51 - 28.3 = 22.7 | 87 - 109.7 = -22.7 |
| 120+ gm/day | 45 - 13.7 = 31.3 | 22 - 53.3 = -31.3 |

These residuals show how far each observed count deviates from its expected count.
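As a rough check on the residuals, the subtraction can be sketched in Python. The names are illustrative, and the counts are entered by hand from the cross-tabulation.

```python
# Observed counts: one row per alcohol level, columns are (cases, controls).
observed = [
    [29, 386],
    [75, 280],
    [51, 87],
    [45, 22],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n_total = sum(row_totals)
expected = [[m * n / n_total for n in col_totals] for m in row_totals]

# residual = observed - expected, cell by cell.
residuals = [
    [o - e for o, e in zip(obs_row, exp_row)]
    for obs_row, exp_row in zip(observed, expected)
]

for row in residuals:
    print([f"{r:+.1f}" for r in row])
```

In a table with two columns the two residuals in each row are equal in size and opposite in sign, because observed and expected counts share the same row total.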

**Pearson's chi-square test statistic** is calculated:

$$\chi^2 = \sum \frac{\text{residual}^2}{\text{expected}}$$

For the illustrative example, using the rounded residuals and expected counts:

$$\chi^2 = \frac{(-56.1)^2}{85.1} + \frac{56.1^2}{329.9} + \frac{2.2^2}{72.8} + \frac{(-2.2)^2}{282.2} + \frac{22.7^2}{28.3} + \frac{(-22.7)^2}{109.7} + \frac{31.3^2}{13.7} + \frac{(-31.3)^2}{53.3}$$

$$= 36.98 + 9.54 + 0.07 + 0.02 + 18.21 + 4.70 + 71.51 + 18.38 = 159.41$$
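The hand calculation rounds each residual and expected count to one decimal place before squaring, which inflates the total slightly. A minimal Python sketch (illustrative names, counts entered by hand) that carries full precision gives a value of about 158.95, which agrees with the Epi Info output reported in this chapter.

```python
# Observed counts: one row per alcohol level, columns are (cases, controls).
observed = [
    [29, 386],
    [75, 280],
    [51, 87],
    [45, 22],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n_total = sum(row_totals)

# Pearson's chi-square: sum over all cells of (observed - expected)^2 / expected.
chi_square = 0.0
for obs_row, m in zip(observed, row_totals):
    for o, n in zip(obs_row, col_totals):
        e = m * n / n_total
        chi_square += (o - e) ** 2 / e

print(f"Chi square = {chi_square:.2f}")
```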

**Under the null hypothesis,** Pearson's chi-square statistic has a chi-square distribution with (*r* - 1)(*c* - 1) **degrees of freedom**, where *r*
represents the number of rows in the table and *c* represents the number of columns. Since we are currently testing a 4-by-2 table, *df* =
(4 - 1)(2 - 1) = 3. This information is used to compute a *p* value for the problem. For the illustrative example, the chi-square statistic computed without intermediate rounding is
158.95 with 3 *df*, *p* < .0005. Because *p* is less than alpha = .01, the null hypothesis of independence is rejected.
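To see where the *p* value comes from, the upper-tail area of the chi-square distribution can be computed directly. The sketch below is one possible approach using only Python's standard library, not the method used by Epi Info: it relies on the closed-form recurrence for the regularized upper incomplete gamma function, which applies when the degrees of freedom are a positive integer. The function name `chi2_sf` is hypothetical.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X >= x) for a chi-square variable, integer df >= 1."""
    if x <= 0:
        return 1.0
    h = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-h)               # Q(1, h) = exp(-h)
    else:
        a, q = 0.5, math.erfc(math.sqrt(h))    # Q(1/2, h) = erfc(sqrt(h))
    # Recurrence: Q(a + 1, h) = Q(a, h) + h**a * exp(-h) / Gamma(a + 1)
    while a < df / 2.0:
        q += h ** a * math.exp(-h) / math.gamma(a + 1.0)
        a += 1.0
    return q


df = (4 - 1) * (2 - 1)           # 4-by-2 table
p_value = chi2_sf(158.95, df)    # statistic reported by Epi Info
print(f"df = {df}, p = {p_value:.3g}")
print("Reject H0 at alpha = .01:", p_value < 0.01)
```

As a sanity check, the familiar 5% critical values (3.841 for 1 *df*, 5.991 for 2 *df*, 7.815 for 3 *df*) should each return a tail area close to .05.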

Pearson's chi-square statistic is computed by Epi Info whenever data are cross-tabulated. **Output** for the illustrative example is:

` Chi square = 158.95
Degrees of freedom = 3
p value = 0.00000000 <---`

The chi-square statistic, its associated degrees of freedom, and sample size basis (*N*) are usually reported when presenting chi-square
information. The **APA Publication Manual** suggests the following format: *X*²(*df*, *N* = xxx) = xx.xx, *p* = .xxx. The APA format for
the illustrative example is: *X*²(3, *N* = 975) = 158.95, *p* < .0005.
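The reporting format can be produced mechanically. The helper below is a hypothetical convenience function, not part of any APA or Epi Info tool; it assumes the convention used in this chapter of reporting very small values as *p* < .0005 and of dropping the leading zero from *p*.

```python
def apa_chi_square(chi2, df, n, p, p_floor=0.0005):
    """Format a chi-square result as X²(df, N = n) = xx.xx, p = .xxx."""
    if p < p_floor:
        # Report very small p values as an inequality.
        p_text = "p < " + f"{p_floor:g}".lstrip("0")
    else:
        # Three decimals, leading zero removed.
        p_text = "p = " + f"{p:.3f}".lstrip("0")
    return f"X²({df}, N = {n}) = {chi2:.2f}, {p_text}"


print(apa_chi_square(158.95, 3, 975, 3e-34))
```

The value `3e-34` passed above is only a stand-in for "a *p* value far below .0005"; any value below the floor yields the same string.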

In instances when data are presented in cross-tabulated form, use the web *r×c* Contingency Table calculator at
http://www.physics.csbsju.edu/stats/contingency.html to calculate chi-square statistics.

The chi-square test of independence relies on a normal approximation to the cell counts, and hence assumes that expected cell counts are greater than or equal to 5. When this assumption is not met, an exact test must be used instead (e.g., Fisher's exact test, which is based on the hypergeometric distribution).
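Checking this assumption amounts to scanning the table of expected counts for values below 5. The following Python sketch shows one way to do so; the function names and the second, sparse table are made up for illustration.

```python
def expected_counts(observed):
    """Expected counts under independence: (row total * column total) / N."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n_total = sum(row_totals)
    return [[m * n / n_total for n in col_totals] for m in row_totals]


def small_expected_cells(observed, minimum=5.0):
    """Return (row index, column index, expected count) for cells below the minimum."""
    return [
        (i, j, e)
        for i, row in enumerate(expected_counts(observed))
        for j, e in enumerate(row)
        if e < minimum
    ]


# Illustrative example from this chapter: no expected count falls below 5.
illustrative = [[29, 386], [75, 280], [51, 87], [45, 22]]
print(small_expected_cells(illustrative))

# Hypothetical sparse 2-by-2 table: the first column has expected counts of 1.5.
sparse = [[1, 9], [2, 8]]
print(small_expected_cells(sparse))
```

An empty list means the chi-square approximation is acceptable by this rule; a non-empty list signals that an exact test should be considered.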

(1) Fill in the blank: With a continuous outcome, descriptive statistics are based on sums and averages. With a categorical outcome, descriptive statistics are based on _________________ and ___________________.

ANS: counts and proportions

(2) What type of test is used to determine statistical significance between a continuous dependent variable and categorical independent variable?

ANS: An independent *t* test or analysis of variance.

(3) What type of test is used to determine statistical significance between a continuous dependent variable and continuous independent variable?

ANS: A *t* test can be used via the regression model.

(4) What type of test is used to determine statistical significance between a categorical dependent variable and categorical independent variable?

ANS: A chi-square test, as described in this chapter.

(5) List the (two-sided) null hypotheses used by each of the tests listed in (2) - (4), above.

ANS:

**Independent *t* test:**

(6) List the assumptions required of each of the above tests.

ANS: Using short descriptors,

**Independent t test**: Independence, Normality, Equal Variance

Data come from a survey of smoking and socioeconomic status. Five socioeconomic status groups are considered, with group 1 representing the lowest SES and group 5 representing the highest. Cigarette smoking status is categorized as 1 = current smoker, 2 = non-smoker. Data have already been cross-tabulated, as follows:

| Smoking status | SES 1 | SES 2 | SES 3 | SES 4 | SES 5 |
|---|---|---|---|---|---|
| 1 (smoker) | 17 | 76 | 34 | 32 | 20 |
| 2 (non-smoker) | 40 | 195 | 88 | 53 | 30 |

- (A) Calculate the proportion (prevalence) of smoking within each SES category. (*Comment*: Column totals must first be calculated. Prevalence represents the relative frequency of smoking within each category group.)
- (B) By hand, calculate the table of expected values. Are any expected counts less than 5?
- (C) Using the Web calculator, test the hypothesis of no association. State the null and alternative hypotheses. Let alpha = .05. Report the chi-square statistic, its degrees of freedom, and the total sample size, using the suggested APA format. State your conclusion. Is there a significant difference in the proportion of smokers by SES?

You are familiar with this data set from its illustrative use in the chapter. In addition to alcohol consumption, this study considered tobacco consumption (variable TOB: 1 = 0-9 gms/day, 2 = 10-19 gms/day, 3 = 20-29 gms/day, 4 = 30+ gms/day). The case status of subjects is contained in variable CASE (1 = case, 2 = control).

- (A) Cross-tabulate the data, listing tobacco consumption status along rows and case status along columns. Report the cross-tabulation.
- (B) Report the distribution of tobacco consumption percentages by case status.
- (C) Test the hypothesis of no association. State the null and alternative hypotheses. Let alpha = .05. Report the chi-square statistic, its degrees of freedom, and the total sample size, using the suggested APA format. State your conclusion. Is there a significant difference in tobacco consumption between cases and controls? How so (i.e., which group tends to consume more tobacco)?

You are familiar with this data set from previous examples. Briefly, data are from a respiratory health survey of children and adolescents from the East Boston, MA, area. Let us now focus on the relationship between SMOKE (dependent variable) and SEX (independent variable).

- (A) Cross-tabulate the data, listing the dependent variable along the rows and the independent variable along the columns of the table.
- (B) Report the proportion of boys and girls who smoke.
- (C) Perform a test of association. State the null and alternative hypotheses. Let alpha = .05. Report the chi-square statistic, its degrees of freedom, and the total sample size. (Use APA format, if possible.) State your conclusion. Are data significant? If so, how?

American Psychological Association [APA]. (1994). *Publication Manual* (4th ed.). Washington, DC: Author.

Breslow, N. E., & Day, N. E. (1980). *Statistical Methods in Cancer Research. Volume 1--The Analysis of Case-Control Studies*. Lyon:
International Agency for Research on Cancer.

Chang, C. L., Selvin, S., & Langhauser, C. (1983). *Biology and Public Health Statistics: BioEnv 130A*. Unpublished instructional
material, University of California, Berkeley.

Tuyns, A. J., Péquignot, G., & Jensen, O. M. (1977). Le cancer de l'oesophage en Ille-et-Vilaine en fonction des niveaux de
consommation d'alcool et de tabac. Des risques qui se multiplient. *Bull Cancer*, 64, 45-60.

Zar, J. H. (1996). *Biostatistical Analysis* (3rd ed.). Upper Saddle River, NJ: Prentice Hall.