Introduction  Confidence Interval  p Values  Power and Sample Size  Exercises
We consider two independent groups derived by either a cohort or cross-sectional sample. One group is exposed to a factor while the other is nonexposed. Each individual is classified as diseased or not diseased according to defined criteria. Data are cross-tabulated with cells labeled as follows:
              Disease +   Disease -   Total
Exposure +    a           b           n_{1}
Exposure -    c           d           n_{2}
Total         m_{1}       m_{2}       N
Let p_{1} represent the proportion with disease in the exposed group (p_{1} = a / n_{1}) and p_{2} represent the proportion with disease in the nonexposed group (p_{2} = c / n_{2}). The ratio of p_{1} to p_{2} (the proportion ratio) is often referred to as the risk ratio or relative risk:
rr = p_{1} / p_{2}
Notation: Lower case acronyms denote estimators, while upper case represent parameters. Thus, rr represents the risk ratio estimate and RR denotes the risk ratio parameter.
Illustrative data (TOXIC.REC). As an example we consider a cohort of cancer patients undergoing bone marrow ablation with the drug cytarabine (Jolson et al., 1992). One group is exposed to (i.e., treated with) a generic drug while the other group uses the innovator manufacturer's product (and is thus nonexposed). Exposure information is stored in the variable GENERIC (exposed: 1 = yes, 2 = no). Disease information, denoting cerebellar toxicity, is stored in the variable TOX (1 = yes, 2 = no). The first three records and the last record of the data set are:
REC  GENERIC  TOX
---  -------  ---
1    1        1
2    1        2
3    1        2
.    .        .
59   1        2
Data are cross-tabulated with the command:
EPI6> TABLES <exposure> <disease>
where <exposure> and <disease> represent the names of the exposure and disease variables, respectively.
For example, to cross-tabulate the current data, issue the command:
EPI6> TABLES GENERIC TOX
Output is:
             TOX
GENERIC |    1    2 | Total
--------+-----------+------
      1 |   11   14 |    25
      2 |    3   31 |    34
--------+-----------+------
  Total |   14   45 |    59
Thus, the incidence of toxicity in the exposed group (p_{1}) = 11 / 25 = 0.440, the incidence in the unexposed group (p_{2}) = 3 / 34 = 0.088, and rr = (11 / 25) / (3 / 34) = 4.99 ≈ 5.0, indicating that toxicity was about 5 times as frequent in the exposed group as in the nonexposed group.
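The arithmetic above can be checked with a few lines of Python. This is only a sketch for illustration; the function name risk_ratio is hypothetical and is not part of Epi Info.

```python
def risk_ratio(a, b, c, d):
    """Return (p1, p2, rr) for a 2-by-2 table.

    a, b = diseased / not diseased among the exposed
    c, d = diseased / not diseased among the nonexposed
    """
    n1 = a + b      # size of the exposed group
    n2 = c + d      # size of the nonexposed group
    p1 = a / n1     # incidence proportion, exposed
    p2 = c / n2     # incidence proportion, nonexposed
    return p1, p2, p1 / p2


# Cell counts from the TOXIC.REC cross-tabulation
p1, p2, rr = risk_ratio(11, 14, 3, 31)
print(f"p1 = {p1:.3f}, p2 = {p2:.3f}, rr = {rr:.2f}")
# prints: p1 = 0.440, p2 = 0.088, rr = 4.99
```

Note that the function divides by p2, so it fails when no nonexposed subject is diseased (c = 0).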
Risk ratio estimates are printed in the output below the 2-by-2 table. For the illustrative example:
RISK RATIO(RR)(Outcome:TOX=1; Exposure:GENERIC=1) 4.99
95% confidence limits for RR 1.55 < RR < 16.03
The confidence interval assumes data are free of biases. Since this is unrealistic, the confidence interval should be viewed as a rough estimate of the parameter.
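The printed limits can be reproduced, to two decimal places, with the usual Taylor series (log-transformed) interval for a risk ratio. Whether Epi Info uses exactly this method is an assumption here; the sketch below simply shows that this formula agrees with the output for these data. The function name is hypothetical.

```python
import math


def rr_confidence_interval(a, b, c, d, z=1.96):
    """Risk ratio with an approximate confidence interval.

    Uses the standard error of ln(rr) from the Taylor series
    approximation: sqrt(1/a - 1/n1 + 1/c - 1/n2).
    Requires a > 0 and c > 0.
    """
    n1, n2 = a + b, c + d
    rr = (a / n1) / (c / n2)
    se_ln_rr = math.sqrt(1 / a - 1 / n1 + 1 / c - 1 / n2)
    lower = math.exp(math.log(rr) - z * se_ln_rr)
    upper = math.exp(math.log(rr) + z * se_ln_rr)
    return rr, lower, upper


rr, lower, upper = rr_confidence_interval(11, 14, 3, 31)
print(f"rr = {rr:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
# prints: rr = 4.99, 95% CI = (1.55, 16.03)
```

The interval is symmetric on the log scale, which is why the upper limit lies much farther from the point estimate than the lower limit does.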
Epi Info calculates three different chi-squared statistics to test H_{0}: RR = 1. These are:
                   Chi-Squares   P-values
                   -----------   ----------
Uncorrected:              9.85   0.00169835
Mantel-Haenszel:          9.68   0.00185979
Yates corrected:          8.00   0.00467202
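For readers who want to see where these numbers come from, the following sketch computes the three statistics from the cell counts using the textbook shortcut formulas for a 2-by-2 table. It is not Epi Info's own code, and the function name is hypothetical. The p value for 1 degree of freedom is obtained from the complementary error function, so no external library is needed.

```python
import math


def chi_square_tests(a, b, c, d):
    """Return {name: (chi-square, p value)} for a 2-by-2 table (1 df)."""
    n1, n2 = a + b, c + d          # row totals
    m1, m2 = a + c, b + d          # column totals
    n = n1 + n2
    denom = n1 * n2 * m1 * m2
    diff = abs(a * d - b * c)
    stats = {
        "Uncorrected": n * diff ** 2 / denom,
        "Mantel-Haenszel": (n - 1) * diff ** 2 / denom,
        "Yates corrected": n * max(diff - n / 2, 0) ** 2 / denom,
    }
    # Upper-tail area of a chi-square variate with 1 df:
    # P(X > x) = erfc(sqrt(x / 2))
    return {name: (x2, math.erfc(math.sqrt(x2 / 2)))
            for name, x2 in stats.items()}


for name, (x2, p) in chi_square_tests(11, 14, 3, 31).items():
    print(f"{name}: chi-square = {x2:.2f}, p = {p:.4f}")
```

For the TOXIC.REC counts the three statistics round to 9.85, 9.68, and 8.00, in agreement with the output above.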
Interpretation: Some statisticians use benchmarks to help interpret the p value. Benchmarks of .05 or .01 are common; for example, if p < .01 the association is declared significant (i.e., not easily explained by chance). By either benchmark, each of the above p values provides evidence against H_{0}. More importantly, the p value should NOT be interpreted in isolation; it should be interpreted in light of other evidence (Fisher, 1935).
Assumptions: Chi-square tests assume data are valid (no information bias, no selection bias, no confounding). They also assume sampling independence and expected frequencies greater than or equal to 5. When an expected frequency in the cross-tabulation is less than 5, Epi Info issues the warning: "An expected value is less than 5; recommend Fisher exact results."
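The expected frequency of a cell under H_{0} is its row total times its column total, divided by N. A minimal sketch of this check follows; the function names are hypothetical and the code is not taken from Epi Info.

```python
def expected_counts(a, b, c, d):
    """Expected cell frequencies under H0 for a 2-by-2 table."""
    n1, n2 = a + b, c + d          # row totals
    m1, m2 = a + c, b + d          # column totals
    n = n1 + n2
    return [[n1 * m1 / n, n1 * m2 / n],
            [n2 * m1 / n, n2 * m2 / n]]


def smallest_expected(a, b, c, d):
    """Smallest expected frequency; below 5 suggests using Fisher's test."""
    return min(min(row) for row in expected_counts(a, b, c, d))


# TOXIC.REC: the smallest expected count is 25 * 14 / 59
print(round(smallest_expected(11, 14, 3, 31), 2))   # prints 5.93
```

Because the smallest expected frequency for the cytarabine data exceeds 5, the chi-square results reported above are not flagged.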
Fisher's exact test is based on summing the exact probabilities of all tables that are equally or more extreme than the observed results, assuming the null hypothesis is true and the table's margins are fixed (with fixed margins, these are hypergeometric probabilities). This procedure is explained in Rosner, 1995, p. 376.
Illustrative Data. To illustrate Fisher's test, let us consider a study performed to explore the relation between a drug called Kayexelate(R) and the occurrence of colonic necrosis in postoperative patients (Gerstman et al., 1992). This study compares colonic necrosis rates in exposed and nonexposed postoperative patients. Data are stored in KXNECRO.ZIP in KXNECRO.REC as variables KX (exposed to Kayexelate: Y/N) and NECRO (colonic necrosis: Y/N).
Data are processed with the commands:
EPI6> READ KXNECRO
EPI6> TABLES KX NECRO
Output is:
           NECRO
     KX |    +    - | Total
--------+-----------+------
      + |    2  115 |   117
      - |    0  862 |   862
--------+-----------+------
  Total |    2  977 |   979
Results show that 2 of the 117 Kayexelate-exposed patients experienced colonic necrosis. In contrast, 0 of 862 nonexposed patients experienced colonic necrosis. Thus, p_{1} = 2 / 117 = 1.7% and p_{2} = 0 / 862 = 0.0%. The risk ratio (1.7% / 0.0%) is therefore undefined, with a limit of positive infinity.
Epi Info is unable to calculate a confidence interval for these data but tests H_{0}: RR = 1 with Fisher's test:
Fisher exact: 1-tailed P-value: 0.0141750 <---
              2-tailed P-value: 0.0141750 <---
Comment: Like all statistical tests, Fisher's procedure assumes perfect validity (no confounding, no information bias, no selection bias). It also assumes sampling independence.
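As a check on the Fisher results, the sketch below enumerates every table with the observed margins and sums the relevant probabilities. It assumes the one-tailed test is taken in the direction of more disease among the exposed, and that the two-tailed p value sums all tables no more probable than the one observed; Epi Info's exact rules are not stated in the text, so these are assumptions. The function name is hypothetical.

```python
from math import comb


def fisher_exact(a, b, c, d):
    """Return (one_tailed, two_tailed) Fisher exact p values."""
    n1, n2 = a + b, c + d          # row totals (fixed)
    m1 = a + c                     # first column total (fixed)
    total = comb(n1 + n2, m1)

    def prob(x):
        # Probability that cell a equals x when all margins are fixed
        return comb(n1, x) * comb(n2, m1 - x) / total

    lo, hi = max(0, m1 - n2), min(n1, m1)   # feasible values of cell a
    p_obs = prob(a)
    # Upper tail: tables with at least as many exposed cases as observed
    one_tailed = sum(prob(x) for x in range(a, hi + 1))
    # Two tails: tables no more probable than the observed one
    two_tailed = sum(prob(x) for x in range(lo, hi + 1)
                     if prob(x) <= p_obs * (1 + 1e-9))
    return one_tailed, two_tailed


one, two = fisher_exact(2, 115, 0, 862)
print(f"1-tailed p = {one:.7f}, 2-tailed p = {two:.7f}")
# prints: 1-tailed p = 0.0141750, 2-tailed p = 0.0141750
```

The two p values coincide because the observed table (both cases among the exposed) is the most extreme table possible, and the other two possible tables are each more probable than it.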
The power and precision of inferences depend on the number of subjects in the exposed group (n_{1}), the ratio of nonexposed to exposed subjects (n_{2} / n_{1}), the RR "worth detecting," the incidence in the nonexposed population (p_{2}), and the alpha level of the inference (α). We may use the program EpiTable > Sample > Power calculation > Cohort Study to perform power computations (method based on Fleiss, 1981, pp. 44-45).
Illustrative example. A study has 100 exposed subjects, 100 unexposed subjects (allocation ratio = 100/100 = 1), an expected incidence of 10% in the unexposed group, an α level of 0.05, and an expected RR of 2. Based on these assumptions, EpiTable calculates power = 42.4%. This is considered inadequate. (Power should be at least 80%, preferably 90%.)
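The 42.4% figure can be reproduced with a continuity-corrected normal approximation for comparing two proportions. The sketch below handles only equal group sizes and applies the correction as n' = n - 2 / |p_{1} - p_{2}|; that this is the exact formula EpiTable uses is an assumption, and the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def cohort_power(n, p2, rr, alpha=0.05):
    """Approximate power with n subjects per group (1:1 allocation only)."""
    p1 = rr * p2                              # incidence in the exposed
    diff = abs(p1 - p2)
    p_bar = (p1 + p2) / 2
    n_adj = max(n - 2 / diff, 0)              # continuity correction (assumed form)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z = (diff * sqrt(n_adj)
         - z_alpha * sqrt(2 * p_bar * (1 - p_bar))) / sqrt(
             p1 * (1 - p1) + p2 * (1 - p2))
    return NormalDist().cdf(z)


print(f"power = {cohort_power(100, 0.10, 2):.1%}")   # prints: power = 42.4%
```

Raising n in the call shows how quickly power improves; the same inputs with larger groups can be tried before committing to a design.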
Comment: In determining sample size requirements, the investigator must have some idea of the order of magnitude of proportions he or she is looking for. This knowledge might come from previous research, from an accumulation of clinical experience, from small-scale pilot work, or from readily available sources of statistics (e.g., morbidity surveys). Given at least some information, the investigator can, using his or her imagination and expertise, come up with an estimate of a difference between two proportions that is scientifically or clinically important. Given no information, the investigator has no basis for designing the study intelligently and would be hard put to justify designing it at all (paraphrased from Fleiss, 1981, p. 34).
Sample size calculations can be viewed as "power calculations in reverse." Here, we specify the required power (or precision) to derive a reasonable estimate of the sample size required for a given study. We will use the program EpiTable > Sample > Sample Size > Cohort Study for sample size requirement calculations.
Illustrative example. To achieve 80% power to detect an RR of 2 in a study with an allocation ratio of 1:1 nonexposed to exposed subjects, an expected incidence in the nonexposed group of 10%, and 1 - α = .95, EpiTable determines n_{1} = n_{2} = 219. To achieve 80% power to detect an RR of 3, we need n_{1} = n_{2} = 72.
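Both sample sizes can be reproduced with a continuity-corrected normal approximation for two equal groups. As with any such sketch, the formula below is assumed rather than taken from EpiTable's documentation, the function name is hypothetical, and only a 1:1 allocation ratio is handled.

```python
from math import ceil, sqrt
from statistics import NormalDist


def cohort_sample_size(p2, rr, power=0.80, alpha=0.05):
    """Approximate subjects needed per group (1:1 allocation only)."""
    p1 = rr * p2                              # incidence in the exposed
    diff = abs(p1 - p2)
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    n_uncorrected = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                     + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2 / diff ** 2
    # Continuity correction (assumed form): add 2 / |p1 - p2|
    return ceil(n_uncorrected + 2 / diff)


print(cohort_sample_size(0.10, 2), cohort_sample_size(0.10, 3))   # prints: 219 72
```

The larger the RR worth detecting, the fewer subjects are required, which is why the requirement drops from 219 to 72 per group.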
(1) EAR.ZIP: Otitis Media Clinical Trial (Source of data: Rosner, 1990, p. 68). Data are from a clinical trial on the treatment of acute otitis media in children. Group 1 received a 14-day trial of cefaclor. Group 2 received a 14-day trial of amoxicillin. This information is contained in the variable called AB (1 = cefaclor, 2 = amoxicillin). A total of 278 infected ears were treated, with clearance of infection represented in variable CLEAR (1 = yes, 2 = no). Download the data set and then perform each of the following analyses:
(A) Calculate the incidence of clearance associated with each of the antibiotics. Include a 95% confidence interval for the RR.
(B) Test the risk ratio for significance. Report relevant hypothesis-testing steps.
(C) Briefly summarize your findings.
(2) PRISON.ZIP: Human Immunodeficiency Virus Infection in a Women's Correctional Institution (Smith et al., 1991). A study of HIV infection in women entering the New York State Prison system cross-classified 465 inmates with respect to HIV seropositivity (HIV) and history of intravenous drug use (IVDU). Download this data set and then calculate the prevalence of HIV in each exposure group. Calculate the prevalence ratio. Include a 95% confidence interval. Interpret your findings.
(3) LABOR.ZIP: Induction of Labor and Meconium Staining. Induced labor (by administering pitocin and other hormones) in near-term pregnancies is a common obstetrical procedure which is intended to reduce the risk of complications. Meconium staining during childbirth is a sign of fetal distress. Use the data LABOR.REC to determine whether there is an association between induction (INDUCE) and meconium staining (MECON). Include relevant descriptive and inferential statistics, and summarize your findings in plain English.
(4) OSWEGO.ZIP: Food Poisoning in Oswego, New York (Centers for Disease Control, 1992). Data from an outbreak of gastrointestinal illness following a church supper in upstate New York are reported in OSWEGO.REC. Variables in the data set are self-explanatory (use the VARIABLES command to see variable names). Based on these data, fill in the table below and determine the most likely source of agent.
                 |      Ate Food       |  Did Not Eat Food   |            |                |
Food             | Ill   Total  %      | Ill   Total  %      | Risk Ratio | 95% conf. int. | p*
-----------------+---------------------+---------------------+------------+----------------+-----
Baked Ham        | 29    46     63.0%  | 17    29     58.6%  | 1.1        | 0.7 - 1.6      | .70
Spinach          | ___   ___    ___    | ___   ___    ___    | ___        | ___            | ___
Mashed Potato    | ___   ___    ___    | ___   ___    ___    | ___        | ___            | ___
Cabbage Salad    | ___   ___    ___    | ___   ___    ___    | ___        | ___            | ___
Jell-O           | ___   ___    ___    | ___   ___    ___    | ___        | ___            | ___
Rolls            | ___   ___    ___    | ___   ___    ___    | ___        | ___            | ___
Brown bread      | ___   ___    ___    | ___   ___    ___    | ___        | ___            | ___
Milk             | ___   ___    ___    | ___   ___    ___    | ___        | ___            | ___
Coffee           | ___   ___    ___    | ___   ___    ___    | ___        | ___            | ___
Water            | ___   ___    ___    | ___   ___    ___    | ___        | ___            | ___
Cakes            | ___   ___    ___    | ___   ___    ___    | ___        | ___            | ___
Van. ice cream   | ___   ___    ___    | ___   ___    ___    | ___        | ___            | ___
Choc. ice cream  | ___   ___    ___    | ___   ___    ___    | ___        | ___            | ___
Fruit salad      | ___   ___    ___    | ___   ___    ___    | ___        | ___            | ___
* uncorrected chi-square or Fisher's exact test, as appropriate.
(5) RESTENOS: Restenosis Following Coronary Atherectomy (Zhou et al., 1996). Each year, cardiologists open many clogged arteries only to have these same arteries restenose following surgery. A study sponsored by the NIH / Heart, Lung and Blood Institute was performed to determine whether silent infection with a common virus (cytomegalovirus) was predictive of the regrowth of arterial plaque. In 21 of the 49 patients with serologic evidence of cytomegalovirus infection, regrowth of arterial plaque was noted. In contrast, 2 of the 26 patients without serologic evidence of cytomegalovirus had plaque regrowth. Construct a 2-by-2 table for these data. Then calculate the risk ratio associated with cytomegalovirus infection. Include a 95% confidence interval. (You may use an epidemiological calculator such as Epi Info > STATCALC for your calculation.) Do the data support the theory that subclinical viral infections may play a role in arteriosclerosis?
(6) PHENFORM: Phenformin and Cardiovascular Death (Osborn, 1979). In a clinical trial, 26 out of 204 patients treated with phenformin died of cardiovascular disease. In contrast, 2 of 64 control patients died of cardiovascular disease. Calculate the incidence of cardiovascular death in each group and then calculate the risk ratio associated with phenformin. Include a 95% confidence interval for the RR. In plain English, interpret your results.
(7) SIZECOH: Power and Sample Size Exercises.
(A) Suppose you want to complete a study with α = 0.05, power = 0.8, allocation ratio = 1:1, and a background rate (p_{2}) of 25%. What sample size is needed to detect RR = 2? RR = 3? RR = 4?
(B) What is the power of a study looking for RR = 2, assuming n_{1} = 50, n_{2} = 100, p_{2} = 5%, and α = 0.05? What if the true RR = 2? What if RR = 3?
(8) BIHELM1.ZIP: Bicycle Helmet Use in Two Northern California Counties (Perales et al., 1994). In 1991, 1491 bicyclists were hospitalized for head injuries in California. BIHELM1.REC contains bicycle helmet use data for 1651 bicycle riders in two northern California counties. Data can be downloaded by clicking on the highlighted file name. A code book is included in the ZIP file. After downloading this data set, calculate the helmet-use rate in Santa Clara County and in Contra Costa County. (Report relevant counts and proportions.) Report the incidence ratio. Include a 95% confidence interval, and interpret your results.
(9) OC/MI. A study was conducted to look at the effects of oral contraceptive use (OC) on heart disease in women 40 to 44 years of age. Thirteen incident myocardial infarctions (MI) were found in 5000 current OC users during 3 years of observation. In contrast, 7 cases were seen in 10,000 nonusers. Compare the incidence of MI in the groups using methods learned in this unit. Make certain to summarize your results in plain language.
Fisher, R. A. (1935). The logic of inductive inference. Journal of the Royal Statistical Society, 98, 39-54.
Fleiss, J. L. (1981). Statistical Methods for Rates and Proportions (Second ed.). New York: John Wiley & Sons.
Greenland, S., & Robins, J. M. (1985). Estimation of a common effect parameter from sparse follow-up data. Biometrics, 41(1), 55-68.
Osborn, J. F. (1979). Statistical Exercises in Medical Research. New York: John Wiley & Sons.
Rosner, B. (1990). Fundamentals of Biostatistics (Third ed.). Belmont, CA: Duxbury Press.
Rothman, K. J., & Greenland, S. (1998). Modern Epidemiology (Second ed.). Philadelphia: Lippincott-Raven.
Smith, P. F., Mikl, J., Truman, B. I., Lessner, L., Lehman, J. S., Stevens, R. W., Lord, E. A., Broaddus, R. K., & Morse, D. L. (1991). HIV infection among women entering the New York State correctional system. Am J Public Health, 81 Suppl, 35-40.
Zhou, Y. F., Leon, M. B., Waclawiw, M. A., Popma, J. J., Yu, Z. X., Finkel, T., & Epstein, S. E. (1996). Association between prior cytomegalovirus infection and the risk of restenosis after coronary atherectomy. N Engl J Med, 335(9), 624-630.