This chapter considers the analysis of a binary ("yes/no") outcome measured in n individuals from a single group. Let x represent the number of "positives" in the group and y represent the number of "negatives." Thus, n = x + y and p represents the sample proportion:
p = x / n
When data represent a simple random sample from a large population, p is the estimator of binomial parameter P, and P represents the prevalence proportion or incidence proportion (risk) depending on the nature of the data (see Epidemiology Kept Simple, Chap 6).
Illustrative data. In a sample of 57 people, we find 17 smokers. (Data are stored in PREVSMOK.REC as the variable SMOKER: 1 = current smoker, 0 = non-smoker). Thus, the prevalence of smoking (p) = 17 / 57 = .298.
Frequencies and confidence intervals are derived with the FREQ command. For our illustrative data set, applying FREQ to the variable SMOKER produces:
SMOKER | Freq Percent Cum. 95% Conf Limit
0 | 40 70.2% 70.2% 56.6%-81.6%
1 | 17 29.8% 100.0% 18.4%-43.4%
Total | 57 100.0%
Thus, the prevalence is 29.8% (95% confidence interval for P: 18.4%, 43.4%). The confidence interval appears to be calculated using a mathematical relation between the binomial distribution and the F distribution (Fisher & Yates, 1963; Zar, 1996, p. 524). (CDC's documentation of the procedure is somewhat ambiguous.)
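The binomial-to-F relation mentioned above is one way of writing the exact (Clopper-Pearson) limits, so those limits can also be found by searching directly on the binomial tail probabilities. The following is a minimal sketch under that assumption, not a description of what Epi Info does internally; the function names `binom_cdf`, `bisect`, and `exact_ci` are hypothetical helpers introduced here for illustration.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    if k < 0:
        return 0.0
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def bisect(f, target, increasing, tol=1e-10):
    """Find p in [0, 1] with f(p) = target, for a monotone function f."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if (f(mid) < target) == increasing:
            lo = mid   # root lies to the right of mid
        else:
            hi = mid   # root lies to the left of mid
    return (lo + hi) / 2

def exact_ci(x, n, alpha=0.05):
    """Exact (Clopper-Pearson) limits for a binomial proportion."""
    # Lower limit: the P at which P(X >= x) equals alpha/2 (increasing in P).
    lower = 0.0 if x == 0 else bisect(
        lambda p: 1 - binom_cdf(x - 1, n, p), alpha / 2, increasing=True)
    # Upper limit: the P at which P(X <= x) equals alpha/2 (decreasing in P).
    upper = 1.0 if x == n else bisect(
        lambda p: binom_cdf(x, n, p), alpha / 2, increasing=False)
    return lower, upper

# Illustrative data: 17 smokers among 57 people.
lower, upper = exact_ci(17, 57)
print(f"95% exact limits: {lower:.3f} to {upper:.3f}")
```

If Epi Info does use exact limits, the result should land close to the 18.4% to 43.4% reported in the output above.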
An exact binomial test is used to evaluate H0: P = P0, where P0 represents the binomial parameter under the null hypothesis. A p value for this test can be computed with EpiTable as follows: EpiTable > Probability > Binomial: Proportion vs. Standard. (EpiTable is part of Epi6, but is NOT included in Epi2002.)
Illustrative example. Suppose we want to compare the prevalence of smoking in our sample (p = 17/57 = 29.8%) to that of the United States as a whole (P0). A National Center for Health Statistics survey shows that the prevalence of smoking in US adults is 24.8% (NCHS, 1995, Table 65). We test H0: P = 24.8% by assuming the number of smokers in our random sample is distributed as a binomial random variable with parameters P = .248 and n = 57. Using EpiTable, we derive a two-sided p value of .36 for this problem, which provides little or no support for rejecting H0.
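The binomial tail probabilities behind such a test can be computed directly. The sketch below sums the binomial probabilities for the illustrative data and forms a two-sided p value by doubling the smaller tail. That doubling rule is an assumption made for illustration: conventions for two-sided exact p values differ, and EpiTable's rule is not documented here, so the figure printed need not reproduce the .36 reported above. The helper name `binom_pmf` is hypothetical.

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

x, n, p0 = 17, 57, 0.248  # observed smokers, sample size, null value P0

# Probability of a result at least as high as the one observed, under H0.
upper_tail = sum(binom_pmf(k, n, p0) for k in range(x, n + 1))
# Probability of a result at least as low as the one observed, under H0.
lower_tail = sum(binom_pmf(k, n, p0) for k in range(0, x + 1))

one_sided = upper_tail
# Assumed convention: double the smaller tail, capped at 1.
two_sided = min(1.0, 2 * min(upper_tail, lower_tail))

print(f"one-sided p = {one_sided:.3f}, two-sided p = {two_sided:.3f}")
```

Whichever two-sided convention is used, the p value is far above conventional significance levels, which agrees with the conclusion in the text.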
To derive a 95% confidence interval of P with a margin of error no greater than d, use the formula:
n = (3.84)(P)(Q) / d²
where P is a reasonable advance estimate of the parameter P, Q = 1 - P, and 3.84 = 1.96² is the squared standard normal deviate for 95% confidence.
Illustrative example. To achieve d = .05 assuming P = .25, n = (1.96²)(.25)(1 - .25) / (.05)² = 288.1 (about 288 people).
If no good estimate for P is available, let P = .50 to ensure a sufficient sample size.
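The sample size formula can be checked with a few lines of Python. This is a minimal sketch; the function name `sample_size` and the choice to round up with `ceil` are assumptions made here (rounding up is the cautious choice when the margin of error must not exceed d).

```python
from math import ceil

def sample_size(p, d, z=1.96):
    """n = z^2 * P * Q / d^2 for a confidence interval with margin of error d."""
    return z**2 * p * (1 - p) / d**2

# Illustrative example from the text: P = .25, d = .05.
n_raw = sample_size(0.25, 0.05)
print(round(n_raw, 1), ceil(n_raw))

# No good estimate of P available: use P = .50, which maximizes P * Q.
n_conservative = sample_size(0.50, 0.05)
print(round(n_conservative, 1), ceil(n_conservative))
```

Because P × Q reaches its maximum at P = .50, that choice gives the largest n for a given d, which is why it guards against an undersized sample.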
Cluster sampling randomly selects units composed of smaller elements of interest. Examples of clusters include:
|Cluster||Elements within the cluster|
|Family||Members of the family|
|Carton of eggs||Individual eggs|
|Peach tree||Individual peaches|
|Patient||Multiple samples from same patient|
When cluster sampling is used, variance estimates must be adjusted by a design effect (deff), which describes the relative change in the variance caused by cluster sampling. Use the program EpiTable (available in Epi6 but not in Epi2000) to calculate deff estimates for cluster sampling: select EpiTable > Describe > Proportions > Design Effect.
Illustrative example. Suppose five patients with severe psoriasis each receive a parenteral treatment for their condition, and the following results are observed:
|Patient||Number of lesions cleared by treatment||Number of lesions|
|1||5||12|
|2||4||7|
|3||12||13|
|4||8||15|
|5||5||16|
Each patient represents a cluster and each lesion an element in the cluster. To compute the design effect associated with clustering, select EpiTable > Describe > Proportions > Design Effect. The output is:
Clust Num Den
N° 1 5 12
N° 2 4 7
N° 3 12 13
N° 4 8 15
N° 5 5 16
Global variance : 0.003943
Cluster variance : 0.010739
Design effect : 2.72
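The figures in this output can be reproduced by hand. The sketch below assumes that the "global variance" is the simple random sample variance p(1 - p)/n and that the "cluster variance" is the sum of squared deviations of the cluster proportions from the pooled proportion, divided by k(k - 1) for k clusters. These formulas are inferred from the numbers shown, not taken from EpiTable documentation, and the function name `design_effect` is hypothetical.

```python
cleared = [5, 4, 12, 8, 5]     # Num: lesions cleared, per patient
lesions = [12, 7, 13, 15, 16]  # Den: lesions, per patient

def design_effect(num, den):
    """Return (global variance, cluster variance, deff) for clustered counts."""
    k = len(num)           # number of clusters
    n = sum(den)           # total observations
    p = sum(num) / n       # pooled proportion
    # Variance of p if the n observations were a simple random sample.
    global_var = p * (1 - p) / n
    # Assumed form: between-cluster variance of the cluster proportions
    # around the pooled proportion, scaled by k(k - 1).
    cluster_var = sum((a / m - p) ** 2 for a, m in zip(num, den)) / (k * (k - 1))
    return global_var, cluster_var, cluster_var / global_var

global_var, cluster_var, deff = design_effect(cleared, lesions)
print(f"Global variance  : {global_var:.6f}")
print(f"Cluster variance : {cluster_var:.6f}")
print(f"Design effect    : {deff:.2f}")
```

A design effect above 1 indicates that lesions within the same patient respond more alike than lesions from different patients, so the clustered sample carries less information than 63 independent observations would.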
Thus, deff = 2.72. We then use EpiTable > Describe > Proportions > Cluster Sampling to calculate the confidence interval for P. Here is the output:
Total observations : 63
Design effect : 2.72
Effective sample size : 23
Proportion : 53.9683%
Fleiss quadratic 95% CI [32.7038-73.9924]
Thus, the 95% confidence interval for P is (32.7%, 74.0%). (If we had mistakenly assumed that data were a simple random sample, the confidence interval would have been: 41.0%, 66.4%).
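Both intervals quoted above can be reproduced with the Fleiss quadratic method (a score interval with continuity correction) applied to an effective sample size of n / deff. This is a minimal sketch, assuming the rounded design effect of 2.72 is used and that z = 1.96; the function name `fleiss_quadratic_ci` is hypothetical.

```python
from math import sqrt

def fleiss_quadratic_ci(p, n, z=1.96):
    """Continuity-corrected score (Fleiss quadratic) limits for a proportion."""
    q = 1 - p
    denom = 2 * (n + z**2)
    lower = (2 * n * p + z**2 - 1
             - z * sqrt(z**2 - (2 + 1 / n) + 4 * p * (n * q + 1))) / denom
    upper = (2 * n * p + z**2 + 1
             + z * sqrt(z**2 + (2 - 1 / n) + 4 * p * (n * q - 1))) / denom
    return max(0.0, lower), min(1.0, upper)

p = 34 / 63          # 34 of 63 lesions cleared
deff = 2.72          # design effect from the EpiTable output
n_eff = 63 / deff    # effective sample size, about 23

clustered = fleiss_quadratic_ci(p, n_eff)  # accounts for clustering
naive = fleiss_quadratic_ci(p, 63)         # treats data as a simple random sample

print(f"Effective sample size: {n_eff:.1f}")
print(f"Clustered 95% CI: {clustered[0]:.1%} to {clustered[1]:.1%}")
print(f"Naive 95% CI:     {naive[0]:.1%} to {naive[1]:.1%}")
```

Dividing the sample size by deff widens the interval, which shows how ignoring the clustering would overstate the precision of the estimate.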
(1) ELECT: A pre-election poll of 100 prospective voters shows 55 in favor of Candidate A. Use EpiTable or some other epidemiologic calculator to compute a 95% confidence interval for P, the percentage of the electorate favoring Candidate A. Based on this estimate, do you think results provide reliable evidence of a future victory for Candidate A? Explain your reasoning.
(2) BREASTCA: We expect 2% of women at age 50 to develop breast cancer within 5 years. Suppose that among 1000 women in this age range who have a mother with breast cancer, 32 develop breast cancer. How many cases would be expected in this group? Use EpiTable > Probability > Binomial to determine whether the number of cases is significantly greater than expected.
(3) PREGRATS: A laboratory test of the teratogenicity of an agent shows 12 malformed (rat) pups in a litter of 85. We normally expect a malformation rate of 5% in this species. Do the data provide evidence of teratogenicity? Test H0: P = 0.05 at α = .01, one-sided.
(4) SMOKE.REC: The data set SMOKE.REC records the number of days each client successfully stays smoke-free after a smoking cessation program. (Data are recorded as the variable DAYS.) Download and unzip this data set. Read the data set into EpiInfo and then convert the variable DAYS into a dichotomous outcome indicating whether the person ceased smoking for at least a year. Compute a 95% confidence interval for recidivism proportion P. In your opinion, was the smoking cessation program successful?
(5) BINSIZE. Determine the sample size needed to calculate a 95% confidence interval for a proportion with a margin of error of no more than 10%, assuming an expected proportion of 50%. Recalculate the sample size requirements for the study assuming d = 5%.
(6) EDENTITION: A report of adult dental health in 25- to 34-year-old English women showed 20 of 262 women with missing teeth. Calculate a 95% confidence interval for P.
(7) FEV.ZIP Download the data set FEV. Unzip the ZIP file. A data definition (DD) file is included as one of the files in the ZIP archive. These data are from a survey of respiratory health (Rosner, 1995, p. 40; Tager et al., 1985). For each dichotomous variable in the data set, report relevant counts (x of n) and proportions. Also report a 95% confidence interval for each proportion parameter.
(8) THERAPEUTIC_TOUCH. A study by an 11-year-old girl made headlines for challenging the validity of a type of therapy ("Therapeutic Touch") in which the therapist's hands are passed over a patient's body without actually being laid on the patient, supposedly to manipulate human energy fields (Rosa et al., 1998). In the experiment, Touch Therapists rested their hands, palms up, on a flat surface, approximately 25 to 30 cm apart. To prevent the experimenter's hands from being seen, an opaque screen with cut-outs at its base was placed over each subject's arms and a cloth towel was attached to the screen and draped over the therapist's arms. Each therapist underwent 10 trials in which the 11-year-old investigator hovered her right hand, palm down, 8 to 10 cm above one hand of the therapist and then said "Okay." The Touch Therapist then stated which of his or her hands was nearer the experimenter's hand. Each subject was permitted to take as much or as little time as necessary to make each determination. Results showed 123 successes out of 280 trials, with the number of successes out of 10 trials distributed as follows:
|No. correct (out of 10)||Frequency||No. Correct|
We want to calculate a 95% confidence interval for the proportion of successes accounting for the cluster sample. (Each therapist is a cluster of 10 observations.) To do this, we must calculate the design effect (deff). (You will find deff = 1.12.) Now use EpiTable to calculate the 95% confidence interval taking into account the effect of clustering. Discuss your findings. Is there evidence to contradict the hypothesis of random selection ("detection") of a human energy field?