Continuous Outcome, Two Independent Groups



This chapter considers the comparison of a continuous outcome between two independent groups.

Illustrative data: WCGS.ZIP (Selvin, 1991, p. 41). To illustrate techniques, we consider cholesterol levels (mg/dl) in Type A and Type B men. The data are:

Type A: 233, 291, 312, 250, 246, 197, 268, 224, 239, 239, 254, 276, 234, 181, 248, 252, 202, 218, 212, 325
Type B: 344, 185, 263, 246, 224, 212, 188, 250, 148, 169, 226, 175, 242, 252, 153, 183, 137, 202, 194, 213

The data set contains a numeric dependent (outcome) variable, CHOL, and a dichotomous independent (group) variable, BEHAVIOR. The first three records and the last record of this data set look like this:

REC  CHOL BEHAVIOR
---  ---- --------
  1   233 A
  2   291 A
  3   312 A
 40   213 B

Descriptive Statistics

Descriptive statistics for the two groups are computed with a two-variable MEANS command, applied as follows:

MEANS <DV> <IV>

where <DV> represents the name of the dependent variable and <IV> represents the name of the independent variable.

For the illustrative data set, the command is:

MEANS CHOL BEHAVIOR

Five sections of output are produced (a frequency table, summary statistics, ANOVA table, Bartlett's test, Kruskal-Wallis test). Summary statistics are printed below the frequency table. For the illustrative data, the summary statistics are:

                  MEANS of CHOL for each category of BEHAVIOR

BEHAVIOR          Obs      Total       Mean   Variance    Std Dev
A                  20       4901    245.050   1342.366     36.638
B                  20       4206    210.300   2336.747     48.340
Difference                           34.750

BEHAVIOR      Minimum     25%ile     Median     75%ile    Maximum       Mode
A             181.000    221.000    242.500    261.000    325.000    239.000
B             137.000    179.000    207.000    244.000    344.000    137.000

Thus, n1 = 20 and n2 = 20 (listed under Obs), and the Type A men in the sample have a higher mean cholesterol level than the Type B men (245.1 vs. 210.3 mg/dl). In addition, the Type A group shows less variability than the Type B group (standard deviations: 36.6 vs. 48.3 mg/dl).
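As a cross-check on the Epi Info summary statistics, the same quantities can be recomputed from the raw data. The following is a minimal sketch in Python (not part of Epi Info); the names TYPE_A, TYPE_B, and describe are illustrative choices. Quartiles are left out because their values depend on the interpolation rule a program uses.

```python
import statistics

# Cholesterol levels (mg/dl) copied from the illustrative data above.
TYPE_A = [233, 291, 312, 250, 246, 197, 268, 224, 239, 239,
          254, 276, 234, 181, 248, 252, 202, 218, 212, 325]
TYPE_B = [344, 185, 263, 246, 224, 212, 188, 250, 148, 169,
          226, 175, 242, 252, 153, 183, 137, 202, 194, 213]


def describe(values):
    """Return the summary statistics shown in the MEANS output."""
    return {
        "obs": len(values),
        "total": sum(values),
        "mean": statistics.mean(values),
        "variance": statistics.variance(values),  # sample variance (n - 1)
        "std_dev": statistics.stdev(values),
        "minimum": min(values),
        "median": statistics.median(values),
        "maximum": max(values),
    }


summary_a = describe(TYPE_A)
summary_b = describe(TYPE_B)
mean_difference = summary_a["mean"] - summary_b["mean"]

for label, s in (("A", summary_a), ("B", summary_b)):
    print(f"{label}: n={s['obs']} total={s['total']} mean={s['mean']:.3f} "
          f"variance={s['variance']:.3f} sd={s['std_dev']:.3f}")
print(f"Difference in means: {mean_difference:.3f}")
```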

Inferential Statistics

Confidence Interval

The observed mean difference (34.750 in this instance) is the point estimate of the expected mean difference μ1 − μ2. To calculate a 95% confidence interval for μ1 − μ2, first calculate (by hand) the standard error of the mean difference as follows:

se = SQRT[(MSW)(1/n1 + 1/n2)]

where MSW is the Mean Square Within as reported in Epi Info's ANOVA table:

Variation          SS   df          MS  F statistic    p-value    t-value
Between     12075.625    1   12075.625        6.564   0.013853   2.562113
Within      69903.150   38    1839.557

Total       81978.775   39

For the illustrative data, se = SQRT[(1839.557)(1/20 + 1/20)] = 13.56 mg/dl.
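The hand calculation can be checked from the raw data. The sketch below (Python, with illustrative function names) computes the Mean Square Within as the pooled variance of the two groups and then applies the standard error formula given above.

```python
import math
import statistics

# Cholesterol levels (mg/dl) copied from the illustrative data above.
TYPE_A = [233, 291, 312, 250, 246, 197, 268, 224, 239, 239,
          254, 276, 234, 181, 248, 252, 202, 218, 212, 325]
TYPE_B = [344, 185, 263, 246, 224, 212, 188, 250, 148, 169,
          226, 175, 242, 252, 153, 183, 137, 202, 194, 213]


def mean_square_within(group1, group2):
    """Pooled variance: within-group sum of squares over n1 + n2 - 2."""
    n1, n2 = len(group1), len(group2)
    ss_within = ((n1 - 1) * statistics.variance(group1)
                 + (n2 - 1) * statistics.variance(group2))
    return ss_within / (n1 + n2 - 2)


def se_mean_difference(group1, group2):
    """se = SQRT[(MSW)(1/n1 + 1/n2)]"""
    msw = mean_square_within(group1, group2)
    return math.sqrt(msw * (1 / len(group1) + 1 / len(group2)))


msw = mean_square_within(TYPE_A, TYPE_B)
se = se_mean_difference(TYPE_A, TYPE_B)
print(f"MSW = {msw:.3f}, se = {se:.2f} mg/dl")
```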

A 95% confidence interval for μ1 − μ2 is given by:

(mean difference) ± (t[n1+n2-2, .975])(se)

where (mean difference) = mean1 - mean2, t[n1+n2-2, .975] represents the 97.5th percentile of a t distribution with n1 + n2 - 2 degrees of freedom, and se represents the standard error of the mean difference (described above). Thus, the 95% confidence interval for μ1 − μ2 for the illustrative data = (245.05 - 210.30) ± (t[38, .975])(13.56) = 34.75 ± (2.02)(13.56) = (7.4, 62.1) mg/dl. This interval places the population mean difference between 7.4 and 62.1 mg/dl with 95% confidence.
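The interval arithmetic can be sketched in a few lines of Python. The function name pooled_ci is illustrative, and the critical value is entered by hand as the rounded table value 2.02 used in the text (an assumption of this sketch; the Python standard library has no t quantile function).

```python
import math


def pooled_ci(mean1, mean2, msw, n1, n2, t_crit):
    """Confidence interval for mean1 - mean2 using the pooled standard error."""
    difference = mean1 - mean2
    se = math.sqrt(msw * (1 / n1 + 1 / n2))
    margin = t_crit * se
    return difference - margin, difference + margin


# Means, MSW, and group sizes are taken from the Epi Info output above.
# T_CRIT is the rounded 97.5th percentile of t with 38 df, read from a table.
T_CRIT = 2.02
lower, upper = pooled_ci(245.050, 210.300, 1839.557, 20, 20, T_CRIT)
print(f"95% CI: ({lower:.1f}, {upper:.1f}) mg/dl")
```

Using a more precise critical value would move the limits by less than 0.1 mg/dl.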

Independent t Test

Epi Info calculates the equal variance independent t test for H0: μ1 = μ2 in its ANOVA table:

Variation          SS   df          MS  F statistic    p-value    t-value
Between     12075.625    1   12075.625        6.564   0.013853   2.562113
Within      69903.150   38    1839.557

Total       81978.775   39

Thus, the data yield tstat = 2.56 with 38 degrees of freedom (p = .014). Most investigators would consider this "significant" evidence against H0.
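The link between the ANOVA table and the t statistic can be illustrated by rebuilding both from the raw data. This is a sketch in Python with illustrative names. It shows that, with two groups, the F statistic is the square of the equal variance t statistic. The p-value is not computed here because the standard library has no t distribution function.

```python
import math
import statistics

# Cholesterol levels (mg/dl) copied from the illustrative data above.
TYPE_A = [233, 291, 312, 250, 246, 197, 268, 224, 239, 239,
          254, 276, 234, 181, 248, 252, 202, 218, 212, 325]
TYPE_B = [344, 185, 263, 246, 224, 212, 188, 250, 148, 169,
          226, 175, 242, 252, 153, 183, 137, 202, 194, 213]


def anova_two_groups(group1, group2):
    """Return the ANOVA quantities and the equal variance t statistic."""
    n1, n2 = len(group1), len(group2)
    mean1, mean2 = statistics.mean(group1), statistics.mean(group2)
    grand_mean = (sum(group1) + sum(group2)) / (n1 + n2)

    ss_between = (n1 * (mean1 - grand_mean) ** 2
                  + n2 * (mean2 - grand_mean) ** 2)
    ss_within = ((n1 - 1) * statistics.variance(group1)
                 + (n2 - 1) * statistics.variance(group2))
    df_within = n1 + n2 - 2
    ms_within = ss_within / df_within

    f_stat = ss_between / ms_within  # between-groups df = 1
    t_stat = (mean1 - mean2) / math.sqrt(ms_within * (1 / n1 + 1 / n2))
    return {"ss_between": ss_between, "ss_within": ss_within,
            "df_within": df_within, "ms_within": ms_within,
            "f": f_stat, "t": t_stat}


result = anova_two_groups(TYPE_A, TYPE_B)
print(f"F = {result['f']:.3f}, t = {result['t']:.3f}, "
      f"df = {result['df_within']}")
```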

Assumptions of Confidence Interval and t Test

The above confidence interval and test statistic assume that (1) data are free of bias (information bias, selection bias, and confounding), (2) groups and individuals within groups are independent, (3) the sampling distribution of the mean difference is normal, and (4) variances in the two populations are equal (homoscedasticity). Although violations of assumptions (3) and (4) may affect results, numerous studies have shown that these methods tolerate considerable departures from normality and equal variance while still providing stable results. Robustness to violations of these last two assumptions is good when sample sizes are equal (n1 = n2), samples are large (n > 30), and a two-sided test is used (Zar, 1996, p. 128). Furthermore, statistical tests need not be realistic in order to be useful.

Statistical models are sometimes misunderstood in epidemiology. Statistical models are never true. The question of whether a model is true is irrelevant. A more appropriate question is whether we obtain the correct scientific conclusion if we pretend that the process under study behaves according to a particular statistical model. (Zeger 1991).

Mann-Whitney / Kruskal-Wallis Test

The Mann-Whitney test and the Kruskal-Wallis test (for two samples) are non-parametric analogues of the independent t test. Epi Info computes the Kruskal-Wallis test as part of its MEANS command. Here are the results for the illustrative data:

Mann-Whitney or Wilcoxon Two-Sample Test (Kruskal-Wallis test for two groups)
Kruskal-Wallis H (equivalent to Chi square) =       6.333
                         Degrees of freedom =           1
                                    p value =    0.011853

Thus, χ²stat = 6.33 with 1 degree of freedom (p = 0.012).
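For readers who want to see where H comes from, the statistic can be rebuilt from the ranks of the pooled data. The sketch below is Python with illustrative names, and it assumes the usual tie-corrected form of H. The p-value uses the identity that the upper tail of a chi-square distribution with 1 degree of freedom equals erfc(sqrt(x/2)), so the code applies to the two-group case only.

```python
import math
from collections import Counter

# Cholesterol levels (mg/dl) copied from the illustrative data above.
TYPE_A = [233, 291, 312, 250, 246, 197, 268, 224, 239, 239,
          254, 276, 234, 181, 248, 252, 202, 218, 212, 325]
TYPE_B = [344, 185, 263, 246, 224, 212, 188, 250, 148, 169,
          226, 175, 242, 252, 153, 183, 137, 202, 194, 213]


def midranks(values):
    """Rank values 1..N; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def kruskal_wallis_two_groups(group1, group2):
    """Tie-corrected Kruskal-Wallis H and its p-value for two groups."""
    combined = list(group1) + list(group2)
    n_total = len(combined)
    ranks = midranks(combined)
    rank_sums = [sum(ranks[:len(group1)]), sum(ranks[len(group1):])]
    sizes = [len(group1), len(group2)]

    h = (12 / (n_total * (n_total + 1))
         * sum(r * r / n for r, n in zip(rank_sums, sizes))
         - 3 * (n_total + 1))
    tie_term = sum(t ** 3 - t for t in Counter(combined).values())
    h /= 1 - tie_term / (n_total ** 3 - n_total)  # correction for ties

    p_value = math.erfc(math.sqrt(h / 2))  # chi-square upper tail, 1 df only
    return h, p_value


h_stat, p_value = kruskal_wallis_two_groups(TYPE_A, TYPE_B)
print(f"H = {h_stat:.3f}, p = {p_value:.4f}")
```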

Comment: The Kruskal-Wallis procedure is slightly less powerful than the independent t test when data come from normally distributed populations. The loss of efficiency is surprisingly small when the test is used in non-normal populations.

Bartlett's Test

With two samples, Bartlett's test addresses H0: σ²1 = σ²2, where σ²i represents the variance in population i. Epi Info performs this test whenever the MEANS command is used. Here are the results for the illustrative data:

                  Bartlett's test for homogeneity of variance
    Bartlett's chi square =   1.404  deg freedom =  1   p-value = 0.236005

Thus, χ² = 1.40, df = 1, p = .24. This provides little or no support for rejecting H0.
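Bartlett's statistic can also be reproduced from the two sample variances. The following Python sketch uses illustrative names and assumes the standard textbook form of the statistic with its correction factor. As in the two-group case above, df = 1, so the p-value is obtained from erfc(sqrt(x/2)).

```python
import math
import statistics

# Cholesterol levels (mg/dl) copied from the illustrative data above.
TYPE_A = [233, 291, 312, 250, 246, 197, 268, 224, 239, 239,
          254, 276, 234, 181, 248, 252, 202, 218, 212, 325]
TYPE_B = [344, 185, 263, 246, 224, 212, 188, 250, 148, 169,
          226, 175, 242, 252, 153, 183, 137, 202, 194, 213]


def bartlett_two_groups(group1, group2):
    """Bartlett's chi-square for equal variances and its p-value (df = 1)."""
    groups = [group1, group2]
    k = len(groups)
    dfs = [len(g) - 1 for g in groups]
    variances = [statistics.variance(g) for g in groups]
    df_total = sum(dfs)

    pooled = sum(df * v for df, v in zip(dfs, variances)) / df_total
    m = (df_total * math.log(pooled)
         - sum(df * math.log(v) for df, v in zip(dfs, variances)))
    correction = 1 + (sum(1 / df for df in dfs) - 1 / df_total) / (3 * (k - 1))

    chi_square = m / correction
    p_value = math.erfc(math.sqrt(chi_square / 2))  # valid because k - 1 = 1
    return chi_square, p_value


chi_square, p_value = bartlett_two_groups(TYPE_A, TYPE_B)
print(f"Bartlett's chi square = {chi_square:.3f}, p = {p_value:.3f}")
```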

Comment: Bartlett's test is reliable only when used in normal populations. When the population distribution is platykurtic, the true p value is less than the calculated p value, i.e., the test is conservative (Maurais & Ouimet, 1986). When the population distribution is leptokurtic, the true p value is greater than the calculated p value, i.e., the test is liberal. Because t tests are relatively reliable in the face of unequal variance, many statisticians question the use of Bartlett's test as a preliminary test before the independent t test. Consider:
It has been shown that in the commonly occurring case in which group sizes are equal, or not very different, the [independent t test] is affected surprisingly little by variance inequalities. Since this test is also known to be very insensitive to non-normality it would be best to accept the fact that it can be used safely under most practical conditions. To make the preliminary test on variances is rather like putting to sea in a row boat to find out whether conditions are sufficiently calm for an ocean liner to leave port! (Box, 1953)

Power and Sample Size

Sample Size Requirements

To achieve 80% power at α = 0.05 (two-sided), each group should have:

n = (16 s² / d²) + 1

where d = a "difference worth detecting" and s = a good estimate of the within-group standard deviation (e.g., sp). Suppose we want to detect a difference of 25 units and assume the standard deviation of the outcome variable is 45. Then the required sample size per group is n = (16)(45²) / 25² + 1 = 52.84 ≈ 53.
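The sample size rule is simple enough to wrap in a small helper. This Python sketch (the function name is illustrative) applies the formula above and rounds up to the next whole subject.

```python
import math


def n_per_group(sd, difference):
    """Approximate n per group for 80% power at two-sided alpha = 0.05."""
    return 16 * sd ** 2 / difference ** 2 + 1


n_exact = n_per_group(sd=45, difference=25)
n_required = math.ceil(n_exact)  # always round up to a whole subject
print(f"n = {n_exact:.2f}, so use {n_required} per group")
```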


Power is the probability of achieving a "significant" result under a given set of assumptions when H0 is false. For example, we might ask, "What is the probability of achieving statistical significance at α = .05 (two-sided), assuming μ1 = 50, μ2 = 40, s = 45, and n1 = n2 = 20?" The answer is ".10," meaning the test has only a 10% chance of rejecting the false null hypothesis. Try using a Web power calculator to calculate power for the type of problem presented in this chapter.
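A rough power figure can be obtained without a dedicated calculator. The sketch below is Python with an illustrative function name, and it assumes a normal approximation to the two-sample test rather than the exact noncentral t calculation, so its answers are approximate. For the example above it returns a value near 0.1.

```python
import math
from statistics import NormalDist


def approx_power(mu1, mu2, sd, n_per_group, alpha=0.05):
    """Approximate two-sided power for two equal-sized groups (normal approx.)."""
    std_normal = NormalDist()
    se = sd * math.sqrt(2 / n_per_group)  # standard error of the mean difference
    shift = abs(mu1 - mu2) / se           # true difference in standard errors
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # Probability that the test statistic falls in either rejection region.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)


power = approx_power(mu1=50, mu2=40, sd=45, n_per_group=20)
print(f"Approximate power: {power:.2f}")
```

The same function can be used to check a planned sample size: with a difference of 25, a standard deviation of 45, and 53 subjects per group, it returns a power close to 80%.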


Exercises

(1) TWOGRPS.ZIP. Scores from Two Groups. Two groups demonstrate the following scores on a psychological profile test:

Group 1: 86, 99, 96, 95, 72, 73, 95, 125, 97, 95
Group 2: 110, 126, 89, 106, 98, 105, 93, 127, 130, 92

Computerize these data, remembering to create separate variables for SCORE and GROUP, and then compute the descriptive and inferential statistics described in this chapter. Report your findings in plain language.

(2) FEV.ZIP (Rosner, 1990, p. 40; Tager et al., 1985). Data are from a respiratory health survey of children and adolescents. Codes in the file are as follows:

Variable  Type            Len  Description
--------  --------------  ---  -----------
ID        Integer           5  Identification number
AGE       Integer           2  Age of participant at beginning of the study (years)
FEV       Real (#.####)     6  Forced expiratory volume (liters/second)
HEIGHT    Real (##.#)       4  Height (inches)
SEX       Integer           1  Sex: 0 = female, 1 = male
SMOKE     Integer           1  Current smoking status: 0 = non-smoker, 1 = smoker

Compare the smokers and non-smokers in this file with respect to their age.
