Comp Public Health Stat (HS267) Lab Notes and Keys 4/17/07   


The lab is an important part of the course. You should complete each lab each week before consulting these notes. I do not want students merely copying these notes, for this would defeat the purpose of the lab. (This is your chance to make errors and correct them.) The goal of the lab is not to achieve the correct "answer." Instead, the goal is for you to actively engage the logic involved in each data analysis. The posting of these notes is merely a trial balloon. Please use these notes judiciously.

Lab 1: Variances and Means (pp. 3 - 11)

Lab 1, Part 1 (Basic stuff) 

pp. 3 - 5 calculate group means and standard deviations: x-bar1 = 50; x-bar2 = 50

Sum of squares

Obs     Values   Deviations   Squared Deviations
1       30       -20          400
2       40       -10          100
3       50         0            0
4       60        10          100
5       70        20          400
Sums    250        0          1000

 

s2² = SS2 / (n2 - 1) = 1000 / (5 - 1) = 250

s2 = sqrt(s2²) = sqrt(250) = 15.81 [Compare the standard deviation of this group to that of group 1.]

Note: The variance carries "units squared." The standard deviation carries the same units as the initial measurement.
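
If you want to check the hand calculation by computer, here is a minimal Python sketch of the sum-of-squares table above. Python is not part of the Lab Workbook, and the variable names are illustrative only.

```python
import math

# Group 2 observations from the sum-of-squares table
values = [30, 40, 50, 60, 70]

n = len(values)
mean = sum(values) / n                      # group mean
deviations = [x - mean for x in values]     # each value minus the mean
ss = sum(d ** 2 for d in deviations)        # sum of squared deviations
variance = ss / (n - 1)                     # sample variance (units squared)
sd = math.sqrt(variance)                    # standard deviation (original units)

print("mean =", mean)
print("SS =", ss)
print("variance =", variance)
print("sd =", round(sd, 2))
```

The deviations always sum to zero, which is a handy check on the arithmetic before squaring.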

pp. 5 - 6 Read carefully

Lab 1, Part 2 (Linoleic acid lowers LDL cholesterol) 

p. 7 Exploration with stemplot

Group 1      Group 2
         |4|0
         |4|5
         |5|04
     9888|5|6
    43100|6|04
       75|6|
        0|7|
         (×1)
       (mmol/m3)

How do the groups compare? Group 2 has lower values on average. (I've underlined the locations of the medians in the stemplots.) The spreads seem comparable.

p. 8 Descriptive statistics (mean and standard deviation)

Calculate the mean and standard deviation of group 2 (p.8). You should show the work involved in calculations. Here's the final descriptive statistics table:

Group                       n     mean (mmol/m3)   s (mmol/m3)
1 (Rassias data)            12    6.192            0.3919
2 (Controls / fictitious)   7     5.271            0.8381

p. 9 Test of variances F-ratio test of H0: σ1² = σ2²; Fstat = 0.8381² / 0.3919² = 4.573 w/ df1 = 7 - 1 = 6 and df2 = 12 - 1 = 11; 0.01 < P < 0.025. Significant? Yes.

p. 10 Test of means H0: μ1 = μ2 vs. H1: μ1 ≠ μ2 by Welch modified t test; SEmean dif = sqrt(0.3919² / 12 + 0.8381² / 7) = 0.3364; tstat = (6.192 - 5.271) / 0.3364 = 2.74; df by conservative hand-based method = 6 (the smaller of n1 - 1 and n2 - 1), so 0.025 < P < 0.05. Significant? Yes.
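
The test statistics on pp. 9 - 10 depend only on the descriptive statistics table, so they are easy to recompute. The sketch below is one way to do it in Python (not part of the Lab Workbook; variable names are illustrative). It does not compute P-values; those still come from your F and t tables.

```python
import math

# Descriptive statistics from the table on p. 8
n1, mean1, s1 = 12, 6.192, 0.3919   # group 1
n2, mean2, s2 = 7, 5.271, 0.8381    # group 2

# F-ratio test of equal variances: larger variance goes in the numerator
f_stat = s2 ** 2 / s1 ** 2
df_num, df_den = n2 - 1, n1 - 1

# Welch (unequal variance) t statistic
se_diff = math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)
t_stat = (mean1 - mean2) / se_diff

# Conservative hand-calculation df: the smaller of n1 - 1 and n2 - 1
df_conservative = min(n1 - 1, n2 - 1)

print("F =", round(f_stat, 3), "with", df_num, "and", df_den, "df")
print("SE =", round(se_diff, 4))
print("t =", round(t_stat, 2), "with", df_conservative, "df")
```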

p. 11 SPSS computations.  Scrutinize your output. We have covered all the statistics on this output except Levene's F. (We'll cover this in two weeks.)  Look at both "the equal variance assumed" and "equal variance not assumed" procedures. These test H0: μ1 = μ2 and provide confidence intervals for μ1 - μ2.

Lab 2: ANOVA

p. 12 Data This page explains the setup for an analysis of variance (ANOVA) problem. Memorize all terms that feel shaky.

pp. 13 - 14 Side-by-side boxplots (checking for outside values) 

Group 1 five-number summary: 9.1  10.7  11.8  12.0  12.1
IQR = Q3 - Q1 = 12.0 - 10.7 = 1.3
Fence(upper) = Q3 + 1.5(IQR) = 12.0 + 1.5(1.3) = 13.95
Fence(lower) = Q1 - 1.5(IQR) = 10.7 - 1.5(1.3) = 8.75

Any outside? No

 

Group 2 five-number summary: 12.8   13.0   13.4   13.6   14.4   
IQR = 13.6 - 13.0 = 0.6

Fence(upper) = 13.6 + (1.5)(0.6) = 14.5

Fence(lower) = 13.0 - (1.5)(0.6) = 12.1

Any outside? No
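
The fence rule used for groups 1 and 2 can be written as a small function. This is a sketch only (function names are made up for illustration). Group 3 is deliberately left out so that you still do it by hand.

```python
def fences(q1, q3):
    """Return (lower fence, upper fence) using the 1.5 x IQR rule."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def any_outside(five_number_summary):
    """True if the minimum or maximum falls beyond a fence."""
    minimum, q1, _median, q3, maximum = five_number_summary
    lower, upper = fences(q1, q3)
    return minimum < lower or maximum > upper


# Five-number summaries from pp. 13 - 14
group1 = (9.1, 10.7, 11.8, 12.0, 12.1)
group2 = (12.8, 13.0, 13.4, 13.6, 14.4)

for name, summary in (("Group 1", group1), ("Group 2", group2)):
    lower, upper = fences(summary[1], summary[3])
    print(name, "fences:", round(lower, 2), round(upper, 2),
          "any outside?", any_outside(summary))
```

Checking only the minimum and maximum is enough here, because if the extremes are inside the fences then every other value is too.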

 

Group 3 five-number summary:  8.5     8.6     9.2     9.6     9.8  [this was in error in the S06 printing of the workbook]

IQR = Q3 - Q1 =

Fence(upper) =

Fence(lower) =

Any outside?

How do groups compare? Locations (averages) demonstrate clear differences. There is little overlap in the groups. The small samples (ni = 5) preclude robust statements about shape and location.

p. 15 SPSS Explore command  Scrutinize the output!

Group          n    mean      standard deviation
1 (standard)   5    11.140    1.2700
2 (junk)       5    13.440    0.6229
3 (health)     5    9.140     0.5814

p. 16 Analysis of variance (ANOVA). Pay attention to the notation used to express the null and alternative hypotheses. For this particular data set, H0: μ1 = μ2 = μ3 vs. H1: at least one of the population means is different. In words, H0 = the three populations have equivalent means and H1 = H0 is false. Keep this in mind, for it makes no sense to test a hypothesis you can't remember.

p. 16 - 17 Between group statistics Calculate the sum of squares between for the data set we have been considering. 

SSB = (5)(11.14 - 11.24)² + (5)(13.44 - 11.24)² + (5)(9.14 - 11.24)² = 0.05 + 24.20 + 22.05 = 46.30
dfB = 3 - 1 = 2
s²B = 46.30 / 2 = 23.15.

p. 17 Within group statistics 

There are two errors in the section "Within group statistics." The formula for the sum of squares within should be SSW = Σ (dfi)(s²i), where dfi = ni - 1 and s²i is the variance in group i. The formula for the within-group variance is described properly in words, but the wrong equation appears. It should be s²W = SSW / dfW. Sorry!

Calculate the sum of squares within for the data set.  SSW = (5-1)(1.27)² + (5-1)(0.623)² + (5-1)(0.581)² = 6.452 + 1.553 + 1.350 = 9.355, or about 9.36

The "degrees of freedom within" is equal to the total number of individuals (N) minus the number of groups (k). dfW = 15 - 3 = 12

The within-group variance (s²W) is the sum of squares within (SSW) divided by the degrees of freedom within (dfW). s²W = 9.36 / 12 = 0.78

My copy of the Lab Workbook includes 2 blank pages at this point -- a mistake in printing. 

p. 18 ANOVA table

Source    Sum of Squares   Degrees of freedom   Mean Squares
Between   46.300           2                    23.150
Within    9.356            12                   0.780

Fstat = 23.15 / 0.78 = 29.69 with 2 and 12 dfs; P < 0.001

Is the difference between groups significant? Yes
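
The whole ANOVA table can be rebuilt from the group sizes, means, and standard deviations. The following Python sketch shows the logic (it is an illustration, not SPSS output; names are made up, and the group 1 mean is taken as 11.14, the value used in the SSB calculation above).

```python
# (n, mean, standard deviation) for each group
groups = [
    (5, 11.14, 1.2700),   # 1 (standard)
    (5, 13.44, 0.6229),   # 2 (junk)
    (5, 9.14, 0.5814),    # 3 (health)
]

N = sum(n for n, _, _ in groups)          # total number of individuals
k = len(groups)                           # number of groups
grand_mean = sum(n * m for n, m, _ in groups) / N

# Between-group and within-group sums of squares
ssb = sum(n * (m - grand_mean) ** 2 for n, m, _ in groups)
ssw = sum((n - 1) * s ** 2 for n, _, s in groups)

df_b, df_w = k - 1, N - k
ms_b, ms_w = ssb / df_b, ssw / df_w       # mean squares (variances)
f_stat = ms_b / ms_w

print("SSB =", round(ssb, 3), "dfB =", df_b, "MSB =", round(ms_b, 3))
print("SSW =", round(ssw, 3), "dfW =", df_w, "MSW =", round(ms_w, 3))
print("F =", round(f_stat, 2))
```

Small differences in the last decimal place relative to the hand calculation come from rounding the standard deviations before squaring.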

 

What does this suggest about the effect of diet? At least one of the diets is associated with a different amount of weight gain. 

p. 18 ANOVA in SPSS (label your output) 

 

p. 19 Conditions for inference  

 

Distributional assumptions

 

Description of the Normality assumption - ANOVA requires the sampling distribution of means to be  approximately Normal. The population distributions may deviate from Normality if the sample is large enough for the Central Limit Theorem to be effective.

 

Description of the equal variance assumption - The k populations have approximately equal variances allowing us to pool sample variances to come up with a single estimate of the "variance within." 

Validity assumptions

 

No selection bias - The samples represent independent simple random samples from the populations. (For experimental data, the samples represent random samples of eligible participants.)

 

No information bias - The data are accurate and measure what they say they do. 

 

No confounding - The groups being considered are similar with respect to all determinants of the outcome except for the explanatory factor  being studied.

Lab 3: ANOVA topics

p. 20 Data (Background; setting up post hoc comparisons)

p. 21 - 22 Least Squares Difference method

For the current data, s2W = 0.78 (from the ANOVA table)

Calculate the standard error of the differences... SE = sqrt([ 0.78* (1/5+ 1/5)]) = 0.5586 [When groups have equal sample sizes, you need to calculate only one SE. When groups have unequal sizes, the SEs will differ and need to be calculated for each comparison.]

Calculate the t statistic for testing H0: μ1 = μ2: tstat = (11.14 - 13.44) / 0.5586 = -4.12 with dfw = N - k = 15  - 3 = 12; P < 0.01

Calculate the t statistic for testing H0: μ1 = μ3: tstat = (11.14 - 9.14) / 0.5586 = 3.58 with dfw = N - k = 15  - 3 = 12; P  < 0.005

Calculate the t statistic for testing H0: μ2 = μ3: tstat = (13.44 - 9.14) / 0.5586 = 7.69 with dfw = N - k = 15  - 3 = 12; P close to 0

Hypothesis     tstat (12 df)   P
H0: μ1 = μ2    -4.12           0.0014
H0: μ1 = μ3    3.58            0.0038
H0: μ2 = μ3    7.69            close to 0

 

Interpret your results: The evidence against H0 is significant in each case. There was the most weight gain with the junk food (13.4 grams), and least with the health food (9.1 grams).
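
The three post hoc t statistics on pp. 21 - 22 all share one standard error because the groups are the same size. A short Python sketch of the calculation follows (illustrative names; P-values are not computed here and still come from the t table or SPSS).

```python
import math
from itertools import combinations

ms_within = 0.78          # within-group variance from the ANOVA table
n_per_group = 5           # every group has n = 5
means = {1: 11.14, 2: 13.44, 3: 9.14}

# One standard error serves all comparisons when group sizes are equal
se = math.sqrt(ms_within * (1 / n_per_group + 1 / n_per_group))

# t statistic for every pair of groups (i, j) with i < j
t_stats = {
    (i, j): (means[i] - means[j]) / se
    for i, j in combinations(sorted(means), 2)
}

print("SE =", round(se, 4))
for (i, j), t in t_stats.items():
    print("group", i, "vs group", j, ": t =", round(t, 2))
```

The last statistic may differ from the hand value in the second decimal place because of rounding.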

p. 23 Bonferroni's adjustment

 

For H0: μ1 = μ2; P = 0.0014 × 3 = .0042; highly significant (reject H0)
For H0: μ1 = μ3; P = 0.0038 × 3 = .0114; significant
For H0: μ2 = μ3; P ≈ .0000 × 3 = .0000; highly significant

 

At a family-wise α of 0.05, which of the above comparisons are significant? All

 

At a family-wise α of 0.01, which comparisons are significant? All but the second.
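
Bonferroni's adjustment is simple enough to script: multiply each P-value by the number of comparisons (capping at 1) and compare with the family-wise alpha. Here is a minimal Python sketch under that reading. The third P-value is entered as 0.0000 because that is all the precision the output gives; names are illustrative.

```python
# Unadjusted P-values for the three pairwise comparisons
p_values = {(1, 2): 0.0014, (1, 3): 0.0038, (2, 3): 0.0000}

m = len(p_values)   # number of comparisons in the family

# Bonferroni-adjusted P-values, capped at 1
adjusted = {pair: min(1.0, p * m) for pair, p in p_values.items()}


def significant(alpha):
    """Pairs whose adjusted P-value is at or below the family-wise alpha."""
    return [pair for pair, p in adjusted.items() if p <= alpha]


print(adjusted)
print("alpha = 0.05:", significant(0.05))
print("alpha = 0.01:", significant(0.01))
```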

 

Label the output to connect test procedures with the output....Labeling not shown because of technical difficulty in posting hand markups.

 

p. 24  Levene's test (H0: σ1² = σ2² = σ3²)     Levene's Fstat =  2.45     df1 = 2     df2 = 12     P = 0.13    Is the evidence that variances differ significant? No

p. 25 - 26 Read carefully

p. 27 - 28 Alcohol consumption and SES

What two [ANOVA] conditions have been violated? 1. The Normality condition. 2. The equal variance condition

What is the alternative hypothesis? H1: at least one of the SES populations has a different location in terms of alcohol consumption

Run the Kruskal-Wallis procedure... 

P = 0.099

Do the central locations of the populations differ significantly? The difference is marginally significant.

Lab 4: Correlation and regression

p. 29 Which of these variables is the independent variable? CIG     Which variable is the dependent variable? LUNGCA

p. 30 Plot the data...  

 

Can the relation ... be described by a straight line? Yes

Does the relation appear to be positive, negative, or "nil"? Positive

Are there any outliers in the data? If yes, which country / countries? Possibly(?) the U.S.

p. 31: read carefully 

p. 32. There is a typographical error in the current edition of the Lab Workbook in the formula at the top of the page. The correct formula is r = SSxy / sqrt[(SSxx)(SSyy)].

Calculate the correlation coefficient: r = (32717) / sqrt [(1432255)(1375)] = 0.737, or about 0.74

Is the correlation positive, negative, or about zero? positive

Describe the strength of the association for the data. strong correlation between cig. consumption and lung cancer mortality

Write the null hypothesis using statistical notation. H0: ρ = 0

Calculate these statistics... tstat = 0.737 / sqrt[(1 - 0.737²) /(11 - 2)] = 3.27; df = 11 - 2 = 9

Use your t table to determine:  0.002 < P < 0.01

Is the correlation significant at alpha = 0.05? Yes

Is it significant at alpha = 0.01? Yes
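
The correlation coefficient and its t statistic follow directly from the three sums used on p. 32. The sketch below recomputes them in Python. The names ss_xy, ss_xx, and ss_yy are labels chosen here for the cross-product sum and the two sums of squares; they are not Lab Workbook notation.

```python
import math

ss_xy = 32717      # sum of cross-products
ss_xx = 1432255    # sum of squares for X (CIG)
ss_yy = 1375       # sum of squares for Y (LUNGCA)
n = 11             # number of countries

r = ss_xy / math.sqrt(ss_xx * ss_yy)

# t statistic for H0: rho = 0, with n - 2 degrees of freedom
df = n - 2
t_stat = r / math.sqrt((1 - r ** 2) / df)

print("r =", round(r, 3))
print("t =", round(t_stat, 2), "with", df, "df")
```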

Note conditions required for inference. 

p. 33 Label your output by hand... See below. I've labeled two of the boxes for you.  You should be able to interpret all the items on this output.

p. 34 

Calculate the slope estimate: b = 32,717/1,432,255 = 0.0228

Calculate the intercept estimate: a = 20.55 - (0.0228)(603.64) = 6.79

Complete the regression model by filling in these blanks: y-hat = 6.79 + (0.0228)X

With a straight edge, draw the regression line on the scatter plot you created earlier this lab.


Although not requested in the Lab Workbook, the figure shows dotted lines for each residual.

 

Use your regression model to predict the lung cancer mortality in a country with annual cigarette consumption level of 800 cigarettes per capita. y-hat = 6.79 + (0.0228)(800) = 25.03 [per 100,000 p-yrs]

...an increase of one cigarette per capita predicts an increase of 0.0228 lung cancer cases per 100,000 

An increase of 100 cigarettes per capita predicts an increase of 2.28 lung cancer cases per 100,000 [fill in the blank].
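
The slope, intercept, and prediction on p. 34 can be sketched in Python as follows. The coefficients are rounded the same way as in the hand calculation (slope to 4 decimals, intercept to 2) so that the results match the notes; the function name predict is made up for illustration.

```python
ss_xy = 32717            # sum of cross-products
ss_xx = 1432255          # sum of squares for X
x_bar, y_bar = 603.64, 20.55

b = round(ss_xy / ss_xx, 4)          # slope estimate
a = round(y_bar - b * x_bar, 2)      # intercept estimate


def predict(cigarettes_per_capita):
    """Predicted lung cancer mortality per 100,000 person-years."""
    return a + b * cigarettes_per_capita


print("b =", b, " a =", a)
print("prediction at X = 800:", round(predict(800), 2))
print("change per 100 cigarettes:", round(100 * b, 2))
```

The regression line always passes through the point (x-bar, y-bar), which is why the intercept can be found from the two means and the slope.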

 

p. 35: Typo in para. 3, line 2: change "through the X" to "through y-bar".

On your scatterplot, draw this residual as a vertical line from the data point for the U.S. to the regression line. 

 

I have not drawn the line for y-bar on the above plot (as requested in the Lab Workbook) b/c of technical difficulties. You need this to see a regression component of the point.

p. 36 Complete the ANOVA table for the current data.

Source       Sum of Squares   Degrees of freedom   Mean Squares
Regression   747.409          df1 = 1              747.409
Residual     627.319          df2 = 9              69.702
Total        1374.727         df = 10

p. 37 Fstat = 747.409 / 69.702 = 10.723; 0.001 < P < 0.01

p. 38 Confidence interval for the slope

Variance of Y given X: s²Y|x = Residual MS = 69.702

Standard error of the slope: SEb = sqrt(69.702 / 1,432,254.545)  = 0.00698

95% CI for the slope parameter beta = 0.0228 ± (t9,.975)(0.00698) = 0.0228 ± (2.26)(0.00698) = 0.0228 ± 0.0158 = (0.0070, 0.0386)
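
The regression ANOVA on pp. 36 - 38 chains together: residual mean square, then F, then the standard error of the slope, then the confidence interval. A Python sketch of that chain is below. The critical value 2.26 is typed in from the t table (9 df) rather than computed, and the variable names are illustrative.

```python
import math

ss_regression = 747.409
ss_residual = 627.319
n = 11

df_regression = 1
df_residual = n - 2

ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual        # variance of Y given X
f_stat = ms_regression / ms_residual

ss_xx = 1432254.545                            # sum of squares for X
se_b = math.sqrt(ms_residual / ss_xx)          # standard error of the slope

b = 0.0228                                     # slope estimate from p. 34
t_crit = 2.26                                  # t table value, 9 df, 95% confidence
margin = t_crit * se_b
ci = (b - margin, b + margin)

print("MS residual =", round(ms_residual, 3))
print("F =", round(f_stat, 3))
print("SE of slope =", round(se_b, 5))
print("95% CI for slope =", round(ci[0], 4), "to", round(ci[1], 4))
```

Because the interval excludes 0, it agrees with the F test in rejecting a slope of zero.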

p. 39 Label your output showing the values for s2Y|x (Mean Square Residual), SEb, a, b, P, and confidence limits for b

p. 40 water.sav -- Plot the graph with SPSS

How would you describe the relation in words? The scatter plot reveals a strong curved negative relation between fluoride levels and cavity rates. The steepest decline occurs between 0 and 1 ppm of fluoride. The decline levels off after this point. There is one clear outlier with value (0.1, 37); it turns out this was a data entry error.

p. 41 Plot the ln-ln transformed data with outlier removed

Is the transformed relation linear? Yes.

r² = 0.95 [Typo - The superscript 2 was initially printed as a subscript 2.]

b = -0.409

Regression equation is ln(y-hat) = 5.805 + (-0.409)lnX (not requested, but interesting)

p. 42 Plot [range restricted] data with outlier removed

r² = 0.86        b = -528.1

Which model, the ln-ln transformed model or the range-restricted model, has a better fit as determined by r²? The ln-ln transformed data.

Which model do you prefer? See footnote p. 42.

Lab 5: Cross-tabulated counts

p. 43 - read carefully

p. 44 R-by-C table. Fill in the table with the counts you just cross-tabulated.

Calculate the marginal percents for the row variable:

Comment: It should be noted that marginal percents represent population prevalences only where data are derived by a simple random sample of the population. In many instances, however, we will intentionally select a set number of observations per group in order to maximize the power of comparisons. Under such circumstances, the marginal percents will not represent population prevalences. 

p. 45 labeled output is requested

p. 46 read carefully

p. 47 

Describe the relationship (not requested in original printing): The low SES group has the highest prevalence of smoking. Prevalence decreases at SES 2 and again at SES 3, leveling off thereafter. 

p. 48 read carefully

p. 49 read carefully

p. 50 expected values:

              SMOKE
SES      1        2         Total
1        15.30    34.70     50
2        26.01    58.99     85
3        37.33    84.67     122
4        82.92    188.08    271
5        17.44    39.56     57
Total    179      406       585
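
Each expected value is (row total × column total) / grand total. The Python sketch below rebuilds the expected-value table from the margins alone (illustrative names; the observed counts are not needed for this step and are not shown here).

```python
# Marginal totals from the cross-tabulation
row_totals = {1: 50, 2: 85, 3: 122, 4: 271, 5: 57}   # SES levels
col_totals = {1: 179, 2: 406}                         # SMOKE levels

n = sum(row_totals.values())   # grand total

# Expected count under independence for every cell (SES, SMOKE)
expected = {
    (ses, smoke): row_total * col_total / n
    for ses, row_total in row_totals.items()
    for smoke, col_total in col_totals.items()
}

for ses in row_totals:
    print("SES", ses, [round(expected[(ses, smoke)], 2) for smoke in col_totals])
```

The expected values in each row add up to that row's total, which is a quick check on the table.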

 

p. 51 chi-square contributions: 

           SMOKE
SES      1        2
1        1.444    0.637
2        1.385    0.610
3        0.297    0.131
4        0.578    0.266
5        0.011    0.00

 

p. 52   

p. 52 SPSS output - Please label output to identify the statistics you calculated in lab. (Output is not labeled here b/c of the technical difficulty involved when working with graphics files)

Lab 6: Relative risks and odds ratios

p. 54. Read carefully. p-hat2 = 27 / 339 = 0.0796 [approx. 8%]

p. 55 Read carefully

p. 56 Calculate the relative risk for the prison data: RR-hat = 0.4485 / 0.0796 = 5.6315 ≈ 5.63

Comment: Data for the lab problem comes from Smith, P. F., Mikl, J., Truman, B. I., Lessner, L., Lehman, J. S., Stevens, R. W., et al. (1991). HIV infection among women entering the New York State correctional system. Am J Public Health, 81 Suppl, 35-40. In this typical article, the prevalence of HIV in group 1 (IVDU) is 0.4485. The prevalence in group 2 (non-users) is 0.0796. You don't need a stat course to tell that this is a big difference! On p. 56 of the Lab Manual we are asked to make the prevalences into a ratio (RR-hat): RR-hat = 0.4485 / 0.0796 = 5.6315 ≈ 5.63, quantifying that the prevalence of HIV in the "exposed" group is more than 5 times that of the non-exposed group. Note that the prevalence does NOT increase 5.63 times with exposure. The increase is 4.63 fold, not 5.63 fold.

p. 57  ln(RR-hat) = ln(5.6315) = 1.728;  SE = 0.208; 95% confidence interval for lnRR = 1.728 ± (1.96)(0.208) = (1.321, 2.137)

p. 58 Take the anti-logs... 95% CI for RR = e^(1.321, 2.137) = (3.75, 8.47).
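
The confidence interval for a relative risk is built on the natural log scale and then converted back with the anti-log. A minimal Python sketch of pp. 57 - 58 follows. The standard error 0.208 is taken from the notes rather than recomputed, and the names are illustrative.

```python
import math

rr_hat = 5.6315      # relative risk estimate from p. 56
se_ln_rr = 0.208     # standard error of ln(RR-hat), as given on p. 57
z = 1.96             # Normal critical value for 95% confidence

ln_rr = math.log(rr_hat)

# Confidence limits on the log scale
ln_lower = ln_rr - z * se_ln_rr
ln_upper = ln_rr + z * se_ln_rr

# Anti-logs give the limits for the RR itself
ci = (math.exp(ln_lower), math.exp(ln_upper))

print("ln(RR-hat) =", round(ln_rr, 3))
print("95% CI for lnRR =", round(ln_lower, 3), "to", round(ln_upper, 3))
print("95% CI for RR =", round(ci[0], 2), "to", round(ci[1], 2))
```

The resulting interval is not symmetric around RR-hat. That is expected, because the symmetry holds on the log scale only. Limits may differ from the hand values in the last decimal place because of rounding.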

p. 59 SPSS output. Please label the output with the statistics you calculated in this lab. (Output is not labeled in these html notes because of the technical difficulty involved.)

p. 60 Read carefully

p. 61 OR-hat = 1.961 ≈ 1.96; ...how much does tobacco exposure at this level increase the risk of esophageal cancer? 96%
ln(OR-hat) = ln(1.961) = 0.6735; SElnOR = 0.1768.

p. 62 The 95% confidence interval for lnOR = 0.6735 ± (1.96)(0.1768) = 0.6735 ± 0.3465 = (0.327, 1.020); 95% CI for OR = e^(0.327, 1.020) = (1.39, 2.77)
Interpret your results. The odds ratio parameter is between 1.39 and 2.77, with 95% confidence.

pp. 63 - 64. SPSS output: print and label as directed

p. 65 Matched-pairs (fruit and vegetable consumption and recurrence of colon polyps)

Point estimate: OR-hat = 45 / 24 = 1.875 ≈ 1.88. Low fruit/veggie consumption is associated with an 88% increase in risk.

Confidence interval:

Yes, data support the theory that low-fruit and vegetable consumption is a risk factor for colon polyps.

Lab 7: Stratified analysis: confounding and interaction (partial)

Source of data is Bickel, P. J., Hammel, E. A., & O'Connell, W. (1975). Sex bias in graduate admission: data from Berkeley. Science, 187, 398-404.

p. 66 - Read carefully. 

pp. 67 - 70 SKIP (Spring '07)

p. 71 

Calculate the incidence of acceptance in males. p-hat1 = 534 / 1198 = 0.4457
Calculate the incidence of acceptance in females. p-hat2 = 113 / 449 = 0.2517
Does there appear to be a gender bias? Yes. How so? Favoring males

p. 72 Read carefully!

p. 73 

Calculate the incidence of acceptance for males who applied to Major F:  p-hat1, Major F  = 22 / 373 = 0.05898
Calculate the incidence of acceptance for females who applied to Major F: p-hat2, Major F  = 24 / 341 = 0.07038
The RR of acceptance for males to major F is RR-hat Major F  = 0.05898 /  0.07038 = 0.84
Does there appear to be any gender bias? Yes. If so, how? Favoring females
Read the remainder of the page carefully! In class we will see how to combine these two strata-specific RRs:  RR-hat Major A  =  0.75 and RR-hat Major F  =  0.84
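
To see the reversal in numbers, the crude RR from p. 71 can be set beside the Major F RR from p. 73. The Python sketch below uses only the counts quoted in these notes; the function name risk_ratio is made up for illustration.

```python
def risk_ratio(cases1, n1, cases2, n2):
    """Incidence in group 1 divided by incidence in group 2."""
    return (cases1 / n1) / (cases2 / n2)


# Crude (all applicants combined): males vs. females, p. 71
rr_crude = risk_ratio(534, 1198, 113, 449)

# Stratum-specific: applicants to Major F only, p. 73
rr_major_f = risk_ratio(22, 373, 24, 341)

print("crude RR =", round(rr_crude, 2))
print("Major F RR =", round(rr_major_f, 2))
```

The crude RR is above 1 (favoring males) while the Major F RR is below 1 (favoring females). A change of direction like this between the crude and stratified results is the signal to think about confounding by major.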

Lab 8

Lab 9

Lab 10