- Whereas ANOVA is used to analyze the relationship between a categorical explanatory variable and quantitative response variable, correlation and regression are used to analyze the relation between a _________________ explanatory variable and ____________________ response variable.
- What symbol denotes the correlation coefficient in the data?
- What symbol denotes the correlation coefficient parameter?
- A *t* statistic for a correlation coefficient has this many degrees of freedom.
- What is *bivariate Normality*?
- What is the range of possible values for *r*?
- Assuming linearity, a correlation coefficient of 0.79 means the correlation is __________ [choices: positive, negative, non-existent] and ___________________ [choices: weak, moderate, strong].
- Assuming linearity, a correlation coefficient of -0.25 means the correlation is __________ and ___________________.
- Why *not* calculate *r* when data are not linear?
- Do you have to assume bivariate Normality to test the correlation coefficient for significance?
- Do you have to assume bivariate Normality to use the sample correlation coefficient to describe the linear trend?
- Write the null and alternative hypotheses for testing *r*. (Use statistical notation.)
- Why is it important to scrutinize the scatter plot before calculating *r*?
- **Vocabulary**: independent variable, dependent variable, scatterplot, *r*, ρ (rho), form, direction, strength, outlier, linearity, confounding, bivariate Normal, coefficient of determination (*r*^{2}), correlation matrix (exercise 14.10)

**14.1 Distinctions.** Identify which of the statements are true and which are false.

- Correlation coefficient *r* quantifies the relation between quantitative variables *X* and *Y*.
- Correlation coefficient *r* quantifies the *linear* relation between quantitative variables *X* and *Y*.
- The closer *r* is to +1, the stronger the linear relation between *X* and *Y*.
- The closer *r* is to -1 or +1, the stronger the linear relation between *X* and *Y*.
- If *r* is close to zero, *X* and *Y* are unrelated.
- If *r* is close to zero, *X* and *Y* are not related in a linear way.
- The value of *r* changes when the units of the data are changed.
- The value of *r* does not change when the units of measure are changed.
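Once you have committed to your answers, a small numerical experiment can be used to check them. The sketch below computes *r* from its definition for two made-up data sets (the numbers are illustrative assumptions, not data from this chapter): the same heights recorded in inches and in centimeters, and a perfectly curved relation in which Y is X squared. The function name `pearson_r` is likewise just an illustrative choice.

```python
from math import sqrt

def pearson_r(x, y):
    """Sample correlation coefficient r for paired lists x and y."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    ss_xx = sum((xi - mean_x) ** 2 for xi in x)
    ss_yy = sum((yi - mean_y) ** 2 for yi in y)
    ss_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return ss_xy / sqrt(ss_xx * ss_yy)

# Made-up illustrative data (not from the exercises).
height_in = [60, 62, 65, 70, 72]
weight_lb = [115, 120, 140, 160, 175]
height_cm = [h * 2.54 for h in height_in]  # same heights, different units

r_inches = pearson_r(height_in, weight_lb)
r_cm = pearson_r(height_cm, weight_lb)

# A perfect but non-linear relation: Y is X squared.
x_curve = [-2, -1, 0, 1, 2]
y_curve = [xi ** 2 for xi in x_curve]
r_curve = pearson_r(x_curve, y_curve)

print(f"r with height in inches: {r_inches:.4f}")
print(f"r with height in cm:     {r_cm:.4f}")
print(f"r for Y = X squared:     {r_curve:.4f}")
```

Compare the first two printed values with each other, and compare the third with how closely Y actually depends on X.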

**14.2 Memory of food intake**.

**14.3 Doll's ecological study of smoking and lung cancer.** In 1955, Richard Doll published an ecological study of smoking and lung cancer. Smoking was measured as per capita cigarette consumption in 1930 (

**(A)** Construct a scatterplot of the relation between cigarette consumption and lung cancer. Consider the form, direction, and strength of the relationship. Are there any outliers? [Hints: 1) Make certain you put the explanatory variable on the horizontal axis. 2) See *The APA Publication Guide* §3.75 - §3.77 for guidelines on figure production.]

**(B)** Calculate the correlation coefficient for the problem. Interpret this statistic.

**(C)** Test the correlation coefficient for significance. Show all hypothesis testing steps (null hypothesis statement, test statistic, *P* value, conclusion).

**(D)** Optional: Replicate all analyses in SPSS. Label your output. [Menu choices: `Graph > Scatter` and `Analyze > Correlate > Bivariate`.]

**(E)** What % of `LUNGCA` is "explained" by `CIG`?

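| i | COUNTRY | CIG | LUNGCA |
|---|---|---|---|
| 1 | USA | 1300 | 20 |
| 2 | Great Brit | 1100 | 46 |
| 3 | Finland | 1100 | 35 |
| 4 | Switzerland | 510 | 25 |
| 5 | Canada | 500 | 15 |
| 6 | Holland | 490 | 24 |
| 7 | Australia | 480 | 18 |
| 8 | Denmark | 380 | 17 |
| 9 | Sweden | 300 | 11 |
| 10 | Norway | 250 | 9 |
| 11 | Iceland | 230 | 6 |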
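For readers who want to check their hand calculations for parts (B) and (E) without SPSS, here is a minimal Python sketch. It assumes `CIG` is the explanatory variable and `LUNGCA` the response, and computes *r* from the sums of squares and cross-products. The names `ss_xx`, `ss_yy`, and `ss_xy` are illustrative choices, not notation taken from the exercise.

```python
from math import sqrt

# Data from the table above (Doll's ecological study).
cig = [1300, 1100, 1100, 510, 500, 490, 480, 380, 300, 250, 230]
lungca = [20, 46, 35, 25, 15, 24, 18, 17, 11, 9, 6]

n = len(cig)
mean_x = sum(cig) / n
mean_y = sum(lungca) / n

# Sums of squares and cross-products around the means.
ss_xx = sum((x - mean_x) ** 2 for x in cig)
ss_yy = sum((y - mean_y) ** 2 for y in lungca)
ss_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(cig, lungca))

r = ss_xy / sqrt(ss_xx * ss_yy)   # correlation coefficient
r_squared = r ** 2                # coefficient of determination

print(f"n = {n}")
print(f"r = {r:.3f}")
print(f"r^2 = {r_squared:.3f}")
```

The printed *r*^{2}, read as a percentage, is one way to answer part (E). The interpretation of form, direction, and strength still has to come from the scatterplot.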

**14.4 Sodium and blood pressure**.

**(A)** Which variable is the explanatory variable in this analysis? Which is the response variable?

**(B)** Construct a scatter plot of these data. (Use graph paper or a computer to generate your plot. Axis labels should conform to APA style guidelines.) *Discuss* your plot by considering its form, direction of association, and strength of association. Are there any outliers?

**(C)** Compute *r*. Interpret this statistic.

**(D)** What % of `BP` is explained by `SODIUM`?

**(E)** Test the correlation for significance. Show all hypothesis testing steps (null hypothesis, test statistic, *P* value, conclusion).

| i | SODIUM | BP |
|---|---|---|
| 1 | 6.8 | 154 |
| 2 | 7.0 | 167 |
| 3 | 6.9 | 162 |
| 4 | 7.2 | 175 |
| 5 | 7.3 | 190 |
| 6 | 7.0 | 158 |
| 7 | 7.0 | 166 |
| 8 | 7.5 | 195 |
| 9 | 7.3 | 189 |
| 10 | 7.1 | 186 |
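The significance test in part (E) can be sketched in Python as follows. The sketch assumes the usual hypotheses H0: ρ = 0 versus Ha: ρ ≠ 0 and the test statistic t = r√(n − 2) / √(1 − r²) with n − 2 degrees of freedom. To stay within the standard library, the two-sided *P* value is approximated by numerically integrating the *t* density with Simpson's rule. That is an implementation choice made here, standing in for a *t* table or a statistics package. The helper names `t_pdf` and `two_sided_p` are hypothetical.

```python
from math import gamma, pi, sqrt

# Data from the table above.
sodium = [6.8, 7.0, 6.9, 7.2, 7.3, 7.0, 7.0, 7.5, 7.3, 7.1]
bp = [154, 167, 162, 175, 190, 158, 166, 195, 189, 186]

def pearson_r(x, y):
    """Sample correlation coefficient r for paired lists x and y."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    ss_xx = sum((xi - mean_x) ** 2 for xi in x)
    ss_yy = sum((yi - mean_y) ** 2 for yi in y)
    ss_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return ss_xy / sqrt(ss_xx * ss_yy)

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return c * (1 + t * t / df) ** (-(df + 1) / 2)

def two_sided_p(t_stat, df, steps=2000):
    """Two-sided P value: 1 - 2 * (area under the t density from 0 to |t|).

    The area is approximated with Simpson's rule; steps must be even.
    """
    upper = abs(t_stat)
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return max(0.0, 1.0 - 2.0 * area)

n = len(sodium)
r = pearson_r(sodium, bp)
df = n - 2
t_stat = r * sqrt(df) / sqrt(1 - r ** 2)
p_value = two_sided_p(t_stat, df)

print(f"r = {r:.3f}, t = {t_stat:.2f}, df = {df}, P = {p_value:.5f}")
```

The conclusion step (whether the evidence against H0 is convincing) is left to the reader, as is checking the scatterplot before trusting the test.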

**14.5 Gravid iguanas.** Data on post-partum body weight (kilograms) and the number of eggs produced by gravid iguanas are shown below (Hampton, 1994, p. 157; iguana.sav).

**(A)** Construct a scatter plot of the data. (Make certain you put the explanatory variable on the horizontal axis.) Interpret your plot.

**(B)** Calculate the correlation coefficient. Interpret this statistic.

**(C)** Test the correlation coefficient for statistical significance.

| i | WEIGHT | EGGS |
|---|---|---|
| 1 | 0.90 | 33 |
| 2 | 1.55 | 50 |
| 3 | 1.30 | 46 |
| 4 | 1.00 | 33 |
| 5 | 1.55 | 53 |
| 6 | 1.80 | 57 |
| 7 | 1.50 | 44 |
| 8 | 1.05 | 31 |
| 9 | 1.70 | 60 |

**14.6 Graduation rates at Big Ten universities**. The most reliable factors that predict graduation are scholastic aptitude and motivation. To explore and quantify this, a researcher collects data on many factors. Data are stored in bigten.sav. Graduation rates by university (percentage of students graduating within 5 years of entry) are stored in the variable

**(A)** Plot these data. Interpret your plot.

**(B)** Calculate *r*. Interpret this statistic.

**(C)** Test it for statistical significance. Interpret your results.

**(D)** Calculate *r*^{2}. What does this tell you about the variability of graduation rates?

| UPERCENT | ACT |
|---|---|
| 76.2 | 27 |
| 57.6 | 24 |
| 55.4 | 24 |
| 59.7 | 23 |
| 86.0 | 28 |
| 46.2 | 22 |
| 66.7 | 23 |
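Part (D) asks what *r*^{2} says about variability. One way to see this is to compare *r*^{2} with the share of the variation in the response that a least-squares line accounts for. The sketch below does so under two assumptions that the exercise text does not state outright: that the first column of the table is `UPERCENT`, and that `ACT` is the explanatory variable. The fitted line and the names `slope`, `intercept`, and `sse` are used only for illustration.

```python
# Data from the table above: ACT score and graduation percentage.
act = [27, 24, 24, 23, 28, 22, 23]
upercent = [76.2, 57.6, 55.4, 59.7, 86.0, 46.2, 66.7]

n = len(act)
mean_x = sum(act) / n
mean_y = sum(upercent) / n

ss_xx = sum((x - mean_x) ** 2 for x in act)
ss_yy = sum((y - mean_y) ** 2 for y in upercent)   # total variation in the response
ss_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(act, upercent))

# Coefficient of determination computed directly as r squared.
r_squared = ss_xy ** 2 / (ss_xx * ss_yy)

# Least-squares line, then the variation left over around that line.
slope = ss_xy / ss_xx
intercept = mean_y - slope * mean_x
sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(act, upercent))

# Share of the total variation that the line accounts for.
explained_share = 1 - sse / ss_yy

print(f"r^2             = {r_squared:.4f}")
print(f"1 - SSE/SS_yy   = {explained_share:.4f}")
```

The two printed values agree, which is why *r*^{2} is read as the proportion of variability in the response "explained" by the explanatory variable.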

**14.7 Occupational study of smoking and lung cancer.** An occupational health study in England looked at the relation between cigarettes smoked and lung cancer mortality in 25 different occupational groups. The explanatory variable (

**(A)** Plot SMR against SMOKING. Interpret the plot (i.e., consider its form, direction, strength, and if any outliers are present).

**(B)** Compute the correlation coefficient and interpret this result.

**(C)** Test the correlation for significance. State the null hypothesis, test statistic, its *df*, and *P* value. State your conclusion.

**14.8 Maternal mortality and health care during birth**. This study explored the relation between the
percentage of births attended by physicians, nurses, and midwives (

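| COUNTRY | ATTENDED | MAT_MORT |
|---|---|---|
| Bangladesh | 5 | 600 |
| Chile | 98 | 67 |
| Iran | 70 | 120 |
| Kenya | 50 | 170 |
| Nepal | 6 | 830 |
| Netherlands | 100 | 10 |
| Nigeria | 37 | 800 |
| Pakistan | 35 | 500 |
| Panama | 96 | 60 |
| United States | 99 | 8 |
| Vietnam | 95 | 120 |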

**(A)** What is the independent variable in this analysis? What is the dependent variable?

**(B)** Plot the data as a scatterplot. Interpret what you see (form, direction, strength, outliers if any). Make certain your plot is accurate and labeled in a way that is kind to your reader. [You are encouraged to use computational tools when analyzing your data.]

**(C)** Calculate *r*. Interpret this statistic.

**(D)** Test the correlation for statistical significance. Show all hypothesis testing steps.

**(E)** Identify lurking variables that may confound the observed relationship. Explain how confounding may occur.

**14.9 Need and demand for mental health care.** This example uses data from an 1854 study on mental health care in the fourteen counties of Massachusetts in the prior century. The study was conducted by Edward Jarvis, then president of the American Statistical Association. The explanatory variable is the reciprocal of the distance (in miles

**(A)** Create a scatterplot of the relation between PHOME and REC_DIST. Describe the relationship. Are there any outliers?

**(B)** Calculate the correlation coefficient using all 14 data points.

**(C)** Nantucket is clearly an outlier in this data set. Remove this outlier from the dataset and recalculate the correlation coefficient. Did this improve the "fit" of the correlation model?

| COUNTY | PHOME | REC_DIST |
|---|---|---|
| BERKSHIRE | 77.00 | .01031 |
| FRANKLIN | 81.00 | .01613 |
| HAMPSHIRE | 75.00 | .01852 |
| HAMPDEN | 69.00 | .01923 |
| WORCESTER | 64.00 | .05000 |
| MIDDLESEX | 47.00 | .07143 |
| ESSEX | 47.00 | .10000 |
| SUFFOLK | 6.00 | .25000 |
| NORFOLK | 49.00 | .07143 |
| BRISTOL | 60.00 | .07143 |
| PLYMOUTH | 68.00 | .06250 |
| BARNSTABLE | 76.00 | .02273 |
| NANTUCKET | 25.00 | .01299 |
| DUKES | 79.00 | .01923 |
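Parts (B) and (C) compare *r* with and without a single county. A minimal Python sketch of that comparison follows, assuming `REC_DIST` is the explanatory variable and `PHOME` the response, as the exercise text suggests. The helper names `pearson_r` and `r_for` are illustrative.

```python
from math import sqrt

# (COUNTY, PHOME, REC_DIST) rows from the table above.
data = [
    ("BERKSHIRE", 77.0, 0.01031),
    ("FRANKLIN", 81.0, 0.01613),
    ("HAMPSHIRE", 75.0, 0.01852),
    ("HAMPDEN", 69.0, 0.01923),
    ("WORCESTER", 64.0, 0.05000),
    ("MIDDLESEX", 47.0, 0.07143),
    ("ESSEX", 47.0, 0.10000),
    ("SUFFOLK", 6.0, 0.25000),
    ("NORFOLK", 49.0, 0.07143),
    ("BRISTOL", 60.0, 0.07143),
    ("PLYMOUTH", 68.0, 0.06250),
    ("BARNSTABLE", 76.0, 0.02273),
    ("NANTUCKET", 25.0, 0.01299),
    ("DUKES", 79.0, 0.01923),
]

def pearson_r(x, y):
    """Sample correlation coefficient r for paired lists x and y."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    ss_xx = sum((xi - mean_x) ** 2 for xi in x)
    ss_yy = sum((yi - mean_y) ** 2 for yi in y)
    ss_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return ss_xy / sqrt(ss_xx * ss_yy)

def r_for(rows):
    """Correlation between REC_DIST (explanatory) and PHOME (response)."""
    rec_dist = [row[2] for row in rows]
    phome = [row[1] for row in rows]
    return pearson_r(rec_dist, phome)

r_all = r_for(data)
r_without = r_for([row for row in data if row[0] != "NANTUCKET"])

print(f"r, all 14 counties:   {r_all:.3f}")
print(f"r, without Nantucket: {r_without:.3f}")
```

Filtering by county name, not by row position, keeps the comparison readable and makes it easy to try dropping a different county. Whether removing a point is justified is a substantive question that the arithmetic alone cannot answer.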

**14.10 Cancer correlates**

District of Columbia in 1960 and death rates for various forms of cancer. The data, originally from Fraumeni et al. (1968), can be downloaded as an SPSS data set or text file by right-clicking on the highlighted text. Use SPSS to calculate correlation coefficients for each variable pairing. Interpret the correlation coefficients. Which cancers are associated with smoking?

| Variable | Description |
|---|---|
| CIG | cigarettes sold per capita |
| BLAD | bladder cancer deaths per 100,000 |
| LUNG | lung cancer deaths per 100,000 |
| KID | kidney cancer deaths per 100,000 |
| LEUK | leukemia cancer deaths per 100,000 |

**14.11 Atherosclerotic heart disease as a function of fat calories** (fat_cal.sav). Following World War II, it became clear that northern European countries with high dietary fat consumption were experiencing notable increases in what was then called degenerative heart disease. Data in this exercise are a fictionalized version of data from early ecological studies reported by Keys (1952, also see EKS p. 195). Data for calories from fat as a % of total calories (FAT_CAL) and CHD mortality per 1000 50- to 59-year-olds are:

| COUNTRY | FAT_CAL | CHD |
|---|---|---|
| Japan | 8 | 0.5 |
| Italy | 20 | 1.4 |
| England | 33 | 3.8 |
| Australia | 36 | 5.5 |
| Canada | 37 | 5.7 |
| USA | 39 | 7.1 |

(A) Which of the variables in this data set is the independent variable? Which is the dependent (response) variable?

(B) Plot the data.

(C) Can the relation be described with a straight line?

(D) ...to be continued...
