14: Correlation  3/14/07

Review Questions

  1. Whereas ANOVA is used to analyze the relationship between a categorical explanatory variable and quantitative response variable, correlation and regression are used to analyze the relation between a _________________ explanatory variable and ____________________ response variable.
  2. What symbol denotes the correlation coefficient in the data? 
  3. What symbol denotes the correlation coefficient parameter?
  4. A t statistic for a correlation coefficient has this many degrees of freedom.
  5. What is bivariate Normality?
  6. What is the range of possible values for r?
  7. Assuming linearity, a correlation coefficient of 0.79 means the correlation is __________ [choices: positive, negative, non-existent] and ___________________ [choices: weak, moderate, strong]
  8. Assuming linearity, a correlation coefficient of -0.25 means the correlation is __________ and ___________________.
  9. Why not calculate r when data are not linear?
  10. Do you have to assume bivariate Normality to test the correlation coefficient for significance? 
  11. Do you have to assume bivariate Normality to use the sample correlation coefficient to describe the linear trend? 
  12. Write the null and alternative hypotheses for testing r. (Use statistical notation.)
  13. Why is it important to scrutinize the scatter plot before calculating r?
  14. Vocabulary: independent variable, dependent variable, scatterplot, r, ρ (rho), form, direction, strength, outlier, linearity, confounding, bivariate Normal, coefficient of determination (r²), correlation matrix (exercise 14.10)

Exercises

14.1 Distinctions. Identify which of the statements are true and which are false.

  1. Correlation coefficient r quantifies the relation between quantitative variables X and Y.
  2. Correlation coefficient r quantifies the linear relation between quantitative variables X and Y. 
  3. The closer r is to +1, the stronger the linear relation between X and Y. 
  4. The closer r is to -1 or +1, the stronger the linear relation between X and Y. 
  5. If r is close to zero, X and Y are unrelated.
  6. If r is close to zero, X and Y are not related in a linear way.
  7. The value of r  changes when the units of the data are changed.
  8. The value of r does not change when the units of measure are changed.

14.2 Memory of food intake.  Retrospective studies on diet and health often rely on recall of distant dietary histories for at least part of their data. It is well known that the accuracy of such information is suspect. To study this issue, Dwyer et al. (1989) asked middle-aged adults (median age 50) to recall food intakes at ages 5-7 years, 18 years, and 30 years using food frequency questionnaires. This information was then compared to historical information collected during these earlier time periods. Correlation between recalled and historical consumption for foods and food groups rarely exceeded r = 0.3. Based on this information, what do you conclude about the reliability of memory as a means to measure past food intake?

14.3 Doll's ecological study of smoking and lung cancer.  In 1955, Richard Doll published an ecological study of smoking and lung cancer. Smoking was measured as per capita cigarette consumption in 1930 (CIG). Lung cancer mortality was measured per 100,000 person-years in 1950 (LUNGCA). Data may be downloaded as doll-ecol.sav and are shown in the table below. (There are 11 observations.) 

(A) Construct a scatterplot of the relation between cigarette consumption and lung cancer. Consider the form, direction, and strength of the relationship. Are there any outliers? [Hints: 1) Make certain you put the explanatory variable on the horizontal axis. 2) See the APA Publication Guide §3.75 - §3.77 for guidelines on figure production.] 
(B) Calculate the correlation coefficient for the problem. Interpret this statistic.
(C) Test the correlation coefficient for significance. Show all hypothesis testing steps (null hypothesis statement, test statistic, P value, conclusion).
(D) Optional: Replicate all analyses in SPSS. Label your output. [Menu choices: Graph > Scatter and Analyze > Correlate > Bivariate.] 
(E) What % of LUNGCA is "explained" by CIG?

 i   COUNTRY       CIG    LUNGCA
 1   USA          1300      20
 2   Great Brit   1100      46
 3   Finland      1100      35
 4   Switzerland   510      25
 5   Canada        500      15
 6   Holland       490      24
 7   Australia     480      18
 8   Denmark       380      17
 9   Sweden        300      11
10   Norway        250       9
11   Iceland       230       6
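
[Optional computing note: for readers working outside SPSS, the following is a minimal Python sketch of the arithmetic behind parts (B), (C), and (E), using only the standard library. The helper names pearson_r and t_statistic are illustrative assumptions, not part of any course software. The P value is not computed here; obtain it from a t table or statistical software using the reported df.]

```python
import math

# Data copied from the table above (11 countries).
cig = [1300, 1100, 1100, 510, 500, 490, 480, 380, 300, 250, 230]
lungca = [20, 46, 35, 25, 15, 24, 18, 17, 11, 9, 6]


def pearson_r(x, y):
    """Sample correlation coefficient: r = SS_xy / sqrt(SS_xx * SS_yy)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    ss_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_xx = sum((a - mean_x) ** 2 for a in x)
    ss_yy = sum((b - mean_y) ** 2 for b in y)
    return ss_xy / math.sqrt(ss_xx * ss_yy)


def t_statistic(r, n):
    """t statistic for testing H0: rho = 0, returned with its df (n - 2)."""
    df = n - 2
    return r * math.sqrt(df / (1 - r ** 2)), df


r = pearson_r(cig, lungca)
t, df = t_statistic(r, len(cig))
print(f"r = {r:.3f}, r^2 = {r ** 2:.3f}, t = {t:.2f}, df = {df}")
```

The same two helpers can be reused for the other exercises by swapping in the relevant data columns.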

14.4 Sodium and blood pressure. Data (n = 10) on daily SODIUM intake (mg) and systolic blood pressure (BP; mm Hg) are stored in na-bp.sav and are shown below. 

(A) Which variable is the explanatory variable in this analysis? Which is the response variable?
(B) Construct a scatter plot of these data. (Use graph paper or a computer to generate your plot. Axis labels should conform to APA style guidelines.) Discuss your plot by considering its form, direction of association, and strength of association. Are there any outliers?
(C) Compute r. Interpret this statistic. 
(D) What % of BP is explained by SODIUM?
(E) Test the correlation for significance. Show all hypothesis testing steps (null hypothesis, test statistic, P value, conclusion).

    i   SODIUM       BP
    1      6.8      154
    2      7.0      167
    3      6.9      162
    4      7.2      175
    5      7.3      190
    6      7.0      158
    7      7.0      166
    8      7.5      195
    9      7.3      189
   10      7.1      186

14.5 Gravid iguanas. Data on post-partum body weight (kilograms) and the number of eggs produced by gravid iguanas are shown below (Hampton, 1994, p. 157; iguana.sav).  

(A) Construct a scatter plot of the data. (Make certain you put the explanatory variable on the horizontal axis.) Interpret your plot.  
(B) Calculate the correlation coefficient. Interpret this statistic.
(C) Test the correlation coefficient for statistical significance. 

i   WEIGHT      EGGS
1     0.90       33
2     1.55       50
3     1.30       46
4     1.00       33
5     1.55       53
6     1.80       57
7     1.50       44
8     1.05       31
9     1.70       60

 
14.6 Graduation rates at Big Ten universities. The most reliable factors that predict graduation are scholastic aptitude and motivation. To explore and quantify this, a researcher collects data on many factors. Data are stored in bigten.sav. Graduation rates by university (percentage of students graduating within 5 years of entry) are stored in the variable UPERCENT. The average ACT score of incoming freshmen (ACT) is the predictor variable for this analysis. 

(A) Plot these data. Interpret your plot. 
(B) Calculate r. Interpret this statistic.
(C) Test it for statistical significance. Interpret your results.
(D) Calculate r². What does this tell you about the variability of graduation rates?

UPERCENT   ACT
76.2        27
57.6        24
55.4        24
59.7        23
86.0        28
46.2        22
66.7        23

14.7 Occupational study of smoking and lung cancer. An occupational health study in England looked at the relation between cigarettes smoked and lung cancer mortality in 25 different occupational groups. The explanatory variable (SMOKING) was standardized to 100 when men in the group had typical smoking rates for their age. The response variable was the standardized mortality ratio (SMR) for lung cancer mortality in that occupational group. Data are available at lib.stat.cmu.edu/DASL/Datafiles/SmokingandCancer.html and are stored online in the file occupational_smr.sav.

(A) Plot SMR against SMOKING. Interpret the plot (i.e., consider its form, direction, strength, and if any outliers are present).
(B) Compute the correlation coefficient and interpret this result.
(C) Test the correlation for significance. State the null hypothesis, test statistic, its df, and P value. State your conclusion.

14.8 Maternal mortality and health care during birth. This study explored the relation between the percentage of births attended by physicians, nurses, and midwives (ATTENDED) and maternal mortality per 100,000 live births (MAT_MORT). The values for a random sample of 11 countries are shown below and are stored online in ../datasets/mat_mort.sav. Data are a sample from Pagano & Gauvreau (2000, p. 407), as originally published by the United Nations Children's Fund (1994). 

COUNTRY    ATTENDED MAT_MORT
Bangladesh       5   600
Chile           98    67
Iran            70   120
Kenya           50   170
Nepal            6   830
Netherlands    100    10
Nigeria         37   800
Pakistan        35   500
Panama          96    60
United States   99     8
Vietnam         95   120

(A) What is the independent variable in this analysis? What is the dependent variable? 
(B) Plot the data as a scatterplot. Interpret what you see (form, direction, strength, outliers if any). Make certain your plot is accurate and labeled in a way that is kind to your reader. [You are encouraged to use computational tools when analyzing your data.] 
(C) Calculate r. Interpret this statistic.
(D) Test the correlation for statistical significance. Show all hypothesis testing steps.
(E) Identify lurking variables that may confound the observed relationship. Explain how confounding may occur. 

14.9 Need and demand for mental health care. This example uses data from an 1854 study of mental health care in the fourteen counties of Massachusetts. The study was conducted by Edward Jarvis, then president of the American Statistical Association. The explanatory variable is the reciprocal of the distance (in miles⁻¹) to the nearest mental health care center (REC_DIST). The response variable is the percent of patients cared for in the home (PHOME). The relation between the percentage of patients cared for at home and distance to the nearest health care center remains important today: it is still recommended that numerous small mental hospitals be erected at scattered locations rather than having one large central facility. [Source: http://lib.stat.cmu.edu/DASL/Stories/lunatics.html and http://lib.stat.cmu.edu/DASL/Datafiles/lunaticsdat.html]

(A) Create a scatterplot of the relation between PHOME and REC_DIST. Describe the relationship. Are there any outliers?
(B) Calculate the correlation coefficient using all 14 data points.
(C) Nantucket is clearly an outlier in this data set. Remove this outlier from the dataset and recalculate the correlation coefficient. Did this improve the "fit" of the correlation model?

COUNTY        PHOME   REC_DIST
BERKSHIRE     77.00    .01031
FRANKLIN      81.00    .01613
HAMPSHIRE     75.00    .01852
HAMPDEN       69.00    .01923
WORCESTER     64.00    .05000
MIDDLESEX     47.00    .07143
ESSEX         47.00    .10000
SUFFOLK        6.00    .25000
NORFOLK       49.00    .07143
BRISTOL       60.00    .07143
PLYMOUTH      68.00    .06250
BARNSTABLE    76.00    .02273
NANTUCKET     25.00    .01299
DUKES         79.00    .01923
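
[Optional computing note: parts (B) and (C) ask for the same calculation on two versions of the data. A minimal Python sketch of that comparison follows, using only the standard library. The helper names pearson_r and r_excluding are illustrative assumptions. Which observation to exclude is passed in by name, so the same sketch can be used to check the influence of any other county.]

```python
import math

# (COUNTY, PHOME, REC_DIST) rows copied from the table above.
rows = [
    ("BERKSHIRE", 77.0, 0.01031), ("FRANKLIN", 81.0, 0.01613),
    ("HAMPSHIRE", 75.0, 0.01852), ("HAMPDEN", 69.0, 0.01923),
    ("WORCESTER", 64.0, 0.05000), ("MIDDLESEX", 47.0, 0.07143),
    ("ESSEX", 47.0, 0.10000), ("SUFFOLK", 6.0, 0.25000),
    ("NORFOLK", 49.0, 0.07143), ("BRISTOL", 60.0, 0.07143),
    ("PLYMOUTH", 68.0, 0.06250), ("BARNSTABLE", 76.0, 0.02273),
    ("NANTUCKET", 25.0, 0.01299), ("DUKES", 79.0, 0.01923),
]


def pearson_r(x, y):
    """Sample correlation coefficient: r = SS_xy / sqrt(SS_xx * SS_yy)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    ss_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_xx = sum((a - mean_x) ** 2 for a in x)
    ss_yy = sum((b - mean_y) ** 2 for b in y)
    return ss_xy / math.sqrt(ss_xx * ss_yy)


def r_excluding(rows, excluded=()):
    """Return (r, n) for REC_DIST vs PHOME after dropping the named counties."""
    kept = [row for row in rows if row[0] not in excluded]
    rec_dist = [row[2] for row in kept]  # explanatory variable
    phome = [row[1] for row in kept]     # response variable
    return pearson_r(rec_dist, phome), len(kept)


r_all, n_all = r_excluding(rows)
r_drop, n_drop = r_excluding(rows, excluded={"NANTUCKET"})
print(f"all counties:      n = {n_all}, r = {r_all:.3f}")
print(f"without Nantucket: n = {n_drop}, r = {r_drop:.3f}")
```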

14.10 Cancer correlates. Statistical packages are able to calculate correlations for multiple pairings of variables, often reporting their findings in a correlation matrix. Correlation matrices report correlation coefficients for all pairings of quantitative variables. We are going to create a correlation matrix for the per capita numbers of cigarettes smoked (sold) in 43 states and the District of Columbia in 1960 and death rates for various forms of cancer. The data, originally from Fraumeni et al. (1968), can be downloaded as an SPSS data set or text file. Use SPSS to calculate correlation coefficients for each variable pairing. Interpret the correlation coefficients. Which cancers are associated with smoking?

Variable   Description
CIG        cigarettes sold per capita
BLAD       bladder cancer deaths per 100,000
LUNG       lung cancer deaths per 100,000
KID        kidney cancer deaths per 100,000
LEUK       leukemia deaths per 100,000
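
[Optional computing note: a correlation matrix is simply r computed for every pairing of variables and laid out in a grid. The Python sketch below illustrates the layout using only the standard library. The numbers are PLACEHOLDER values invented for illustration and are not the Fraumeni et al. data; only three of the five variables are shown. The helper names pearson_r and correlation_matrix are illustrative assumptions.]

```python
import math

# PLACEHOLDER values for illustration only. These are NOT the Fraumeni et al.
# (1968) data; substitute the downloaded columns before interpreting anything.
data = {
    "CIG": [10, 20, 30, 40, 50],
    "BLAD": [1, 3, 2, 5, 4],
    "LUNG": [5, 9, 16, 19, 26],
}


def pearson_r(x, y):
    """Sample correlation coefficient: r = SS_xy / sqrt(SS_xx * SS_yy)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    ss_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_xx = sum((a - mean_x) ** 2 for a in x)
    ss_yy = sum((b - mean_y) ** 2 for b in y)
    return ss_xy / math.sqrt(ss_xx * ss_yy)


def correlation_matrix(columns):
    """Return {name_a: {name_b: r}} for every pairing of the given columns."""
    names = list(columns)
    return {a: {b: pearson_r(columns[a], columns[b]) for b in names} for a in names}


matrix = correlation_matrix(data)

# Print the grid with one row and one column per variable.
names = list(data)
print(" " * 6 + "".join(f"{name:>8}" for name in names))
for a in names:
    print(f"{a:<6}" + "".join(f"{matrix[a][b]:8.3f}" for b in names))
```

The diagonal is always 1 (each variable correlates perfectly with itself) and the grid is symmetric, so only the entries above or below the diagonal need to be interpreted.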

14.11 Atherosclerotic heart disease as a function of fat calories (fat_cal.sav). Following World War II, it became clear that northern European countries with high dietary fat consumption were experiencing notable increases in what was then called degenerative heart disease. Data in this exercise are a fictionalized version of data from early ecological studies reported by Keys (1952, also see EKS p. 195). Data for calories from fat as a % of total calories (FAT_CAL) and CHD mortality per 1000 50- to 59-year-olds (CHD) are: 

COUNTRY     FAT_CAL   CHD
Japan           8     0.5
Italy          20     1.4
England        33     3.8
Australia      36     5.5
Canada         37     5.7
USA            39     7.1

(A) Which of the variables in this data set is the independent variable? Which is the dependent (response) variable?
(B) Plot the data. 
(C) Can the relation be described with a straight line?
(D) ...to be continued...
