
In the preceding chapter we considered the relation between a continuous dependent variable (*Y*) and a categorical independent variable (*X*)
using ANOVA. In this chapter we consider the relation between two continuous variables, *X* and *Y*, using correlation and regression.

Although correlation and regression are used for similar purposes, some distinctions exist. Correlation quantifies the relation between
variable *X* and variable *Y* with a unit-free correlation coefficient. Regression, on the other hand, quantifies the relation between *X* and
*Y* by predicting the average change in *Y* per unit *X*. Therefore, the choice between correlation and regression depends on the
questions we ask of the data. When the researcher wants to measure the relation between *X* and *Y* in unit-free terms, correlation is
used. When the researcher wants to predict the average change in *Y* per unit *X*, regression is used.

Other distinctions between correlation and regression exist. A sample may be selected so that the distribution of independent variable
*X* is fixed by the researcher or so that it is left free to vary. For example, in a study of age and blood pressure, a researcher may select a
fixed number of people from various age categories (e.g., five 40-year-olds, five 50-year-olds, and so on), thus fixing the distribution
of *X*. This is called fixed-effects or Model I sampling. In contrast, the study may sample people at random, determining their age after
they are selected. This is called a random-effects model or Model II. Regression can be used with fixed-effects sampling and
random-effects sampling. Correlation should be used only with random-effects sampling.

**Illustrative Data.** The illustrative data set for this chapter (`BICYCLE`; Perales & Gerstman, 1995) lists, for 12 schools, the percentage of children receiving free lunches (`X`, a marker of low socioeconomic status) and the percentage of bicycle riders wearing helmets (`Y`):

`REC  SCHOOL$         X      Y`
`---  ------------   ---   -----`
`  1  Fair Oaks       50   22.1`
`  2  Strandwood      11   35.9`
`  3  Walnut Acres     2   57.9`
`  4  Discov. Bay     19   22.2`
`  5  Belshaw         26   42.4`
`  6  Kennedy         73    5.8`
`  7  Cassell         81    3.6`
`  8  Miner           51   21.4`
`  9  Sedgewick       11   55.2`
` 10  Sakamoto         2   33.3`
` 11  Toyon           19   32.4`
` 12  Lietz           25   38.4`

The first step in the analysis should be to plot the data in the form of a scatter plot. To draw a scatter plot with *Epi Info*, first `READ` the
data set into the current session and then issue the command:

`EPI6> SCATTER <X> <Y>`

where `<X>` represents the name of the independent variable and `<Y>` represents the name of the dependent variable. For example, to
draw a scatter plot for the illustrative data, the following commands are issued:

`EPI6> READ BICYCLE`

`EPI6> SCATTER X Y`

The scatter plot produced by *Epi Info* (not shown here) displays an evident inverse correlation between *X* and *Y*.

Comment: Observations that do not fit the general pattern of the data are called *outliers*. Identifying and dealing with outliers is an important although often misunderstood statistical undertaking. In some instances outliers should be excluded, while in others they should remain with the data. When a researcher chooses to leave an outlier in the analysis, it should be carefully scrutinized -- often there is something to be learned from the outlier. For insights into how to deal with outliers, see www.tufts.edu/~gdallal/out.htm (Kruskal, 1960).

Pearson's product-moment correlation coefficient (*r*) quantifies the relation between *X* and *Y* in unit-free terms. When *r* ≈ 0, there is no
linear correlation between *X* and *Y*. When all points fall on a straight line with an upward slope, *r* = +1. When all points fall on a
straight line with a downward slope, *r* = -1.
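The coefficient can be computed directly from its definition, *r* = S_XY / √(S_XX · S_YY), where S_XX, S_YY, and S_XY denote the sums of squares and cross-products. The Python sketch below (the variable names are my own, not *Epi Info*'s) reproduces the coefficient for the illustrative data:

```python
import math

# Illustrative data: X = % of children receiving free lunches,
# Y = % of bicycle riders wearing helmets, at 12 schools.
x = [50, 11, 2, 19, 26, 73, 81, 51, 11, 2, 19, 25]
y = [22.1, 35.9, 57.9, 22.2, 42.4, 5.8, 3.6, 21.4,
     55.2, 33.3, 32.4, 38.4]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Sums of squares and cross-products about the means
sxx = sum((xi - mean_x) ** 2 for xi in x)
syy = sum((yi - mean_y) ** 2 for yi in y)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

r = sxy / math.sqrt(sxx * syy)
print(round(r, 2))       # -0.85, matching the Epi Info output below
print(round(r ** 2, 2))  # 0.72, the coefficient of determination
```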

The correlation coefficient is calculated with the command issued as follows:

`EPI6> REGRESS <Y> <X>`

where `<X>` is the name of the independent variable and `<Y>` is the name of the dependent variable. Notice that the order of `<X>` and
`<Y>` in the `REGRESS` command is opposite that of the `SCATTER` command!

For the illustrative data, the output is:

`Correlation coefficient: r = -0.85`

` r^2 = 0.72`

`95% confidence limits: -0.96 < R < -0.54`

Notes:

(1) The above output shows *r* = -0.85, indicating a strong negative correlation between *X* and *Y*.

(2) The 95% confidence interval locates the correlation coefficient parameter with 95% confidence. Report this as "the 95% confidence interval for *r* is (-0.96, -0.54)."

(3) The statistic labeled r^2 is the *coefficient of determination*. This represents the proportion of the variance in one variable explained by the other. The illustrative example demonstrates a coefficient of determination of 0.72, suggesting 72% of the variance in helmet use is explained by socioeconomic status.

The regression model predicts the value of *Y* for a given value of *X* according to the equation:

*Y* = (INTERCEPT) + (SLOPE)*X* + (RANDOM ERROR)

Recognizing the above equation as that of a line, the INTERCEPT identifies where the line crosses the *Y* axis and the SLOPE represents
the line's incline, or "change in *Y* per unit *X*." The RANDOM ERROR term allows for statistical scatter around the line and is assumed
to be normally distributed with a mean of 0 and variance of σ². Thus, the regression equation describes a line that travels through the
scatter cloud while accounting for random error in the data.

So how does one calculate the line's slope and intercept? The method we use minimizes the sum of the squared RANDOM ERROR values
around the line; the result is thus called the *least squares line*.
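For simple linear regression the least-squares solution has a closed form: SLOPE = S_XY / S_XX and INTERCEPT = mean(*Y*) - SLOPE × mean(*X*). The Python sketch below (an illustration of the formulas, not of *Epi Info*'s internals) reproduces the coefficients for the illustrative data:

```python
# Closed-form least-squares fit for simple linear regression:
#   slope b = Sxy / Sxx,  intercept a = mean(y) - b * mean(x)
x = [50, 11, 2, 19, 26, 73, 81, 51, 11, 2, 19, 25]   # % free lunch
y = [22.1, 35.9, 57.9, 22.2, 42.4, 5.8, 3.6, 21.4,
     55.2, 33.3, 32.4, 38.4]                          # % helmet use

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n
sxx = sum((xi - mean_x) ** 2 for xi in x)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

b = sxy / sxx            # slope: about -0.5386
a = mean_y - b * mean_x  # intercept: about 47.49

# Predicted helmet use where half the children get free lunches (X = 50).
# About 20.6 here; the text's 20.5 comes from using the rounded coefficients.
y_hat = a + b * 50
```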

The least-squares line is calculated by the `REGRESS` command, with coefficients reported as follows:

`                         B              95% confidence`
`Variable      Mean       coefficient    Lower        Upper       Std Error   Partial F-test`
`X             30.8333    -0.5386091     -0.746144    -0.331074   0.105885    25.8748`
`Y-Intercept              47.4904464`

Notes:

(1) The `B coefficient` for `X` is the slope of the model. It predicts the average change in *Y* per unit *X*. For the illustrative example, the slope of -0.54 predicts a 0.54 *decrease* in *Y* with each unit increase in *X*.

(2) The `B coefficient` for the `Y-Intercept` of this model is 47.4904. Therefore, the regression model is: *Y* = 47.49 + (-0.54)*x _{i}* + (RANDOM ERROR).

(3) This equation can be used to predict helmet-use rates in various communities. For example, in a neighborhood in which half the children receive free lunches (i.e., *X* = 50), *Y* = 47.49 + (-0.54)(50) = 20.5.

(4) A 95% confidence interval for b_{1} is provided by the `REGRESS` command. For the illustrative example, the 95% CI for b_{1} is (-0.746, -0.331).

Exploration of scatter plots can prevent the reporting of nonsensical results. Consider the four data sets known as Anscombe's quartet (Anscombe, 1973):

`       Data Set I       Data Set II      Data Set III     Data Set IV`
`        X      Y         X      Y         X      Y         X      Y`
`      10.0   8.04      10.0   9.14      10.0   7.46       8.0   6.58`
`       8.0   6.95       8.0   8.14       8.0   6.77       8.0   5.76`
`      13.0   7.58      13.0   8.74      13.0  12.74       8.0   7.71`
`       9.0   8.81       9.0   8.77       9.0   7.11       8.0   8.84`
`      11.0   8.33      11.0   9.26      11.0   7.81       8.0   8.47`
`      14.0   9.96      14.0   8.10      14.0   8.84       8.0   7.04`
`       6.0   7.24       6.0   6.13       6.0   6.08       8.0   5.25`
`       4.0   4.26       4.0   3.10       4.0   5.39      19.0  12.50`
`      12.0  10.84      12.0   9.13      12.0   8.15       8.0   5.56`
`       7.0   4.82       7.0   7.26       7.0   6.42       8.0   7.91`
`       5.0   5.68       5.0   4.74       5.0   5.73       8.0   6.89`

Identical correlation and regression statistics are derived for each of these data sets (INTERCEPT = 3, SLOPE = 0.5, *r* = +0.82, etc.).
However, scatter plots reveal distinct relations (Fig.). Data set I demonstrates the scatter typical of a positive linear relation.
Data set II demonstrates a smooth nonlinear curve. Data set III demonstrates a perfect linear relation with one outlier. Data set IV demonstrates no
variability in *X* with the exception of an outlier in the upper right quadrant.
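The quartet's identical summary statistics can be checked numerically. This Python sketch (my own, using the closed-form least-squares formulas from earlier in the chapter) fits each data set and confirms that the slope, intercept, and *r* agree to rounding:

```python
import math

def fit(x, y):
    """Return (slope, intercept, r) for a simple least-squares line."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return b, my - b * mx, sxy / math.sqrt(sxx * syy)

# Anscombe's quartet: sets I-III share the same X values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = [
    (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    (x4,   [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
]

results = [fit(x, y) for x, y in quartet]
for b, a, r in results:
    print(round(b, 2), round(a, 1), round(r, 2))  # 0.5 3.0 0.82 for every set
```

Only a plot distinguishes the four data sets, which is the point of the example.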

Another example is provided by the data in `WATER.ZIP` (Dean, 1942). Data in this file are from a study of water fluoridation and
cavity rates. The variables are `FLUORIDE` (parts per million) and `CARIES` (dental caries per 100 children). A scatter plot of the data
(not shown here) reveals a decline in cavity rates as fluoride levels increase up to 1 *ppm*. There is an outlier in the lower left quadrant (record 21).
Although unmodified regression should not be applied to this pattern, we still have the following options:

- Remain completely narrative, i.e., describe what you see -- in this case, an inverse curvilinear relation with one outlier.
- Exclude the outlier and then straighten out the relation with, say, a logarithmic transformation of one or both of the variables.
- Eliminate the outlier (the command is `SELECT RECNUMBER <> 21`) and then restrict the range of the analysis to the interval that demonstrates linearity.

The last option is achieved by issuing the command `SELECT FLUORIDE <= 1` before issuing the `REGRESS CARIES FLUORIDE`
command. This produces the following output:

`Correlation coefficient: r = -0.92`

` r^2 = 0.86`

`95% confidence limits: -0.98 < R < -0.76`

`Source df Sum of Squares Mean Square F-statistic`

`Regression 1 248393.9485 248393.9485 65.13`

`Residuals 11 41950.3592 3813.6690`

`Total 12 290344.3077`

`B Coefficients`

`                         B               95% confidence`
`Variable      Mean       coefficient     Lower         Upper        Std Error   Partial F-test`
`FLUORIDE      0.2615     -528.0656304    -656.311967   -399.81929   65.431804   65.1325`
`Y-Intercept              780.3402418`

The slope of -528 suggests that one *ppm* of fluoride is associated with an expected reduction of 528 cavities per 100 children over the
range 0 - 1 *ppm*. Assuming linearity, this is equivalent to a reduction of 52.8 per 100 children for each 0.1 *ppm* of fluoride. The coefficient of
determination (*r*^{2}) of 0.86 suggests that 86% of the variance in caries rates in this range is explained by fluoride level.

**(1) BIGTEN.ZIP**. *Graduation Rates at Big Ten Universities* (Data from Berk, 1994, p. 82). Download and unzip the data set BIGTEN
and then explore the relation between five-year graduation rates (`UPERCENT`) and the scores of incoming freshmen on the ACT exam
(`ACT`).

**(2) IGUANA.ZIP.** *Iguana Eggs Over Easy* (Data from Hampton, 1994, p. 157). Listed below are data representing the body weight
and number of eggs produced by 9 gravid female iguanas. Analyze the data using both correlation and regression techniques.
Remember to interpret your findings.

`ID   Weight   Eggs`
`---  ------   ----`
` 1    0.90     33`
` 2    1.55     50`
` 3    1.30     46`
` 4    1.00     33`
` 5    1.55     53`
` 6    1.80     57`
` 7    1.50     44`
` 8    1.05     31`
` 9    1.70     60`

**(3) ALCOHOL.ZIP**. *Alcohol Consumption Survey* (Data from Monder, 1986). Download and unzip the data set `ALCOHOL.REC`
(used in the previous chapter) and determine the linear relation between alcohol consumption (`ALCS`) and `AGE` using correlation and
regression techniques. Because the scatter plot is difficult to interpret, assume the relation is linear, free of outliers, and meets
the other regression assumptions.

Anscombe, F. J. (1973). Graphs in statistical analysis. American Statistician, 27, 17-21.

Berk K. N. (1994). Data Analysis with Student SYSTAT. Cambridge, MA: Course Technology.

Dean, H. T., Arnold, F. A., & Elvove, E. (1942). Domestic water and dental caries. Public Health Reports, 57, 1155-1179.

Hampton, R. E. (1994). Introductory Biological Statistics. Dubuque, IA: Wm. C. Brown.

Kruskal, W. H. (1960). Some remarks on wild observations. Technometrics. Available: www.tufts.edu/~gdallal/out.htm.

Monder, H. (1986). [Alcohol consumption survey]. Unpublished data.

Perales, D., & Gerstman, B. B. (1995). A bi-county comparative study of bicycle helmet knowledge and use by California elementary school children. The Ninth Annual California Conference on Childhood Injury Control, San Diego, CA, March 27-29, 1995.