San José State University
Department of Economics

applet-magic.com
Thayer Watkins
Silicon Valley
USA

 The Correlation of Variables Which Are the Cumulative Sums of Random Disturbances

The change in temperature of a physical body over some time interval is proportional to the net heat inflow over that interval. This is simple thermodynamics. Under natural conditions the heat inflow is a stochastic variable with fluctuations about a mean value. With a positive heat inflow the temperature of the body rises and it radiates energy away. Over time the temperature of the body reaches some equilibrium level and thereafter the net energy inflow (the heat energy inflow less the radiated energy outflow) has a mean value of zero and the temperature fluctuates about its equilibrium level.

This means the temperature is a cumulative sum of random deviations of the net energy inflow. The statistical significance of this fact cannot be overemphasized. The time paths of such variables will appear to have short term trends even when no long term trends exist. In particular the short term trends cannot be validly extrapolated. The expected future values are equal to the current value.
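
The claim that the expected future value equals the current value can be written out in a short sketch. The notation here is not from the original and is assumed only for illustration: $T_t$ is the temperature at time $t$, $\varepsilon_s$ is the deviation of the net energy inflow in period $s$ (assumed independent over time with mean zero), and $c$ is the constant of proportionality between net heat inflow and temperature change.

```latex
\begin{aligned}
T_t &= T_0 + c \sum_{s=1}^{t} \varepsilon_s, \qquad E[\varepsilon_s] = 0 \\
E[T_{t+k} \mid T_t] &= T_t + c \sum_{s=t+1}^{t+k} E[\varepsilon_s] = T_t
\end{aligned}
```

Under these assumptions, a recent run of increases in $T_t$ carries no information about the direction of the next change, which is why an apparent short term trend cannot be extrapolated.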

Two variables which are the cumulative sums of random disturbances will often appear to be correlated even when there is no correlation between the random disturbances. The correlation may be positive or negative, but in either case there may appear to be a relationship between the two variables even when no real relationship exists between them.

The extent of the spurious correlation depends upon the number of data points. For example, take the extreme case of two data points (x1, y1) and (x2, y2). If the variables change in the same direction when moving from point 1 to point 2 then the correlation coefficient is +1.0. If they move in opposite directions the correlation coefficient is −1.0. If one or both do not change the correlation coefficient is undefined. The correlation depends on the random increments of both variables. With each increment being equally likely to be positive or negative, the sample correlation coefficient will be +1.0 half of the time and −1.0 half of the time. The histogram would consist of two equal spikes, one at −1.0 and one at +1.0.
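
The two-point case can be checked numerically. The following is a minimal sketch, not the original page's code: the function names `pearson` and `two_point_correlations` are hypothetical, and the increments are assumed to be standard normal, a choice the text does not specify (any distribution symmetric about zero would serve).

```python
import math
import random


def pearson(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def two_point_correlations(trials, seed=0):
    """Correlations of (x1, y1), (x2, y2) when each variable takes one random step."""
    rng = random.Random(seed)
    results = []
    for _ in range(trials):
        dx = rng.gauss(0.0, 1.0)  # assumed increment distribution (not in the source)
        dy = rng.gauss(0.0, 1.0)
        # Point 1 is placed at the origin; point 2 is point 1 plus the increment.
        results.append(pearson([0.0, dx], [0.0, dy]))
    return results


correlations = two_point_correlations(10_000)
share_positive = sum(r > 0 for r in correlations) / len(correlations)
print(f"share of samples with correlation +1.0: {share_positive:.3f}")
```

Every value in `correlations` is +1.0 or −1.0 up to floating-point rounding, and the printed share should be close to one half.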

For longer intervals the situation is more complex. Below is shown a simulation in which 60 random values of mean zero are chosen for each of two variables. The cumulative sums are computed and then the correlation coefficient is computed for the two summed variables. The value of the correlation coefficient is displayed to five digits for each sample. Each time the REFRESH button is clicked a new sample of 60 random disturbances is selected for each variable.
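
The procedure just described can be reproduced with a short script. This is a sketch under stated assumptions rather than the page's own simulation: the disturbances are assumed to be standard normal (the text says only that they have mean zero), and the names `pearson` and `cumulative_sum_correlation` are hypothetical.

```python
import math
import random
from itertools import accumulate


def pearson(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def cumulative_sum_correlation(n, rng):
    """Correlate the running sums of two independent series of n disturbances."""
    # Assumed disturbance distribution: standard normal (not stated in the source).
    x = list(accumulate(rng.gauss(0.0, 1.0) for _ in range(n)))
    y = list(accumulate(rng.gauss(0.0, 1.0) for _ in range(n)))
    return pearson(x, y)


rng = random.Random()  # unseeded, so each run draws fresh samples
for _ in range(5):
    # Each iteration plays the role of one click of the REFRESH button.
    print(f"{cumulative_sum_correlation(60, rng):.5f}")
```

Although the two series of disturbances are generated independently, the printed coefficients will typically vary widely from one sample to the next.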

It is clear that the distribution of the sample correlation coefficient is symmetric. The symmetry of the distribution means the expected value is zero. Here are the histograms for the distributions of 1800 correlation coefficients. The first is for the case in which the data period is 60 units long.

The frequencies of the sample correlations do not drop off sharply until the values get beyond ±0.8.

A second trial with a data period of 30 was run to see whether the length of the data period significantly affects the results.

There does not seem to be much difference between the results for the two data period lengths.
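
The comparison of the two data period lengths can be sketched as a text histogram. As before, this is an illustrative reconstruction and not the code behind the original histograms: standard normal disturbances, ten equal-width bins, and the function names are all assumptions introduced here.

```python
import math
import random
from itertools import accumulate


def pearson(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def cumulative_sum_correlation(n, rng):
    """Correlate the running sums of two independent series of n disturbances."""
    x = list(accumulate(rng.gauss(0.0, 1.0) for _ in range(n)))  # assumed normal
    y = list(accumulate(rng.gauss(0.0, 1.0) for _ in range(n)))
    return pearson(x, y)


def correlation_histogram(n, samples, seed, bins=10):
    """Return (bin counts over [-1, 1], mean correlation) for repeated samples."""
    rng = random.Random(seed)
    counts = [0] * bins
    total = 0.0
    for _ in range(samples):
        r = cumulative_sum_correlation(n, rng)
        total += r
        # Map r in [-1, 1] to a bin index, clamping r = 1.0 into the last bin.
        index = min(bins - 1, max(0, int((r + 1.0) / 2.0 * bins)))
        counts[index] += 1
    return counts, total / samples


for n in (60, 30):
    counts, mean_r = correlation_histogram(n, samples=1800, seed=n)
    print(f"data period {n}: mean correlation {mean_r:+.3f}")
    for i, count in enumerate(counts):
        lower = -1.0 + 2.0 * i / len(counts)
        print(f"  [{lower:+.1f}, {lower + 0.2:+.1f}) {'#' * (count // 10)}")
```

Each `#` stands for ten samples. Printing the two histograms one after the other makes it easy to compare their shapes for the 60-unit and 30-unit data periods.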

The lesson is that variables which are sums of random variations tend to be spuriously correlated, positively or negatively. Therefore sample correlation coefficients between such variables in the range ±0.8 probably do not indicate significant correlations between the phenomena involved.