2: Stem-&-Leaf Plots, Frequency Tables, and Histograms

Stem-and-Leaf Plots
Frequency Tables
    Raw Data · Uniform Class Intervals · Nonuniform Class Intervals
Histograms

Stem-and-Leaf Plots

The stem-and-leaf plot is an excellent way to start an analysis. To construct a stem-and-leaf plot:

(A) Draw a stem-like axis that covers the range of potential values.
(B) Round the data to two or three significant digits.
(C) Separate each data-point into a stem component and leaf component. The stem component consists of all but the rightmost digit; the leaf component consists of the rightmost digit.
(D) Place each leaf value adjacent to its associated stem value, one leaf on top of the other.

To illustrate stem-and-leaf plots, let us consider a data set with the following numerical values:

21, 42, 5, 11, 30, 50, 28, 27, 24, 52

To start, draw a stem-like axis that extends from the data set's minimum to its maximum:

    |5|
    |4|
    |3|
    |2|
    |1|
    |0|
    (x 10)

An axis multiplier (x 10) is included to allow the viewer to decipher the value of each data point.

The rightmost digit of each data point (the "leaf") is then plotted against the stem-like axis. For example, a value 21 is plotted as:

    |5|
    |4|
    |3|
    |2|1
    |1|
    |0|
    (x 10)

The remaining data points are plotted:

    |5|02
    |4|2
    |3|0
    |2|1874
    |1|1
    |0|5
    (x 10)

Data are now sorted in approximate rank order, and the shape, location and spread of the distribution are evident. I'm going to flip the stem-and-leaf to a horizontal orientation to better display these features.

The location of the data can be summarized by its center. For example, the center of the above stem-and-leaf plot is located between 20 and 30.

          4
          7
          8        2
    5  1  1  0  2  0
    ----------------
    0  1  2  3  4  5
    ----------------
          ^
        Center

The spread of the distribution is seen as the dispersion of values around the distribution's center.

          4
          7
          8        2
    5  1  1  0  2  0
    ----------------
    0  1  2  3  4  5
    ----------------
     <----|---->
        Spread

The shape of the distribution can be seen as a "skyline silhouette" of the data.

          X
          X
          X        X
    X  X  X  X  X  X
    ----------------
    0  1  2  3  4  5
    ----------------

Notice the "skyscraper" in the middle of the distribution. This peak represents the distribution's "mode." The mode of this particular data set is in the interval 20 to 30. Also notice that the data are reasonably symmetric around the mode. (Nothing is perfect in statistics, especially when the sample is small.)

With a little practice, a distribution's shape, location, and spread can be visualized through the stem-and-leaf plot.
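The construction steps for the first example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data are non-negative whole numbers, the stem is everything but the rightmost digit, and the function name stem_and_leaf and its stem_unit parameter are hypothetical, not part of any statistics package.

```python
from collections import defaultdict


def stem_and_leaf(values, stem_unit=10):
    """Return the plot as text lines, largest stem on top.

    Leaves are kept in the order the data arrive, as in a hand-drawn plot.
    Assumes non-negative integers (an assumption of this sketch).
    """
    leaves = defaultdict(list)
    for v in values:
        stem, leaf = divmod(v, stem_unit)  # 21 -> stem 2, leaf 1
        leaves[stem].append(leaf)

    lowest, highest = min(leaves), max(leaves)
    lines = []
    for stem in range(highest, lowest - 1, -1):  # keep stems with no leaves
        digits = "".join(str(d) for d in leaves.get(stem, []))
        lines.append(f"|{stem}|{digits}")
    lines.append(f"(x {stem_unit})")  # axis multiplier
    return lines


data = [21, 42, 5, 11, 30, 50, 28, 27, 24, 52]
plot = stem_and_leaf(data)
print("\n".join(plot))
```

The row with the most leaves marks the interval that contains the mode, which for these data is the stem 2 row (20 to 30).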

Second Illustrative Example of a Stem-and-Leaf Plot: The next illustrative example shows how a stem-and-leaf plot can be modified to accommodate data that might not immediately lend itself to this type of plot. Consider this new data set:

1.47, 2.06, 2.36, 3.43, 3.74, 3.78, 3.94

These data have 3 significant digits and a decimal point. In such instances, we first round the data to two significant digits. The data set rounded to two significant digits is: {1.5, 2.1, 2.4, 3.4, 3.7, 3.8, 3.9}. When plotting these points, decimal points are ignored. Using stem values of 1, 2, and 3, we plot:

    |1|5
    |2|14
    |3|4789
    (x 1)

Realizing that this plot is somewhat squashed, we can spread it out by splitting each stem into two rows: the first row is reserved for leaf-values between 0 and 4 and the second row for leaf-values between 5 and 9. Here's the same data with double stem-values:

    |1|
    |1|5
    |2|14
    |2|
    |3|4
    |3|789
    (x 1)

This shows that there is more than one correct way to plot a stem-and-leaf diagram for a given data set.
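The rounding and stem-splitting in this second example can be sketched as follows. The sketch assumes data between 1 and 10, so that rounding to two significant digits is the same as rounding to one decimal place. The function name split_stem_plot is hypothetical.

```python
def split_stem_plot(values):
    """Round to one decimal place, then plot with each stem split in two.

    The first row of a stem holds leaves 0-4, the second row leaves 5-9.
    """
    # Work in tenths so that 1.47 -> 15, that is, stem 1 and leaf 5.
    tenths = sorted(round(v * 10) for v in values)

    rows = {}
    for t in tenths:
        stem, leaf = divmod(t, 10)
        half = 0 if leaf <= 4 else 1
        rows.setdefault((stem, half), []).append(leaf)

    first, last = tenths[0] // 10, tenths[-1] // 10
    lines = []
    for stem in range(first, last + 1):
        for half in (0, 1):
            digits = "".join(str(d) for d in rows.get((stem, half), []))
            lines.append(f"|{stem}|{digits}")
    return lines


data = [1.47, 2.06, 2.36, 3.43, 3.74, 3.78, 3.94]
split_plot = split_stem_plot(data)
print("\n".join(split_plot))
```

Python's round() resolves exact ties toward the even digit. None of these seven values falls on a tie, so the rounded values match those listed in the text.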

SPSS: To create a stem-and-leaf plot in SPSS, click on Statistics | Summarize | Explore and select the variable you want a stem-and-leaf plot of in the "Dependent List" dialogue box.

Frequency Tables

Raw Data

It is often useful to consider data in the form of a frequency table. Frequency tables may include three different types of frequencies. These are:

Frequency counts: The number of times a value occurs in a data set.
Relative frequencies: Frequencies expressed as percentages of the total.
Cumulative frequencies: Relative frequencies up to and including the current rank-ordered value.

An example of a table of AGE frequencies is:

Table 1. Age (in years), respiratory health survey respondents.

      AGE |  Freq  Rel.Freq  Cum.Freq.
    ------+---------------------------
        3 |     2      0.3%      0.3%
        4 |     9      1.4%      1.7%
        5 |    28      4.3%      6.0%
        6 |    37      5.7%     11.6%
        7 |    54      8.3%     19.9%
        8 |    85     13.0%     32.9%
        9 |    94     14.4%     47.2%
       10 |    81     12.4%     59.6%
       11 |    90     13.8%     73.4%
       12 |    57      8.7%     82.1%
       13 |    43      6.6%     88.7%
       14 |    25      3.8%     92.5%
       15 |    19      2.9%     95.4%
       16 |    13      2.0%     97.4%
       17 |     8      1.2%     98.6%
       18 |     6      0.9%     99.5%
       19 |     3      0.5%    100.0%
    ------+---------------------------
    Total |   654    100.0%

To create a frequency table:

(A) List all potential values in ascending order

(B) Tally frequency counts (f_i) with tick marks or some other accounting mechanism. List these frequencies in the Freq column of the table.

(C) Calculate relative frequencies (percentages) for each value (p_i = f_i / n).

(D) Calculate cumulative frequencies by adding the cumulative frequency from the prior level to the relative frequency of the current level (c_i = p_i + c_{i-1}).
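The steps above can be sketched in Python using the frequency counts from Table 1. The names frequency_table and age_counts are hypothetical. Cumulative frequencies are accumulated from unrounded relative frequencies, which is an assumption about how the table was produced; adding the rounded percentages instead can differ in the last digit.

```python
def frequency_table(counts):
    """counts maps each value to its frequency count f_i.

    Returns rows of (value, f_i, p_i, c_i) in ascending order of value,
    with p_i and c_i expressed as proportions between 0 and 1.
    """
    n = sum(counts.values())
    rows, cumulative = [], 0.0
    for value in sorted(counts):
        f = counts[value]
        p = f / n          # relative frequency, p_i = f_i / n
        cumulative += p    # cumulative frequency, c_i = p_i + c_{i-1}
        rows.append((value, f, p, cumulative))
    return rows


# Frequency counts copied from Table 1 (age in years).
age_counts = {3: 2, 4: 9, 5: 28, 6: 37, 7: 54, 8: 85, 9: 94, 10: 81,
              11: 90, 12: 57, 13: 43, 14: 25, 15: 19, 16: 13, 17: 8,
              18: 6, 19: 3}

rows = frequency_table(age_counts)
for value, f, p, c in rows:
    print(f"{value:>5} | {f:>5} {p:>8.1%} {c:>8.1%}")
```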

SPSS: To create a frequency table in SPSS, click on Statistics | Summarize | Frequencies and select the variable you want a frequency table of in the "Variable(s):" dialogue box.

As an additional example, let us reconsider the data set {21, 42, 5, 11, 30, 50, 28, 27, 24, 52}. A frequency table for this data set is:

    Value   Tally   Freq.   RelFreq   CumFreq
    -----   -----   -----   -------   -------
        5   /         1       10%       10%
       11   /         1       10%       20%
       21   /         1       10%       30%
       24   /         1       10%       40%
       27   /         1       10%       50%
       28   /         1       10%       60%
       30   /         1       10%       70%
       42   /         1       10%       80%
       50   /         1       10%       90%
       52   /         1       10%      100%
    -----------------------------------------
    TOTAL            10      100%        --

Because of the small size of the sample, this frequency table is not particularly useful. For it to become more useful, data must be grouped into class intervals.

Uniform Class Intervals

A frequency listing of individual values reveals little when the data set is small and each value appears only once or twice. To address this problem, we condense the data into class interval groupings.

There are no hard-and-fast rules for determining appropriate class intervals. Nevertheless, here are some rules-of-thumb by which to begin:

(A) Decide on an appropriate number of class-interval groupings: The optimum number of class groupings will depend on the range of values and the size of the data set. In general, large data sets can support a large number of class groupings and small data sets can support fewer. Deciding on a suitable number of class-intervals may require some trial and error. To start, try class-intervals that are of equal and convenient length (e.g., 10-year age intervals) or have substantive meaning (e.g., hypotensive / normotensive / borderline hypertensive / hypertensive). Normally, 4 to 12 class-intervals are sufficient.

(B) Determine the class interval width. This can be determined with the formula:

Class-interval width = (maximum value - minimum value) / (no. of desired class groupings)

For example, to create 8 class groupings for a data set with a maximum of 19 and minimum of 3, the class interval width = (19 - 3) / 8 = 2.
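The width formula, and the interval boundaries it implies, can be checked with a short sketch. The function name class_interval_width is hypothetical, and placing the first boundary at the minimum value is an assumption of this sketch (in practice a more convenient starting point may be chosen).

```python
def class_interval_width(minimum, maximum, groups):
    """Class-interval width = (maximum - minimum) / number of groupings."""
    return (maximum - minimum) / groups


minimum, maximum, groups = 3, 19, 8
width = class_interval_width(minimum, maximum, groups)

# Boundaries of the class intervals, starting at the minimum value.
edges = [minimum + i * width for i in range(groups + 1)]

print(width)
print(edges)
```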

(C) Set endpoint conventions. If an observation falls on the boundary between two class intervals, we need to know in which class interval it will be counted. The two choices are to: (a) include the left boundary and exclude the right boundary or (b) include the right boundary and exclude the left boundary. When faced with this choice, we will use option "a". For example, when considering a two-unit interval of 2 to 4, we will exclude the right boundary of 4, so that the interval runs from 2 (inclusive) up to 4 (exclusive).

(D) Tabulate the data: Once boundaries are established, the data are counted in the usual way.

A frequency table for the small data set {21, 42, 5, 11, 30, 50, 28, 27, 24, 52} grouped into 15-year age class-intervals can now be shown:

    Range   Tally   Freq.   RelFreq   CumFreq
    -----   -----   -----   -------   -------
     0-14   //        2       20%       20%
    15-29   ////      4       40%       60%
    30-44   //        2       20%       80%
    45-59   //        2       20%      100%
    -----------------------------------------
    TOTAL            10      100%        --
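The grouping into uniform class intervals, with the left boundary included and the right boundary excluded, can be sketched as follows. The function name group_uniform and its start parameter are hypothetical. Intervals that contain no observations are simply omitted in this sketch.

```python
from collections import Counter


def group_uniform(values, width, start=0):
    """Count values in intervals [lower, lower + width).

    A value on a boundary (such as 30) is counted in the interval that
    begins there: left boundary included, right boundary excluded.
    """
    lowers = (start + ((v - start) // width) * width for v in values)
    return dict(sorted(Counter(lowers).items()))


data = [21, 42, 5, 11, 30, 50, 28, 27, 24, 52]
width = 15
grouped = group_uniform(data, width)

n = len(data)
cumulative = 0
for lower, f in grouped.items():
    cumulative += f
    print(f"[{lower}, {lower + width})  {f:>2}  {f / n:>4.0%}  {cumulative / n:>4.0%}")
```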

SPSS: To group data in SPSS, click on Transform | Recode | Into Different Variable. This will allow you to set up ranges to serve as class intervals. After recoding the data to these new class intervals (ranges), a Statistics | Summarize | Frequencies command can be directed against the newly recoded variable.

Nonuniform Class Intervals

At times we might want to use nonuniform class-intervals when describing frequencies. For example, we may want to look at the age distribution of children with ages grouped as preschool (2-4 years), elementary school (5-11 years), middle school (12-13 years), and high school (14-19 years). The data from Table 1 can now be displayed as follows:
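One way to produce such a grouped table from the Table 1 counts is sketched below. The group labels and age bounds come from the paragraph above. The names AGE_COUNTS, GROUPS, and group_nonuniform are hypothetical, and the printed figures are computed from Table 1 rather than copied from a published table. Because ages are recorded in whole years, both bounds of each group are treated as inclusive.

```python
# Frequency counts copied from Table 1 (age in years).
AGE_COUNTS = {3: 2, 4: 9, 5: 28, 6: 37, 7: 54, 8: 85, 9: 94, 10: 81,
              11: 90, 12: 57, 13: 43, 14: 25, 15: 19, 16: 13, 17: 8,
              18: 6, 19: 3}

# (label, youngest age, oldest age), both ages inclusive.
GROUPS = [("preschool", 2, 4),
          ("elementary school", 5, 11),
          ("middle school", 12, 13),
          ("high school", 14, 19)]


def group_nonuniform(counts, groups):
    """Return rows of (label, frequency, relative freq, cumulative freq)."""
    n = sum(counts.values())
    rows, cumulative = [], 0
    for label, low, high in groups:
        f = sum(c for age, c in counts.items() if low <= age <= high)
        cumulative += f
        rows.append((label, f, f / n, cumulative / n))
    return rows


grouped_rows = group_nonuniform(AGE_COUNTS, GROUPS)
for label, f, p, c in grouped_rows:
    print(f"{label:<18} {f:>4} {p:>7.1%} {c:>7.1%}")
```

Because the groups span different numbers of years, their frequencies are not directly comparable as heights of a distribution: the elementary school group covers seven years of age and the middle school group only two.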

Histograms

The investigator may choose to graphically review frequencies in the form of a histogram. Histograms are bar charts that display frequencies or relative frequencies in the form of contiguous (touching) bars. A histogram of the age distribution data (before it was condensed into class intervals) is shown in the figure to the right:

SPSS: To create a histogram, click on Graphs | Histogram. To create a frequency polygon with SPSS, click on Graphs | Line | Define, and then select the variable you want to graph in the Category Axis dialogue box. (The line will represent the number of cases, by default.)
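Outside SPSS, a rough text histogram of the Table 1 age counts can be drawn with a few lines of Python. This is a sketch, not a reproduction of the figure: bars run horizontally, one row per age, and each mark stands for up to five respondents. The names AGE_COUNTS, text_histogram, and per_mark are hypothetical.

```python
import math

# Frequency counts copied from Table 1 (age in years).
AGE_COUNTS = {3: 2, 4: 9, 5: 28, 6: 37, 7: 54, 8: 85, 9: 94, 10: 81,
              11: 90, 12: 57, 13: 43, 14: 25, 15: 19, 16: 13, 17: 8,
              18: 6, 19: 3}


def text_histogram(counts, per_mark=5):
    """One row per value; each '#' stands for up to `per_mark` observations.

    Rounding up means every value that occurs gets at least one mark.
    """
    lines = []
    for value in sorted(counts):
        bar = "#" * math.ceil(counts[value] / per_mark)
        lines.append(f"{value:>3} | {bar} {counts[value]}")
    return lines


histogram = text_histogram(AGE_COUNTS)
print("\n".join(histogram))
```

Adjacent rows touch, as the bars of a histogram do, and the longest bar (age 9) marks the mode.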

Notation

n = sample size
f_i = frequency, value or interval i
p_i = relative frequency, value or interval i
c_i = cumulative frequency, value or interval i

Vocabulary

Cumulative frequency: the accumulation of relative frequencies up to and including the rank-ordered value or class.
Frequency: the number of times a particular item occurs.
Histogram: a bar graph of frequencies or relative frequencies in which bars touch.
Relative frequency: frequencies expressed as a percentage of the total.
Stem-and-leaf plot: a histogram-like display of raw values in a data set.