2: Frequency distributions (Key Odd)

Review Questions

  1. There are n / k leaves where k is usually 1. For large data sets, k can be any integer. 
  2. Split the stem values when you want to stretch out the plot to see its shape better.
  3. The depth of an observation is its location when values in the data set are listed in ascending or descending order.
  4. The purpose is to help the viewer decipher the magnitude of the values.
  5. positive
  6. spread
  7. median
  8. arithmetic mean
  9. (n + 1) / 2 
  10. "outlier" (also called an "extreme value")
  11. frequency = count; relative frequency = frequency expressed as a proportion of the total; cumulative relative frequency = the relative frequency up to and including the ordered value
  12. Endpoint conventions are needed so we know which class gets a value that falls on a boundary. Suppose, for example, we have 10-unit age intervals in which the first interval is 0 - 10, the second is 10 - 20, and so on. Without a boundary rule, a value of 10 could fall into either the first or the second class interval. For the sake of uniformity, StatPrimer truncates the right side of each class interval. Thus, the first interval would be 0 - 9 and the second would be 10 - 19. A value of 10 would now clearly fall into the second class interval.
  13. True. Histograms have bars that touch. This is appropriate for quantitative data but not for categorical data. Use a bar chart with non-contiguous bars when plotting frequencies for categorical data.
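
The endpoint convention in answer 12 can be sketched in a few lines of Python. This is only an illustration, assuming whole-number values and equal-width classes that start at 0. The function name `class_interval` is made up for this example and does not come from StatPrimer.

```python
def class_interval(value, width=10):
    """Return the (lower, upper) limits of the class that holds `value`.

    Assumes non-negative whole-number data and equal-width classes that
    start at 0, with the right side truncated (0-9, 10-19, ...).
    """
    lower = (value // width) * width   # largest multiple of width <= value
    upper = lower + width - 1          # truncated right-hand limit
    return lower, upper


for age in (9, 10, 19, 20):
    print(age, class_interval(age))
```

With this rule a value of 10 can only land in the 10 - 19 class, which is the point of the convention.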

Exercises

2.1  Irish healthcare websites.  n = 46 

(A) Stemplot:  Shape: negative skew with a low outlier. Location: The median is 17 (underlined). Spread: Values range from 8 to 17.

08|0
09|
10|0
11|00
12|0
13|0000
14|00
15|0000000
16|0000
17|000000000000000000000000
×1

(B) 11 of 46 (24%) had a reading level of 14 or below.

2.3 Hospital duration data. Number of days hospitalized

(A) Regular stem

0|33344555567788999
1|0111147
2|
3|0
×10 (days hospitalized)

(B) Split stem

0|33344
0|555567788999
1|011114
1|7
2|
2|
3|0
×10 (days hospitalized)

(C) The split stem does a better job showing the distribution's shape and the outlier.
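
For readers who want to reproduce the two plots, here is a rough Python sketch. It assumes whole-number data and a ×10 stem multiplier; the `days` list is simply the 25 values read back from the stemplot above, and the helper name `stemplot` is hypothetical.

```python
from collections import defaultdict

# The 25 values read back from the stemplot above (days hospitalized).
days = [3, 3, 3, 4, 4, 5, 5, 5, 5, 6, 7, 7, 8, 8, 9, 9, 9,
        10, 11, 11, 11, 11, 14, 17, 30]


def stemplot(values, split=False):
    """Return stemplot lines for whole numbers, using a x10 stem multiplier.

    With split=True each stem appears twice: leaves 0-4 on the first
    line and leaves 5-9 on the second.
    """
    parts = 2 if split else 1
    rows = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        half = leaf // 5 if split else 0
        rows[(stem, half)].append(str(leaf))
    first, last = min(rows), max(rows)
    lines = []
    for stem in range(first[0], last[0] + 1):
        for half in range(parts):
            key = (stem, half)
            if first <= key <= last:   # skip rows outside the data range
                lines.append(f"{stem}|{''.join(rows.get(key, []))}")
    return lines


print("\n".join(stemplot(days)))
print()
print("\n".join(stemplot(days, split=True)))
```

Empty stems inside the data range are kept on purpose, so the gap before the value of 30 stays visible.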

(D) Interpretation. Shape: The distribution is mound-shaped with a single mode; there is a positive skew with a single high outlier. Location: The median has a depth of (25 + 1) / 2 = 13 and value of 8 (underlined). Spread: Data range from 3 to 30. 
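
The depth calculation in (D) can be checked with a short Python sketch. It assumes the data are the 25 values read from the stemplot in part (A). The function name `median_by_depth` is made up for illustration. When the depth ends in .5, the sketch averages the two neighboring values.

```python
def median_by_depth(values):
    """Return (depth, median) using the depth formula (n + 1) / 2."""
    ordered = sorted(values)
    n = len(ordered)
    depth = (n + 1) / 2
    lower = ordered[int(depth) - 1]      # value at the whole-number depth
    if depth == int(depth):
        return depth, lower
    upper = ordered[int(depth)]          # next value up, for depths ending in .5
    return depth, (lower + upper) / 2


# Values read back from the hospital-duration stemplot (n = 25).
days = [3, 3, 3, 4, 4, 5, 5, 5, 5, 6, 7, 7, 8, 8, 9, 9, 9,
        10, 11, 11, 11, 11, 14, 17, 30]

print(median_by_depth(days))   # (13.0, 8)
```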

(E) Frequency table

DAYS       Freq.       %     Cumulative %
------     ------  --------  ------------  
0 - 4         5       20%         20%
5 - 9        12       48%         68%
10 - 14       6       24%         92%
15 - 19       1        4%         96%
20 - 24       0        0%         96%
25 - 29       0        0%         96%
30 - 34       1        4%        100%
------------------------------------------
TOTAL        25      100%         --
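
The table above can be rebuilt with a few lines of Python. This is a sketch under stated assumptions: whole-number data, 5-unit classes starting at a multiple of 5, and the 25 values as read from the stemplot in part (A). The name `frequency_table` is hypothetical.

```python
from collections import Counter

# Values read back from the hospital-duration stemplot (n = 25).
days = [3, 3, 3, 4, 4, 5, 5, 5, 5, 6, 7, 7, 8, 8, 9, 9, 9,
        10, 11, 11, 11, 11, 14, 17, 30]


def frequency_table(values, width=5):
    """Return rows of (lower, upper, freq, percent, cumulative percent)."""
    n = len(values)
    # Label each value by the lower limit of its class (0, 5, 10, ...).
    counts = Counter((v // width) * width for v in values)
    rows, running = [], 0
    for lower in range(min(counts), max(counts) + width, width):
        freq = counts.get(lower, 0)      # empty classes are kept with freq 0
        running += freq
        rows.append((lower, lower + width - 1, freq,
                     100 * freq / n, 100 * running / n))
    return rows


for lower, upper, freq, pct, cum in frequency_table(days):
    print(f"{lower:>2} - {upper:<2}  {freq:>3}  {pct:>5.0f}%  {cum:>5.0f}%")
```

Classes with no observations (20 - 24 and 25 - 29) are listed with a frequency of 0, so the cumulative column stays at 96% until the final class.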

2.5 Body weight expressed as a percentage of ideal (%IDEAL), n = 18

(A) Stemplot - Data have a positive skew with a high outlier (shape), a median of 114 (underlined), and a spread from 88 to 152.

08|8
09|59
10|0147
11|444679
12|0145
13|
14|
15|2
×10

(B) Frequency table, 20-unit class intervals:

%IDEAL      Freq.   RelFreq (%)   CumFreq (%)
---------   -----   -----------   -----------
 80 - 99      3        16.7%         16.7%
100 - 119    10        55.6          72.2
120 - 139     4        22.2          94.4
140 - 159     1         5.6         100.0%
---------------------------------------------
TOTAL        18       100.0%          --

2.7 Children of physicians (DOCKIDS)

(A) The stemplot is shown below. Data have a positive skew (shape), a median of 2 (underlined, central location), and spread from 0 to 7. 

0|0000
1|00000
2|000000
3|000
4|00
5|0
6|00
7|0
No. of children

(B) Freq table

Value   Freq.  RelFreq (%)   CumFreq (%)
------  -----  -----------   ----------  
0         4      16.7%        16.7%  
1         5      20.8         37.5
2         6      25.0         62.5 
3         3      12.5         75.0
4         2       8.3         83.3
5         1       4.2         87.5
6         2       8.3         95.8
7         1       4.2        100.0%
-----------------------------------
TOTAL    24     100.0%         --

(C) 75%
(D) 62.5%

2.9 Grad student ages (n = 36). There is a low outlier (16), which could represent a data entry error (26?) or could be a prodigy. The median is 28 (underlined). Data spread from 16 to 33. 

16|0
17|
18|
19|
20|
21|
22|
23|
24|00
25|00000
26|00
27|0000
28|0000000
29|0000000
30|0000
31|0
32|00
33|0
×1

2.11 UNICEF low birth weight data (UNICEF.SAV)

(A) Here's a picture of the SPSS plot. There is a positive skew, a low outlier, and four high outliers ("extremes"). The depth of the median is (109 + 1) / 2 = 55, so the median must be 9. Data spread from 1 to something greater than or equal to 28. We do not know the precise values of the extremes from the SPSS stemplot, but we can see from the frequency table (part C) that the maximum value is 39.

(B) USA value = 7

(C) Frequency table. From the frequency table below, we can see that a value of 7 has a cumulative percent of 36.7. So do 12 other countries.

Aside: How to rank ties. We often encounter ties when data are put into depth order. We can rank tied values a little more accurately by sorting the data from low to high and assigning each value an initial depth, wherever it happens to fall. When there is a tie, we give each value in that tie the average of the tied depths. For instance, the lowest data points in rank order are shown below. Notice that Finland, Ireland, Norway, and Sweden all have a value of 4, putting them at depths 2 through 5. It doesn't make sense to give these four countries different ranks, so we average the depths and give each a rank of (2 + 5) / 2 = 3.5. (If we continued with this logic, we could show that the USA has a rank of 34.)

Country    Data   Rough Depth   Rank
--------   ----   -----------   ------------
Spain        1         1        1
Finland      4         2        tied for 3.5
Ireland      4         3        tied for 3.5
Norway       4         4        tied for 3.5
Sweden       4         5        tied for 3.5
etc.
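
The tie-averaging rule from the aside can be sketched in Python as follows. Only the five countries listed above are used, so this is an illustration and not the full UNICEF ranking. The function name `average_ranks` is hypothetical.

```python
def average_ranks(values):
    """Return ranks (1 = lowest), giving tied values their average depth."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend `end` to the last position that holds the same value.
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        shared = ((start + 1) + (end + 1)) / 2   # mean of first and last depth
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


countries = ["Spain", "Finland", "Ireland", "Norway", "Sweden"]
low_birth_weight = [1, 4, 4, 4, 4]
for name, rank in zip(countries, average_ranks(low_birth_weight)):
    print(name, rank)
```

Averaging the first and last depth of a tied run gives the same result as averaging every depth in the run, because the depths are consecutive integers.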