Before Data are Analyzed

• Study Design • Data Collection

Descriptive Statistics

Basic Statistical Inference

• Two Traditional Forms of Inference • Parameters and Statistics • Estimation • Hypothesis Testing • Power & Sample Size

Reporting Results

• Narrative Summary • How to Report Statistics

Addendum

• Link to Lecture Notes • Vocabulary

References

To analyze and interpret data, one must first understand fundamental statistical principles. Statistical topics are normally covered in introductory courses and texts and cannot be given full justice in this brief chapter. However, a brief review of some principles may prove helpful.

When analyzing data, one must keep clearly in mind the question that prompted the research in the first place. The research question must be articulated clearly, concisely, and accurately. It must also be an informed question.

Once the research question has been defined, a study is designed specifically to answer it. This is a key element in determining study success. Some study design features to consider are:

- How will the study outcome be measured? Will measurements be objective (so that things are observed as they are without falsifying observations to accord with some preconceived world view)? Will measurement be reliable (so that observations can be consistently repeated)?
- How will relations between factors be quantified? What parameter will be estimated?
- How large a sample will be needed to ensure a sufficiently precise answer?
- Will the study be experimental or nonexperimental? (Experimental studies entail an intervention.)
- If the study is experimental, what type of control group will be used? Will the intervention be randomized and blinded?
- If the study is nonexperimental, will observations be cross-sectional or longitudinal?
- If the study is nonexperimental, will data be prospective or retrospective? Will the sample be cross-sectional, cohort, or case-control?

These and other questions must be addressed well before collecting data. An introduction to study design can be found in Dallal (1997).

**Good data** are expensive and time-consuming to collect. Consider your data source carefully. Sources of data include medical record abstraction, questionnaires, physical exams, biospecimens, environmental sampling, direct examination, and so on. The data collection form ("instrument") must be carefully calibrated, tested, and maintained. If using a questionnaire, questions must be simple, direct, unambiguous, and non-leading. To encourage accuracy and compliance, survey questionnaires should be brief. When asking questions, *nothing should be taken for granted*.

The **study protocol** must be documented. How will the population be sampled? How will you deal with subjects who refuse to participate or are lost to follow-up? Criteria for managing missing and messy data should be discussed *before* problems are encountered. Once data are collected, how will you prevent data-processing errors? Who will be responsible for entering, cleaning, and documenting the data? Who is going to back up the data? Seemingly mundane elements of data processing must be worked out in advance of the study.

Reasonable analyses come only *after* a good description is established. The type of description appropriate to an analysis depends on the nature of the data. At its simplest, qualitative (categorical) data require counts, proportions, rates, and ratios. With quantitative (continuous) data, distributional shape, location, and spread must be described.

The shape of a distribution refers to the configuration of points when plotted. Useful graphs include the *histogram*, *stem-and-leaf plot*, *dot plot*, and *boxplot*. When assessing shape, consider the data's symmetry, modality, and kurtosis.

The location of a distribution is summarized by its center. The most common statistical measures of central location are the mean, median, and mode.

The spread of a distribution refers to its dispersion (variability) around its center. The most common summary measures of spread are the standard deviation, interquartile range, and range.
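The location and spread measures above can be computed with Python's standard library alone; the ages below are invented illustration data, not from any study.

```python
# Descriptive summaries of a small quantitative sample using only the
# Python standard library. The ages are made-up illustration data.
from statistics import mean, median, mode, stdev, quantiles

ages = [61, 54, 68, 57, 66, 54, 71, 59, 63, 70]

# Location: mean, median, and mode
print(f"mean   = {mean(ages):.1f}")
print(f"median = {median(ages):.1f}")
print(f"mode   = {mode(ages)}")

# Spread: standard deviation, interquartile range, and range
q1, q2, q3 = quantiles(ages, n=4)  # quartile cut points
print(f"sd     = {stdev(ages):.1f}")
print(f"IQR    = {q3 - q1:.2f}")
print(f"range  = {max(ages) - min(ages)}")
```

Note that `quantiles` uses an "exclusive" interpolation method by default, so hand calculations with a different quartile rule may differ slightly.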

We are also often interested in describing associations between variables. Association refers to the degree to which values "go together." Associations may be positive, negative, or neutral. The measure of association will vary depending on the nature of the data. Examples of associational measures include mean differences (paired and independent), regression coefficients, and risk ratios.
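As a sketch of one such measure, here is a risk ratio computed from a hypothetical 2×2 table of exposure versus outcome (all counts are invented):

```python
# Risk ratio and risk difference from a hypothetical 2x2 table:
# exposure (e.g., smoking) vs. outcome (e.g., disease). Counts invented.
exposed_cases, exposed_total = 30, 100
unexposed_cases, unexposed_total = 10, 100

risk_exposed = exposed_cases / exposed_total        # risk in exposed group
risk_unexposed = unexposed_cases / unexposed_total  # risk in unexposed group

risk_ratio = risk_exposed / risk_unexposed          # ratio measure
risk_difference = risk_exposed - risk_unexposed     # difference measure

print(f"risk ratio      = {risk_ratio:.1f}")   # > 1 indicates a positive association
print(f"risk difference = {risk_difference:.2f}")
```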

Statistical inference is the act of generalizing from a sample to a population with a calculated degree of certainty. The importance of inference during data analysis is difficult to overstate,

. . . for everyone who does habitually attempt the difficult task of making sense of figures is, in fact, essaying a logical process of the kind we call inductive, in that he is attempting to draw inferences from the particular to the general; or, as we more usually say in statistics, from the sample to population. (Fisher, 1935, p. 39)

The two traditional forms of statistical inference are estimation and hypothesis testing. Estimation predicts the most likely location of a parameter, while hypothesis testing ("significance" testing) provides a "yes" or "no" answer to a statistical question. Examples will illustrate their use.

It is common for epidemiologists to want to learn about the prevalence of a condition -- smoking, for instance -- based on the prevalence of the condition in a sample. From a given sample, the final inference may be that "25% of the population smokes" (point estimation). Alternatively, the inference may take the form "between 20% and 30% of the population smokes" (interval estimation). Finally, the epidemiologist might simply want to test whether smoking rates have changed over time. In such instances, a simple "yes" or "no" conclusion would suffice (hypothesis testing).

Whether one uses estimation or hypothesis testing depends on the nature of the inference and the philosophy of the investigator. When direction and "amount" are important, estimation seems most useful. When a categorical answer to a question is needed, testing seems helpful. In practice, both estimation and hypothesis testing are important.

Regardless of the inferential method used, it is important to keep clearly in mind the distinction between the *parameters* being inferred and the *estimates* used to infer them. Although the two are related, they are not interchangeable.

- Parameters are summaries of the *population*, while statistics are summaries of the *sample*.
- Parameters are unknown; statistics are calculated.
- Parameters are hypothetical, whereas statistics are "real."
- Parameters are constants; statistics are random variables.

Statisticians use different symbols to represent estimators and population parameters. For example, the symbol *p*^ ("p hat") is used to
represent a sample proportion (the estimate) and *p* is used to represent the population proportion (the parameter).

There are two forms of estimation: point estimation and interval estimation. *Point estimation* provides a single point that is maximally likely to represent the parameter. For example, a sample proportion (*p*^) may be viewed as the maximum likelihood point estimator of the population proportion (*p*). *Interval estimation* provides an interval that has a calculated likelihood of capturing the parameter. For example, a 95% confidence interval for the population proportion *p* will capture this parameter 95% of the time (or so it is said). That is, if we independently repeated the study an infinite number of times, 95% of our calculated intervals would capture the parameter and 5% would fail to capture it. However, for any given confidence interval, the parameter *is* or *isn't* captured. This level of uncertainty is inevitable when working with empirical data. At least now it is quantified.
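The repeated-sampling interpretation of a confidence interval can be checked by simulation. A minimal sketch, in which the true proportion, sample size, and number of trials are arbitrary choices for illustration:

```python
# Simulate repeated sampling and count how often a 95% confidence
# interval for a proportion captures the true parameter value.
import random
from math import sqrt

random.seed(1)
p_true, n, trials = 0.25, 500, 2000
z = 1.96  # critical value for 95% confidence

captured = 0
for _ in range(trials):
    # Draw a sample of size n and compute the point estimate p-hat
    x = sum(random.random() < p_true for _ in range(n))
    p_hat = x / n
    se = sqrt(p_hat * (1 - p_hat) / n)  # estimated standard error
    # Does the interval p-hat +/- z*se capture the true p?
    if p_hat - z * se <= p_true <= p_hat + z * se:
        captured += 1

print(f"coverage = {captured / trials:.3f}")  # close to 0.95 in the long run
```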

So what of hypothesis testing? First, we must note that there exists considerable misunderstanding about hypothesis testing. In light of this misunderstanding, we might acknowledge that two different views of the procedure exist. *Significance testing*, as described by R. A. Fisher, suggests that a *p* value can be used to quantify the evidence against the null hypothesis. In contrast, *hypothesis testing*, as described by Neyman and Pearson, provides decision rules about the null hypothesis. The extent to which these views are irreconcilable is a matter of opinion that goes well beyond the scope of this modest article. Interested readers wishing to learn more about this debate are referred to Lehmann (1993), Goodman (1993), and Bellhouse (1993). For now, let us simply note that both significance testing and hypothesis testing are misunderstood.

Second, we must be loath to "accept" any null hypothesis. John Tukey (1991) states:

Statisticians classically asked the wrong questions -- and were willing to answer with a lie, one that was often a downright lie. They asked "Are the effects of A and B different?" and they were willing to answer "no."

All we know about the world teaches us that the effects of A and B are always different -- in some decimal place - for any A and B. Thus asking "Are the effects different?" is foolish.

What we should be answering first is "Can we tell the direction in which the effects of A differ from the effects of B?" In other words, can we be confident about the direction from A to B? Is it "up," "down" or "uncertain"?

The third answer to this first question is that we are "uncertain about the direction" - it is not, and never should be, that we "accept the null hypothesis."

In other words, Tukey points out that A and B will always differ to some small degree; it is the direction of the difference, and our confidence in that call, that is decided by the test. Moreover, as is implicit in the above statement, the magnitude of the difference is not addressed by the hypothesis test. Pity.

How might we attempt to correct some of the misunderstanding of hypothesis testing, then? Let us start by addressing the intent of the test itself.

We first note that hypothesis testing starts with an irrational *assumption* of "no difference." This premise is formalized in the form of a *null hypothesis* (*H*_{0}). According to fixed-level hypothesis testing theory, the primary goal of the test is to limit the number of false rejections of null hypotheses, *not* to prevent false rejections entirely (as this would be impossible). A false rejection of the null hypothesis is referred to as a *type I error*:

|  | *H*_{0} is true | *H*_{0} is false |
| --- | --- | --- |
| Retain *H*_{0} | Correct Retention | Type II Error |
| Reject *H*_{0} | **Type I Error!** | Correct Rejection |

*Alpha* (α), or the *significance level* of the test, is the probability we are willing to take of making a type I error. By convention, alpha is usually set to 0.05 or 0.01. There is nothing sacred about .05 or .01; they are merely conventions.

Our decision about the null hypothesis will be based on a *test statistic* that describes the "unusualness" of observed differences, assuming the null hypothesis is true. When the test statistic is unlikely to have come from a population described by the null hypothesis, the null hypothesis will be rejected. Given the ubiquity of modern computer programs, this is usually done with the help of a *p* value.

We may define a *p* value as the probability of observing the current test statistic, or a test statistic more extreme than the current one, assuming the null hypothesis to be true.
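This definition can be illustrated with a one-sample z test for a proportion; all numbers below are hypothetical, chosen only to show the arithmetic.

```python
# Two-sided p value for a one-sample z test of a proportion.
# H0: p = 0.25. The observed data (n, x) are hypothetical.
from math import sqrt
from statistics import NormalDist

p0, n, x = 0.25, 400, 120      # null proportion, sample size, successes
p_hat = x / n                  # observed sample proportion
se = sqrt(p0 * (1 - p0) / n)   # standard error computed under H0
z = (p_hat - p0) / se          # test statistic

# Two-sided p value: probability, under H0, of a statistic at least
# as extreme as |z| in either tail of the standard normal curve.
p_value = 2 * (1 - NormalDist().cdf(abs(z)))
print(f"z = {z:.2f}, p = {p_value:.2g}")
```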

When the *p* value is sufficiently small -- defined as *p* < α -- the null hypothesis is rejected. This "rule" seems straightforward enough. However, the basis of this rule is so often ignored that many prominent statisticians have called for the dismissal of hypothesis testing entirely. But why dismiss such a well-established procedure, you might ask? Cohen (1994) argues:

Well, among many other things, [hypothesis testing] does not tell us what we want to know, and we so much want to know what we want to know that, out of desperation, we nevertheless believe that it does! What we want to know is "Given these data, what is the probability that *H*_{0} is true?" But as most of us know, what it tells us is "Given that *H*_{0} is true, what is the probability of these data?" These are not the same, as has been pointed out many times over the years. So, the *p* value is not a measure of how good the null hypothesis is. It is a measure of how likely the data are, assuming the null hypothesis is true. *p* values should, thereby, not be viewed as objective probabilities, since they are based on a premise that may be -- and probably is -- entirely false.

Nevertheless, hypothesis testing continues to enjoy almost universal use, and is so entrenched in research and publication that it will
probably live on forever. More importantly, *when combined with other forms of information* (such as descriptive statistics, estimation,
biological reasoning, and so on), hypothesis testing provides important safeguards against imprudent conclusions. It is therefore our
responsibility to interpret hypothesis tests correctly, and apply them when needed.

But what of the retained null hypothesis? Might this be false as well? Of course it can, and a false retention of *H*_{0} is called a type II error:

|  | *H*_{0} is true | *H*_{0} is false |
| --- | --- | --- |
| Retain *H*_{0} | Correct Retention | **Type II Error!** |
| Reject *H*_{0} | Type I Error | Correct Rejection |

The probability of a type II error is called *beta* (β), and the complement of beta (the probability of avoiding a type II error) is called *power* (1 − β). Studies with inadequate power are a waste of time, money, and resources. Since power is a function of sample size, it behooves the researcher to collect enough data during his or her studies. For an introduction to determining an adequate sample size, see Dallal (1997) at http://www.tufts.edu/~gdallal/SIZE.HTM .

Abelson, in his excellent book *Statistics as Principled Argument* (1995), suggests that the presentation of statistical results importantly entails rhetoric. The virtues of a good statistician therefore involve not only the skills of a good detective, but also the skills of a good storyteller. A good storyteller argues flexibly and in detail for a particular case; data analysis should *not* be pointlessly formal. Rather, it should make an interesting claim by telling a tale that an informed audience will care about, doing so through an intelligent interpretation of the data.

Reporting and presenting results are important parts of a statistician's job. In general, the statistician should *always use judgment* when reporting statistics, and always report findings in a way that is consistent with what he or she wishes to learn. With this in mind, here are some guidelines for reporting statistics:

- "Describe statistical methods with enough detail to enable a knowledgeable reader with access to the original data to verify the reported results. When possible, quantify findings and present them with appropriate indicators of measurement error or uncertainty (such as confidence intervals). Avoid sole reliance on statistical hypothesis testing and *p* values, for they fail to convey important quantitative information [-- a *p* value by itself is seldom acceptable] . . . Give numbers of observations. . . . Specify any general-use computer programs used." (International Committee, 1988; Bailar & Mosteller, 1988)
- The number of *decimal places* reported in final statistics is contingent on the precision of the data. Precise data warrant many decimal places; imprecise data do not. For example, an average age in adults need be reported to only one decimal place (e.g., 68.1 years), *not* four (e.g., 68.1276 years). With this said, here are rules of thumb to keep in mind when reporting results.
- For summary statistics (e.g., means, standard deviations), report one digit more than was present in the raw data. For example, if age is recorded to the nearest whole year, report the mean age to the nearest tenth of a year (e.g., mean = 54.3 years).
- For percentages, the nearest whole percent (e.g., 25%) is usually adequate (APA, 1994), although many journals prefer percentages to the nearest tenth of a percent (e.g., 25.4%).
- For test statistics, such as chi-square statistics, *t* statistics, and *F* statistics, use two-decimal-place accuracy (APA, 1994, p. 104). For example, report *t* statistic = 2.56.
- For *p* values, two significant digits will do (Bailar & Mosteller, 1988). For example, report *p* = 0.0062. Notice that leading zeros do *not* count as significant digits.
- Odds ratios and relative risks should be reported to one-decimal-place accuracy (e.g., *OR* = 3.1, not 3.11).
- Do *not* use leading zeros before a decimal point when the number cannot exceed 1 (APA, 1994, p. 104). For example, report α = .05. *Do* use leading zeros before a decimal point when the number can be greater than 1. For example, report mean serum creatinine level = 0.973 mg/dl.
- Always report units of measure. For example, mean serum creatinine = 0.973 *mg/dl*.
- Statistics in text should include sufficient information to permit the reader to corroborate the analysis (APA, 1994, p. 112; Bailar & Mosteller, 1988).
- Each journal has its own reporting standards. For example, San Jose State University requires APA style (1994), whereas the *American Journal of Public Health* requires the Uniform Biomedical Style (International Committee, 1988).
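The rounding conventions above can be sketched as small formatting helpers; the function names are my own invention for illustration, not from any style manual.

```python
# Illustrative helpers applying common rounding conventions for
# reported statistics. Function names are hypothetical.

def fmt_p(p):
    """p values: two significant digits (leading zeros don't count)."""
    return f"{p:.2g}"

def fmt_ratio(r):
    """Odds ratios and relative risks: one decimal place."""
    return f"{r:.1f}"

def fmt_test_stat(t):
    """Test statistics (t, F, chi-square): two decimal places."""
    return f"{t:.2f}"

print(fmt_p(0.0062))          # "0.0062"
print(fmt_ratio(3.11))        # "3.1"
print(fmt_test_stat(2.5647))  # "2.56"
```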

- Alpha
- Alternative hypothesis
- Association
- Beta
- Confidence interval
- Estimation
- Experimental study
- Hypothesis test
- "Instrument"
- "Location"
- Null hypothesis
- Observational study
- *p* value
- Parameter
- Point estimate
- Population
- Power
- Sample
- "Shape"
- "Spread"
- Statistic
- Statistical inference
- Type I error
- Type II error

Abelson R. P. (1995). *Statistics as Principled Argument*. Hillsdale, NJ: Lawrence Erlbaum Associates.

American Psychological Association [APA]. (1994). *Publication Manual *(4th ed.). Washington, DC: Author.

Bailar, J. C. & Mosteller, F. (1988). Guidelines for statistical reporting in articles for medical journals. *Annals of Internal Medicine*,
108, 266 - 273.

Bellhouse, D. R. (1993). Invited commentary: *p* values, hypothesis tests and likelihood. *American Journal of Epidemiology*, 137, 497 - 499.

Cohen, J. (1994). The earth is round (*p* < .05). *American Psychologist*, 49, 997 - 1003.

Dallal, G. E. (1997). *Sample Size Calculations Simplified.* http://www.tufts.edu/~gdallal/SIZE.HTM

Dallal, G. E. (1997). *Some Aspects of Study Design*. http://www.tufts.edu/~gdallal/STUDY.HTM

Fisher, R. A. (1935). The logic of inductive inference. *Journal of the Royal Statistical Society*, 98, 39 - 54.

Fisher, R. (1973). *Statistical Methods and Scientific Inference*. (3^{rd} ed.). New York: Macmillan.

Goodman, S. N. (1993). *P* values, hypothesis tests, and likelihood: implications for epidemiology of a neglected historical debate.
*American Journal of Epidemiology*, 137, 485 - 496.

International Committee of Medical Journal Editors [International Committee]. (1988). Uniform requirements for manuscripts submitted to biomedical journals. *Annals of Internal Medicine*, 108, 258 - 265.

Lehmann, E. L. (1993). The Fisher, Neyman-Pearson theories of testing hypotheses: one theory or two? *Journal of the American
Statistical Association*, 88, 1242 - 1249.

Tukey, J. W. (1991). The philosophy of multiple comparisons. *Statistical Science*, 6, 100 - 116.