Before Data are Analyzed
• Study Design • Data Collection
Basic Statistical Inference
• Two Traditional Forms of Inference • Parameters and Statistics • Estimation • Hypothesis Testing • Power & Sample Size
• Narrative Summary • How to Report Statistics
• Link to Lecture Notes • Vocabulary
To analyze and interpret data, one must first understand fundamental statistical principles. Statistical topics are normally covered in introductory courses and texts and cannot be given full justice in this brief chapter. However, a brief review of some principles may prove helpful.
When analyzing data, one must keep clearly in mind the question that prompted the research in the first place. The research question must be articulated clearly, concisely, and accurately, and it must be informed by existing knowledge.
Once the research question has been defined, a study is designed specifically to answer it. This is the key element in determining study success. Some study design features to consider are:
These and other questions must be addressed well before collecting data. An introduction to study design is provided by Dallal (1997).
Good data are expensive and time-consuming to collect. Consider your data source carefully. Sources of data include medical record abstraction, questionnaires, physical exams, biospecimens, environmental sampling, direct examination, and so on. The data collection form ("instrument") must be carefully calibrated, tested, and maintained. If using a questionnaire, questions must be simple, direct, unambiguous, and non-leading. To encourage accuracy and compliance, survey questionnaires should be brief. When asking questions, nothing should be taken for granted.
The study protocol must be documented. How will the population be sampled? How will you deal with subjects who refuse to participate or are lost to follow-up? Criteria for managing missing and messy data should be discussed before problems are encountered. Once data are collected, how will you prevent data processing errors? Who will be responsible for entering, cleaning, and documenting the data? Who is going to back-up data? Seemingly mundane elements of data processing must be worked out in advance of the study.
Reasonable analyses come only after a good description is established. The type of description appropriate to an analysis depends on the nature of the data. At its simplest, qualitative (categorical) data require counts, proportions, rates, and ratios. With quantitative (continuous) data, distributional shape, location, and spread must be described.
The shape of a distribution refers to the configuration of its points when plotted. Useful graphs include histograms, stem-and-leaf plots, dot plots, and boxplots. When assessing shape, consider the data's symmetry, modality, and kurtosis.
The location of a distribution is summarized by its center. The most common statistical measures of central location are the mean, median, and mode.
The spread of a distribution refers to its dispersion (variability) around its center. The most common summary measures of spread are the standard deviation, interquartile range, and range.
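As an illustrative sketch, the common measures of location and spread described above can be computed with Python's standard library. The data values below are hypothetical.

```python
# Summarizing the location and spread of a small quantitative sample
# using only the Python standard library (hypothetical data).
import statistics

data = [4, 7, 7, 8, 9, 10, 12, 13, 15, 21]

# Measures of central location
mean = statistics.mean(data)      # arithmetic average
median = statistics.median(data)  # middle value when data are ordered
mode = statistics.mode(data)      # most frequently occurring value

# Measures of spread
sd = statistics.stdev(data)                   # sample standard deviation
q1, q2, q3 = statistics.quantiles(data, n=4)  # quartile cut points
iqr = q3 - q1                                 # interquartile range
rng = max(data) - min(data)                   # range

print(f"mean={mean}, median={median}, mode={mode}")
print(f"sd={sd:.2f}, IQR={iqr}, range={rng}")
```

Note that `statistics.quantiles` (Python 3.8+) uses the "exclusive" method by default; other quartile conventions give slightly different interquartile ranges.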
We are also often interested in describing associations between variables. Association refers to the degree to which values "go together." Associations may be positive, negative, or neutral. The appropriate measure of association will vary depending on the nature of the data. Examples of associational measures include mean differences (paired and independent), regression coefficients, and risk ratios.
Statistical inference is the act of generalizing from a sample to a population with calculated degree of certainty. The importance of inference during data analysis is difficult to overstate,
. . . for everyone who does habitually attempt the difficult task of making sense of figures is, in fact, essaying a logical process of the kind we call inductive, in that he is attempting to draw inferences from the particular to the general; or, as we more usually say in statistics, from the sample to population. (Fisher, 1935, p. 39)
The two traditional forms of statistical inference are estimation and hypothesis testing. Estimation predicts the most likely location of a parameter and hypothesis testing ("significance" testing) provides a "yes" or "no" answer to a statistical question. Examples will illustrate their use.
It is common for epidemiologists to want to learn about the prevalence of a condition -- smoking, for instance -- based on the prevalence of the condition in a sample. Based on a given sample, the final inference may be that "25% of the population smokes" (point estimation). Alternatively, the inference may take the form "between 20% and 30% of the population smokes" (interval estimation). Finally, the epidemiologist might simply want to test whether smoking rates have changed over time. In such instances, a simple "yes" or "no" conclusion would suffice (hypothesis testing).
Whether one uses estimation or hypothesis testing depends on the nature of the inference and the philosophy of the investigator. When direction and "amount" are important, estimation seems most useful. When a categorical answer to a question is needed, testing seems helpful. In practice, both estimation and hypothesis testing are important.
Regardless of the inferential method used, it is important to keep clearly in mind the distinction between the parameters being inferred and the estimates used to infer them. Although the two are related, they are not interchangeable.
Statisticians use different symbols to represent estimators and population parameters. For example, the symbol p^ ("p hat") is used to represent a sample proportion (the estimate) and p is used to represent the population proportion (the parameter).
There are two forms of estimation: point estimation and interval estimation. Point estimation provides a single point that is maximally likely to represent the parameter. For example, a sample proportion (p^) may be viewed as the maximum likelihood point estimator of the population proportion (p). Interval estimation provides an interval that has a calculated likelihood of capturing the parameter. For example, a 95% confidence interval for population proportion p will capture this parameter 95% of the time (or so it is said). That is, if we independently repeated the study an infinite number of times, 95% of our calculated intervals would capture the parameter and 5% would fail to capture it. For any given confidence interval, however, the parameter either is or is not captured. This level of uncertainty is inevitable when working with empirical data. At least now it is quantified.
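The smoking example above can be sketched in code. This is a minimal illustration of point and interval estimation for a proportion using the normal-approximation (Wald) 95% confidence interval; the sample counts are hypothetical.

```python
# Point and interval estimation of a population proportion
# (Wald normal-approximation interval; hypothetical counts).
import math

smokers, n = 125, 500

# Point estimate: the sample proportion p-hat
p_hat = smokers / n

# 95% interval estimate: p-hat +/- 1.96 standard errors
se = math.sqrt(p_hat * (1 - p_hat) / n)
lower = p_hat - 1.96 * se
upper = p_hat + 1.96 * se

print(f"point estimate: {p_hat:.2f}")
print(f"95% CI: ({lower:.3f}, {upper:.3f})")
```

With 125 smokers in a sample of 500, the point estimate is 25%, and the interval estimate runs from roughly 21% to 29% -- the "between 20% and 30%" style of inference described above.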
So what of hypothesis testing? First, we must note that there exists considerable misunderstanding about hypothesis testing. Part of the misunderstanding stems from the fact that two different views of the procedure exist. Significance testing, as described by R. A. Fisher, uses the p value to quantify the evidence against the null hypothesis. In contrast, hypothesis testing, as described by Neyman and Pearson, provides decision rules for rejecting or retaining the null hypothesis. The extent to which these views are irreconcilable is a matter of opinion that goes well beyond the scope of this modest article. Interested readers wishing to learn more about this debate are referred to Lehmann (1993), Goodman (1993), and Bellhouse (1993). For now, let us simply note that both significance testing and hypothesis testing are widely misunderstood.
Second, we must be loath to "accept" any null hypothesis. John Tukey (1991) states:
Statisticians classically asked the wrong questions -- and were willing to answer with a lie, one that was often a downright lie. They asked "Are the effects of A and B different?" and they were willing to answer "no."
All we know about the world teaches us that the effects of A and B are always different -- in some decimal place -- for any A and B. Thus asking "Are the effects different?" is foolish.
What we should be answering first is "Can we tell the direction in which the effects of A differ from the effects of B?" In other words, can we be confident about the direction from A to B? Is it "up," "down" or "uncertain"?
The third answer to this first question is that we are "uncertain about the direction" -- it is not, and never should be, that we "accept the null hypothesis."
In other words, Tukey points out that A and B will always be different to some small degree; what the test decides is the direction of the difference and our confidence in that decision. Moreover, and implicit in the above statement, the magnitude of the difference is not addressed by the hypothesis test. Pity.
How might we attempt to correct some of the misunderstanding of hypothesis testing, then? Let us start by addressing the intent of the test itself.
We first note that hypothesis testing starts with an irrational assumption of "no difference." This premise is formalized in the form of a null hypothesis (H0). According to fixed-level hypothesis testing theory, the primary goal of the test is to limit the number of false rejections of the null hypothesis, not to prevent false rejections entirely (as this would be impossible). A false rejection of the null hypothesis is referred to as a type I error:
|           | H0 is true        | H0 is false       |
| --------- | ----------------- | ----------------- |
| Retain H0 | Correct Retention | Type II Error     |
| Reject H0 | **Type I Error!** | Correct Rejection |
Alpha (a), or the significance level of the test, is the probability of a type I error that we are willing to accept. By convention, alpha is usually set to 0.05 or 0.01. There is nothing sacred about .05 or .01; they are merely conventions.
Our decision about the null hypothesis will be based on a test statistic that describes the "unusualness" of the observed differences, assuming the null hypothesis is true. When the test statistic is unlikely to have come from a population described by the null hypothesis, the null hypothesis is rejected. Given the ubiquity of modern computer programs, this is usually done with the help of a p value.
We may define a p value as the probability of observing the current test statistic, or one more extreme, assuming the null hypothesis to be true.
When the p value is sufficiently small -- defined as p value < a -- the null hypothesis is rejected. This "rule" seems straightforward enough. However, the basis of this rule is so often ignored that many prominent statisticians have called for the dismissal of hypothesis testing entirely. But why dismiss such a well-established procedure, you might ask? Cohen (1994) argues:
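The decision rule above can be sketched with a one-sample z test of a proportion. This is an illustrative example only; the null value, sample counts, and alpha below are hypothetical choices, and the two-sided p value is computed from the standard normal distribution via `math.erfc`.

```python
# Fixed-level hypothesis testing sketch: one-sample z test of
# H0: p = 0.25 against a two-sided alternative (hypothetical data).
import math

def normal_two_sided_p(z):
    """Two-sided p value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))

p0 = 0.25            # population proportion claimed by H0
smokers, n = 160, 500
p_hat = smokers / n  # observed sample proportion

# Test statistic: how unusual is p-hat if H0 is true?
se0 = math.sqrt(p0 * (1 - p0) / n)  # standard error under H0
z = (p_hat - p0) / se0

p_value = normal_two_sided_p(z)
alpha = 0.05
decision = "reject H0" if p_value < alpha else "retain H0"
print(f"z={z:.2f}, p={p_value:.4f}: {decision}")
```

Here the observed proportion (0.32) lies several standard errors from the null value, so the p value falls well below alpha and the null hypothesis is rejected.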
Well, among many other things, [hypothesis testing] does not tell us what we want to know, and we so much want to know what we want to know that, out of desperation, we nevertheless believe that it does! What we want to know is "Given these data, what is the probability that H0 is true?" But as most of us know, what it tells us is "Given that H0 is true, what is the probability of these data?" These are not the same, as has been pointed out many times over the years. So, the p value is not a measure of how good the null hypothesis is. It is a measure of how likely the data are, assuming the null hypothesis is true. p values should therefore not be viewed as objective probabilities of the null hypothesis, since they are based on a premise that may be -- and probably is -- entirely false.
Nevertheless, hypothesis testing continues to enjoy almost universal use, and is so entrenched in research and publication that it will probably live on forever. More importantly, when combined with other forms of information (such as descriptive statistics, estimation, biological reasoning, and so on), hypothesis testing provides important safeguards against imprudent conclusions. It is therefore our responsibility to interpret hypothesis tests correctly, and apply them when needed.
But what of the retained null hypothesis? Might it be false as well? Of course it might, and a false retention of H0 is called a type II error:
|           | H0 is true        | H0 is false        |
| --------- | ----------------- | ------------------ |
| Retain H0 | Correct Retention | **Type II Error!** |
| Reject H0 | Type I Error      | Correct Rejection  |
The probability of a type II error is called beta (b), and the complement of beta (the probability of avoiding a type II error) is called power (1 - b). Studies with inadequate power are a waste of time, money, and resources. Since power is a function of sample size, it behooves the researcher to collect enough data. For an introduction to determining an adequate sample size, see http://www.tufts.edu/~gdallal/SIZE.HTM .
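As a rough sketch of how power and sample size are linked, here is the standard normal-approximation formula for the number of subjects per group needed to compare two means. The standard deviation and detectable difference below are hypothetical assumptions, not recommendations.

```python
# Sample size per group for comparing two means
# (normal-approximation formula; hypothetical planning values).
import math

z_alpha = 1.96   # two-sided test at alpha = 0.05
z_beta = 0.84    # 80% power (beta = 0.20)

sigma = 10.0     # assumed standard deviation of the outcome
delta = 5.0      # smallest difference in means worth detecting

# n per group = 2 * (z_alpha + z_beta)^2 * sigma^2 / delta^2
n_per_group = 2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2
n_per_group = math.ceil(n_per_group)
print(f"need about {n_per_group} subjects per group")
```

Note how the required n grows with the square of sigma/delta: halving the detectable difference quadruples the sample size, which is why underpowered studies are so common.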
Abelson, in his excellent book Statistics as Principled Argument (1995), suggests that the presentation of statistical results importantly entails rhetoric. The virtues of a good statistician therefore involve not only the skills of a good detective, but also the skills of a good storyteller. As a good storyteller, it is essential to argue flexibly and in detail for a particular case; data analysis should not be pointlessly formal. Rather, it should make an interesting claim by telling a tale that an informed audience will care about, doing so through an intelligent interpretation of data.
Reporting and presenting results are important parts of a statistician's job. In general, the statistician should always use judgement when reporting statistics, and always report findings in a way that is consistent with what he or she wishes to learn. With this in mind, here are some guidelines for reporting statistics:
Abelson R. P. (1995). Statistics as Principled Argument. Hillsdale, NJ: Lawrence Erlbaum Associates.
American Psychological Association [APA]. (1994). Publication Manual (4th ed.). Washington, DC: Author.
Bailar, J. C. & Mosteller, F. (1988). Guidelines for statistical reporting in articles for medical journals. Annals of Internal Medicine, 108, 266 - 273.
Bellhouse, D. R. (1993). Invited commentary: p values, hypothesis tests and likelihood. American Journal of Epidemiology, 137, 497 - 499.
Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997 - 1003.
Dallal, G. E. (1997). Sample Size Calculations Simplified. http://www.tufts.edu/~gdallal/SIZE.HTM
Dallal, G. E. (1997). Some Aspects of Study Design. http://www.tufts.edu/~gdallal/STUDY.HTM
Fisher, R. A. (1935). The logic of inductive inference. Journal of the Royal Statistical Society, 98, 39 - 54.
Fisher, R. (1973). Statistical Methods and Scientific Inference. (3rd ed.). New York: Macmillan.
Goodman, S. N. (1993). P values, hypothesis tests, and likelihood: implications for epidemiology of a neglected historical debate. American Journal of Epidemiology, 137, 485 - 496.
International Committee of Medical Journal Editors [International Committee]. (1988). Uniform requirements for manuscripts submitted to biomedical journals. Annals of Internal Medicine, 108, 258 - 265.
Lehmann, E. L. (1993). The Fisher, Neyman-Pearson theories of testing hypotheses: one theory or two? Journal of the American Statistical Association, 88, 1242 - 1249.
Tukey, J. W. (1991). The philosophy of multiple comparisons. Statistical Science, 6, 100 - 116.