Notes on Data Cleaning
B. Gerstman (February, 1999)
Data cleaning is defined as "the process of excluding the information in incomplete, inconsistent records or irrelevant
information collected in a survey or other form of epidemiologic study before analysis begins" (Last, 1995). The aim of data
cleaning is to produce a file that is ready for analysis. In addition, the early detection of incomplete and inconsistent data
could allow the researcher to re-interview subjects or seek to correct other types of apparent incongruities. When correction
of the initial data is impossible, however, the data must be cleaned in some other fashion. In general, three tactics exist:
- Ignore the apparent error and analyze the data as is. This is analogous to "intention-to-treat" analysis, for it assumes
that errors will balance out on average, distributing themselves randomly. If this is so, your results will be diluted but
unbiased. When errors are not random, however, this approach should probably be avoided.
- Change apparent inconsistencies in values to "missing", but retain the rest of each record for analysis. This is often a
reasonable middle course of action.
- Discard the entire record. This is based on the premise of "guilt by association": dirty data should be filtered, because all
of the subject's responses are suspect. This method should be used only sparingly since, in my opinion, it has the greatest
potential to introduce a researcher's bias into the process.
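The three tactics above can be sketched in code. The following is a minimal illustration in Python; the records, the field names, and the 0-120 plausible-age range are hypothetical assumptions for the example, not part of these notes, and any real range check should come from the study's protocol.

```python
# Hypothetical study records; an age of 210 is the apparent inconsistency.
records = [
    {"id": 1, "age": 34, "smoker": "yes"},
    {"id": 2, "age": 210, "smoker": "no"},
    {"id": 3, "age": 58, "smoker": "yes"},
]

def is_valid_age(age):
    """Range check as it might be defined in a protocol (0-120 assumed here)."""
    return 0 <= age <= 120

# Tactic 1: ignore the apparent error and analyze the data as is.
as_is = records

# Tactic 2: change the inconsistent value to missing (None), keep the record.
set_missing = [
    {**r, "age": r["age"] if is_valid_age(r["age"]) else None}
    for r in records
]

# Tactic 3: discard the entire record ("guilt by association").
discarded = [r for r in records if is_valid_age(r["age"])]
```

Note that tactic 2 preserves the subject's other responses for analysis, while tactic 3 reduces the sample size, which is one reason to use it sparingly.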
Regardless of the method used to clean data, the procedure should be laid out in the study's protocol and should be followed
throughout the investigation.