Notes on Data Cleaning

B. Gerstman (February, 1999)

Data cleaning is defined as "the process of excluding the information in incomplete, inconsistent records or irrelevant information collected in a survey or other form of epidemiologic study before analysis begins" (Last, 1995). The aim of data cleaning is to produce a file that is ready for analysis. In addition, the early detection of incomplete and inconsistent data could allow the researcher to re-interview subjects or seek to correct other types of apparent incongruities. When correction of the initial data is impossible, however, data must be cleaned in some other fashion. In general, three tactics exist. These are:

Regardless of the method used to clean data, the procedure should be laid out in the study's protocol and followed throughout the investigation.
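As a rough illustration of the detection step only, the sketch below flags incomplete and inconsistent records so they can be reviewed before analysis. The field names (age, sex, pregnant), the sample records, and the 0 to 120 age range are all hypothetical and are not taken from any particular study; an actual protocol would define its own checks.

```python
# Hypothetical survey records; field names and values are illustrative only.
records = [
    {"id": 1, "age": 34, "sex": "F", "pregnant": "no"},
    {"id": 2, "age": None, "sex": "M", "pregnant": "no"},   # incomplete: age missing
    {"id": 3, "age": 41, "sex": "M", "pregnant": "yes"},    # inconsistent combination
    {"id": 4, "age": 250, "sex": "F", "pregnant": "no"},    # implausible value
]

# Fields that must be present for a record to count as complete (assumed).
REQUIRED = ("age", "sex", "pregnant")


def check_record(rec):
    """Return a list of problems found in one record (empty if none)."""
    problems = []

    # Completeness check: every required field must have a value.
    for field in REQUIRED:
        if rec.get(field) is None:
            problems.append(f"missing {field}")

    # Range check: the plausible age limits here are an assumption.
    age = rec.get("age")
    if age is not None and not (0 <= age <= 120):
        problems.append("age out of range")

    # Consistency check: two fields that contradict each other.
    if rec.get("sex") == "M" and rec.get("pregnant") == "yes":
        problems.append("male recorded as pregnant")

    return problems


# Collect only the records that have at least one problem, keyed by id.
flagged = {}
for rec in records:
    problems = check_record(rec)
    if problems:
        flagged[rec["id"]] = problems

for rec_id, problems in flagged.items():
    print(rec_id, problems)
```

Flagging records rather than altering them keeps the original data intact, so the decision about how to handle each problem can follow whatever rule the protocol specifies.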