Data can be lost or go missing for lots of different reasons, and it’s quite important to know why, because dealing with it badly can fundamentally muck up the results of your study.
The most obvious reason for data to get lost is bad luck, for example a freak accident like a power failure in the lab meaning a blood test can’t be analysed. In this setting, the data are missing for no reason other than random chance, and are described as “missing completely at random” (MCAR). At the other end of the scale, data may be missing for reasons that are entirely understandable and intimately linked to the patient’s condition, for example having arterial blood gas measurements only on the sickest children in a cohort. These missing values, whose absence is related to other known and measured factors, are confusingly called “missing at random” (MAR). There can also be situations where things that are not recorded explain the difference between the missing and the recorded. It may be that patients presenting during the first few weeks after a new resident joins a hospital team are less likely to have all the correct blood tests done, as the admitting doctor is not familiar with the study protocol. There will be a systematic difference between those with missing data and those without (new doc vs. old hand), but the reason why the data are missing isn’t linked to anything the researchers can know about (assuming no-one tells them the doctors changed jobs). This type of missing-ness is called “missing not at random” (MNAR).
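To make the three mechanisms concrete, here is a minimal simulation sketch. Everything in it is an assumption made for illustration: the made-up cohort, the `severity` and `lab` variables, and the missingness probabilities. It uses the standard definitions, where missingness that depends only on recorded information is MAR and missingness that depends on something unrecorded (here, the unseen value itself) is MNAR.

```python
import random
import statistics

def simulate_cohort(n=20000, seed=1):
    """Hypothetical cohort: severity is always recorded, the lab value may go missing."""
    rng = random.Random(seed)
    cohort = []
    for _ in range(n):
        severity = rng.gauss(0, 1)                    # recorded for every child
        lab = 2.0 + 1.0 * severity + rng.gauss(0, 1)  # the value that may be lost
        cohort.append((severity, lab))
    return cohort

def observed_mean(cohort, prob_missing, seed=2):
    """Mean of the lab values that survive a given missingness mechanism."""
    rng = random.Random(seed)
    kept = [lab for severity, lab in cohort
            if rng.random() >= prob_missing(severity, lab)]
    return statistics.mean(kept)

# Illustrative (assumed) missingness mechanisms.
mechanisms = {
    # MCAR: pure bad luck, unrelated to anything about the patient.
    "MCAR": lambda severity, lab: 0.3,
    # MAR: depends only on something recorded (less sick children are tested less).
    "MAR": lambda severity, lab: 0.8 if severity < 0 else 0.1,
    # MNAR: depends on something never recorded, here the unseen value itself.
    "MNAR": lambda severity, lab: 0.8 if lab < 2.0 else 0.1,
}

cohort = simulate_cohort()
true_mean = statistics.mean(lab for _, lab in cohort)
observed_means = {name: observed_mean(cohort, p) for name, p in mechanisms.items()}

print(f"true mean: {true_mean:.2f}")
for name, value in observed_means.items():
    print(f"{name}: mean of observed values = {value:.2f}")
```

In this toy set-up the mean of the observed values should sit close to the true mean under MCAR and drift upwards under the other two mechanisms, because the lower values are the ones most often lost.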
While the categorisation of why the data aren’t there is interesting in itself, it also provides a window into opportunities to deal with the problem. MCAR cases can be ignored (using a ‘complete case’ or ‘available case’ analysis); although this reduces the number of episodes analysed, it doesn’t introduce bias. Undertaking this type of ‘available case’ analysis when there is an MNAR or MAR problem introduces a form of selection bias. With MAR data, imputation techniques, where the missing elements are replaced using one of a number of reasoned methods, provide a way of avoiding the problems of bias. When it’s MNAR, you’re sort of stuffed, and you have to hope it’s just a teeny tiny bit that’s not there. (For a readable introduction to these techniques, see Donders et al. [1].)
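As a rough illustration of why imputation can help when missingness depends only on recorded information, the sketch below compares a complete-case mean with a simple regression imputation on made-up data. The cohort, the variable names (`severity`, `lab`) and the probabilities are all hypothetical, and single regression imputation is used here only because it is easy to follow. It is not a recommendation: it understates uncertainty, which is what multiple imputation is designed to address.

```python
import random
import statistics

def make_mar_data(n=20000, seed=3):
    """Hypothetical cohort where the lab value is missing more often in less sick children."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        severity = rng.gauss(0, 1)                    # always recorded
        lab = 2.0 + 1.0 * severity + rng.gauss(0, 1)  # true value (unknown in real life)
        p_missing = 0.8 if severity < 0 else 0.1      # depends only on recorded severity
        observed = None if rng.random() < p_missing else lab
        rows.append((severity, lab, observed))
    return rows

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

rows = make_mar_data()
true_mean = statistics.mean(lab for _, lab, _ in rows)

# Complete-case analysis: simply drop the children with a missing lab value.
complete = [(s, obs) for s, _, obs in rows if obs is not None]
complete_case_mean = statistics.mean(obs for _, obs in complete)

# Regression imputation: predict each missing value from the recorded severity.
slope, intercept = fit_line([s for s, _ in complete], [obs for _, obs in complete])
imputed = [obs if obs is not None else intercept + slope * s for s, _, obs in rows]
imputed_mean = statistics.mean(imputed)

print(f"true mean:          {true_mean:.2f}")
print(f"complete-case mean: {complete_case_mean:.2f}")
print(f"imputed mean:       {imputed_mean:.2f}")
```

Under these assumptions the complete-case mean should overshoot the true mean, while the imputed mean should land close to it. The same trick would not rescue an MNAR problem, because the thing driving the missingness is not available to put into the model.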
1. Donders AR, van der Heijden GJ, Stijnen T, Moons KG. Review: a gentle introduction to imputation of missing values. J Clin Epidemiol 2006;59(10):1087-91.
Acknowledgement: Photo from Kotaku.com, released under a Creative Commons licence.