Why standardisation threatens reproducibility

BMJ Open Science supports a number of initiatives to help make the work that we publish reproducible, such as pre-submission manuscript checking, encouraging the use of reporting guidelines, and asking authors to report the strengths and limitations of their experiments. However, reproducibility should also be considered at the study design stage. A recent study1 based on extensive preclinical data emphasises the importance of heterogeneous study samples for improving reproducibility.

By Hanno Würbel @HannoWurbel

Most preclinical animal research relies on single-laboratory studies conducted under rigorously standardised conditions. However, the results of such studies may only be valid under the specific standardised conditions, thereby compromising reproducibility. To examine whether more heterogeneous study samples improve reproducibility, we compared single-laboratory with multi-laboratory studies by sampling data from 440 preclinical studies on 13 different treatments in animal models of stroke, heart attack, and breast cancer. To create heterogeneous study samples, we simulated multi-laboratory studies by combining data from two, three, or four studies from different laboratories, as if they had been conducted in parallel. Indeed, results of single-laboratory studies varied much more widely between studies than those of multi-laboratory studies. Importantly, multi-laboratory studies improved reproducibility without a need for larger sample sizes, indicating that robust evidence can be obtained with fewer animals.
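The logic of this comparison can be illustrated with a toy simulation. The sketch below is not the analysis from the paper, which resampled data from real studies; it generates synthetic data instead, and the function names and all parameter values (true effect, between-laboratory spread, noise, group size) are assumptions chosen purely for illustration. Each simulated study uses the same total number of animals per group, split across one to four laboratories, and each laboratory has its own true treatment effect.

```python
import random
import statistics


def simulate_study(n_labs, n_per_group, rng,
                   true_effect=1.0, lab_sd=0.5, noise_sd=1.0):
    """Estimate the treatment effect from one study spread across n_labs labs.

    All parameter values are illustrative assumptions, not values from the paper.
    """
    per_lab = n_per_group // n_labs  # total sample size stays fixed
    control, treated = [], []
    for _ in range(n_labs):
        # Each laboratory has its own true effect (treatment-by-laboratory interaction).
        lab_effect = rng.gauss(true_effect, lab_sd)
        control += [rng.gauss(0.0, noise_sd) for _ in range(per_lab)]
        treated += [rng.gauss(lab_effect, noise_sd) for _ in range(per_lab)]
    return statistics.mean(treated) - statistics.mean(control)


def spread_of_estimates(n_labs, n_per_group=12, n_studies=2000, seed=1):
    """Standard deviation of effect estimates across many simulated studies."""
    rng = random.Random(seed)
    estimates = [simulate_study(n_labs, n_per_group, rng)
                 for _ in range(n_studies)]
    return statistics.stdev(estimates)


if __name__ == "__main__":
    for k in (1, 2, 3, 4):
        print(f"{k} lab(s): spread of effect estimates = "
              f"{spread_of_estimates(k):.3f}")
```

Under these assumptions, the spread of effect estimates between studies shrinks as the same number of animals is distributed over more laboratories, because the laboratory-specific deviations partly average out within each study.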

Why do multi-laboratory studies improve reproducibility?

The reason why results of multi-laboratory studies have better reproducibility is that they account for the heterogeneity that exists between independent studies. Since their study samples are more representative of the population of independent studies, they are less likely to yield idiosyncratic results.

In a blog post2 on our paper, DrugMonkey argued that our findings demonstrate better generalisation of results, not reproducibility. “They continually misuse reproducibility when they really mean to refer to generalisation. And this harms science”, DrugMonkey complained. They have a point in that our findings depend on better generalisation of results from multi-laboratory studies. However, they fail to acknowledge that reproducibility also depends on generalisability – at least in the real world.

In theory, one might expect to find the same result if a study was replicated under exactly the same conditions. However, the conditions of any two studies can never be exactly the same. John Crabbe and colleagues3 once set out to conduct the same study – testing behavioural differences between inbred and mutant strains of mice – in exactly the same way in three different laboratories. They went to great lengths to harmonise the three studies by ordering mice from the same breeders, housing them under the same conditions, and using the same protocol for testing. They failed miserably, concluding, “Experiments characterizing mutants may yield results that are idiosyncratic to a particular laboratory.”

The standardisation fallacy

That standardisation may be bad for reproducibility appears counterintuitive to most scientists. However, differences between laboratories are unavoidable: the animals differ, as do the people interacting with them, the animals’ microbiome, and their sensory perceptions and experiences, all of which may affect the animals’ phenotype and thus the outcome of the study. Therefore, different laboratories inherently standardise to different local study conditions4.

For results to be reproducible across independent studies, they need to generalise to at least such unavoidable differences between study conditions. This requires heterogenisation of study conditions, not standardisation. Attempts to enhance reproducibility through ever more rigorous standardisation are reminiscent of Alice running on the spot in Lewis Carroll’s “Through the Looking-Glass”5. With every additional variable one specifies and standardises within a study, the range of conditions under which the results will be valid (the external validity) narrows down. Therefore, with increasingly rigorous standardisation, the results in different laboratories will become more and more distinct and replicate studies more and more unlikely to yield the same result – a typical fallacy. The standardisation fallacy6 is the attempt to enhance reproducibility at the expense of external validity. Since reproducibility is a function of external validity, this approach is doomed to fail.

Time for a paradigm shift

Historically, standardisation was introduced for good reasons; it was meant to avoid or control for confounding factors to guarantee the internal validity of inferences. It only became a problem when scientists began using standardisation to get rid of biological variation. Henceforth, the aim was to render study populations ever more homogeneous through genetic and environmental standardisation. Standardisation thus lost its original meaning, turning into homogenisation instead. The idea was that results would become cleaner, more precise; and precision – so it was hoped – should guarantee reproducibility. These hopes turned out to be false.

In contrast to other sources of poor reproducibility (e.g. p-hacking, small sample size, publication bias, and fraud), which represent violations of good laboratory practice, standardisation (aka homogenisation) counts as good laboratory practice7. Because of this and because of its historical foundation and intuitive appeal, the standardisation fallacy is more difficult to overcome. A paradigm shift is needed; instead of trying to spirit biological variation away, we need to start embracing it8.

How to overcome the standardisation fallacy?

There is no single solution. Experimental design is the very art of the experimental sciences, and the best studies are tailored to their specific aims and contexts. There is even room for rigorously standardised studies in exploratory research or proof-of-concept studies that do not aim at generalisation. And for those that do, multi-laboratory studies are not the only, perhaps not even the best, solution. Multi-laboratory studies introduce variation in an uncontrolled manner; systematic variation of relevant factors within single-laboratory studies may represent a better-controlled way of heterogenisation9. Michael Festing made a similar point in favour of using multiple inbred strains of mice instead of a single outbred strain10. Furthermore, where the extent of between-laboratory variation is known because many laboratories use similar methodologies (e.g. in behavioural phenotyping of mouse mutants), such variation may be accounted for statistically by including a treatment-by-laboratory interaction term in the statistical model11. However, none of this can replace the need for critical evaluation and reporting of the scope and limitations of study results, including their external validity. Such an evaluation should also form part of the ethical harm-benefit analysis of animal research protocols12.
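To make the idea of a treatment-by-laboratory interaction term more concrete, here is a minimal sketch of how a known between-laboratory variation could widen the uncertainty of a single-laboratory result. It is only loosely in the spirit of the approach cited above, not a reproduction of it. The function names are hypothetical, and it assumes that the interaction is summarised as the standard deviation of the laboratory-specific treatment effect around its mean, estimated beforehand from multi-laboratory data.

```python
import math


def naive_standard_error(sd_pooled, n_per_group):
    """Standard error of a treatment-control difference within one laboratory."""
    return math.sqrt(2 * sd_pooled ** 2 / n_per_group)


def adjusted_standard_error(sd_pooled, n_per_group, interaction_sd):
    """Standard error inflated by the treatment-by-laboratory interaction.

    interaction_sd is assumed to be the standard deviation of the
    laboratory-specific treatment effect, known from earlier
    multi-laboratory data (an assumption of this sketch).
    """
    within = 2 * sd_pooled ** 2 / n_per_group
    return math.sqrt(within + interaction_sd ** 2)


if __name__ == "__main__":
    # Illustrative values only.
    effect, sd_pooled, n_per_group, interaction_sd = 1.0, 1.0, 12, 0.5
    z_naive = effect / naive_standard_error(sd_pooled, n_per_group)
    z_adjusted = effect / adjusted_standard_error(
        sd_pooled, n_per_group, interaction_sd)
    print(f"z without interaction term: {z_naive:.2f}")
    print(f"z with interaction term:    {z_adjusted:.2f}")
```

With these illustrative numbers, an effect that looks convincing within one laboratory becomes much less so once the variation expected between laboratories is taken into account, which is the kind of critical evaluation of external validity argued for above.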


1 Voelkl, B., Vogt, L., Sena, E.S., Würbel, H. 2018. Reproducibility of preclinical animal research improves with heterogeneity of study samples. PLOS Biol. 16(2):e2003693. https://doi.org/10.1371/journal.pbio.2003693

2 DrugMonkey (2018, February 26). Generalization, not “reproducibility” [Blog post]. Retrieved from http://drugmonkey.scientopia.org/2018/02/26/generalization-not-reproducibility/

3 Crabbe, J.C., Wahlsten, D., Dudek, B.C. 1999. Genetics of mouse behavior: Interactions with laboratory environment. Science 284: 1670–1672.

4 Richter, S.H., Garner, J.P., Würbel, H. 2009. Environmental standardization: Cure or cause of poor reproducibility in animal experiments? Nat. Meth. 6: 257–261.

5 Würbel, H., Garner, J.P. 2007. Refinement of rodent research through environmental enrichment and systematic randomization. NC3Rs #9. http://www.nc3rs.org.uk

6 Würbel, H. 2000. Behaviour and the standardization fallacy. Nat. Genet. 26: 263.

7 Beynen, A.C., Gärtner, K., van Zutphen, L.F.M. 2003. Standardization of animal experimentation. In: Zutphen, L.F.M., Baumans, V., Beynen, A.C., editors. Principles of laboratory animal science. 2nd ed. Amsterdam: Elsevier Ltd. pp. 103–110.

8 Karp, N.A. 2018. Reproducible preclinical research—Is embracing variability the answer? PLOS Biol. 16(3): e2005413.

9 Richter, S.H., Garner, J.P., Auer, C., Kunert, J., Würbel, H. 2010. Systematic variation improves reproducibility of animal experiments. Nat. Meth. 7: 167–168.

10 Festing, M.F. 2010. Inbred strains should replace outbred stocks in toxicology, safety testing, and drug development. Toxicol. Pathol. 38(5): 681–690.

11 Kafkafi, N., Golani, I., Jaljuli, I., Morgan, H., Sarig, T., Würbel, H., Yaacoby, S., Benjamini, Y. 2017. Addressing reproducibility in single-laboratory phenotyping experiments. Nat. Meth. 14: 462–464.

12 Würbel, H. 2017. More than 3Rs: The importance of scientific validity for harm-benefit analysis of animal research. Lab Anim. 46: 164–166.

Hanno Würbel is a Professor of Animal Welfare at the VPH Institute, Vetsuisse Faculty, University of Bern. He holds grant funding from the European Research Council (ERC Advanced Grant REFINE), EU Horizon 2020 (IMI2 grant EQIPD), and the Swiss Food Safety and Veterinary Office (FSVO).