By Stanley E. Lazic (@StanLazic)
Replication is a key idea in science and statistics, but is often misunderstood by researchers because they receive little education or training on experimental design. Consequently, the wrong entity is replicated in many experiments, leading to pseudoreplication or the “unit of analysis” problem [1,2]. This results in exaggerated sample sizes and a potential increase in both false positives and false negatives – the worst of all possible worlds.
Replication can mean many things
Replication is not always easy to understand because many parts of an experiment can be replicated. A non-exhaustive list includes:
1. Replicating the measurements taken on a set of samples. Examples include taking two blood pressure readings on each person or dividing a blood sample into two aliquots and measuring the concentration of a substance in each aliquot.
2. Replicating the application of a treatment or intervention to a biological entity of interest. This is the traditional way of increasing the sample size, by increasing the number of treatment–entity pairs; for example, the number of times a drug or vehicle control is randomly and independently applied to a set of rats.
3. Replicating the experimental procedure or protocol under identical conditions. Cell culture experiments are often repeated on different days, where on each day a complete mini-experiment is done, and the aim is to keep the experimental conditions constant on each occasion.
4. Replicating the experimental procedure under different conditions. The procedure is repeated several times, but a known source of variation is present on each occasion. An example is a multi-centre clinical trial where differences between centres may exist. Another example is a large animal experiment that is broken down into two smaller experiments to make it manageable, with each smaller experiment run by a different technician.
5. Replicating the experiment by independent researchers. The whole experiment is repeated by researchers who were not part of the initial experiment. This occurs when a paper is published and others try to obtain the same results.
To add to the confusion, terms with related meanings exist, such as repeatability, reproducibility, and replicability. Furthermore, the reasons for having or increasing replication are diverse. They include a need to increase statistical power, a desire to make the results more generalisable, or a practical constraint, such as needing multiple centres because too few patients can be recruited in one centre.
Requirements for genuine replication
How do you design an experiment to have genuine replication and not pseudoreplication? First, ensure that replication is at the level of the biological question or scientific hypothesis. For example, to test the effectiveness of a drug in rats, give the drug to multiple rats, and compare the result with other rats that received a control treatment (corresponding to example 2 above). Multiple measurements on each rat (example 1 above) do not count towards genuine replication.
To test if a drug kills proliferating cells in a well compared to a control condition, you will need multiple drug and control wells, since the drug is applied on a per-well basis. But you may worry that the results from a single experimental run will not generalise – even if you can perform a valid statistical test – because results from in vitro experiments can be highly variable. You could then repeat the experiment four times (corresponding to example 3 above), and the sample size is now four, not the total number of wells that were used across all of the experimental runs. This second option requires more work, will take longer, and will usually have lower power, but it provides a more robust result because the experimenter’s ability to reproduce the treatment effect across multiple experimental runs has been replicated.
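As a rough sketch of the run-level analysis described above, the Python snippet below collapses each experimental run to a single drug-minus-control difference and computes a one-sample t statistic on those differences. The readings, the `runs` dictionary, and the function name `run_level_t` are all hypothetical and chosen only for illustration; a paired or multi-level analysis may suit a real experiment better.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical data: four experimental runs (days), each with several
# drug wells and control wells. Values are made-up viability readings.
runs = {
    "day1": {"drug": [0.62, 0.58, 0.60], "control": [0.80, 0.82, 0.78]},
    "day2": {"drug": [0.70, 0.66, 0.68], "control": [0.85, 0.83, 0.87]},
    "day3": {"drug": [0.55, 0.57, 0.53], "control": [0.79, 0.81, 0.80]},
    "day4": {"drug": [0.65, 0.63, 0.67], "control": [0.84, 0.86, 0.82]},
}


def run_level_t(runs):
    """Return (n, mean difference, t statistic) using one value per run."""
    # One drug-minus-control difference per run: the run is the replicate.
    diffs = [mean(r["drug"]) - mean(r["control"]) for r in runs.values()]
    n = len(diffs)
    se = stdev(diffs) / sqrt(n)  # standard error across runs
    return n, mean(diffs), mean(diffs) / se


n, effect, t = run_level_t(runs)
total_wells = sum(len(r["drug"]) + len(r["control"]) for r in runs.values())
print(f"sample size = {n} runs (not {total_wells} wells), df = {n - 1}")
print(f"mean difference = {effect:.3f}, t = {t:.2f}")
```

The t statistic is referred to a t distribution with n - 1 = 3 degrees of freedom, which is why this design usually has lower power than an analysis of wells within a single run.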
To test if pre-registered studies report different effect sizes from traditional studies that are not pre-registered, you will need multiple studies of both types (corresponding to example 5 above). The number of subjects in each of these studies is irrelevant for testing this study-level hypothesis.
Replication at the level of the question or hypothesis is a necessary but not sufficient condition for genuine replication – three criteria must be satisfied [1,3]:
- For experiments, the biological entities of interest must be randomly and independently assigned to treatment groups. If this criterion holds, the biological entities are also called the experimental units [1,3].
- The treatment(s) should be applied independently to each experimental unit. Injecting animals with a drug is an independent application of a treatment, whereas putting the drug in the drinking water shared by all animals in a cage is not.
- The experimental units should not influence each other, especially on the measured outcome variables. This criterion is often impossible to verify – how do you prove that the aggressive behaviour of one rat in a cage is not influencing the behaviour of the other rats?
It follows that cells in a well or neurons in a brain or slice culture can rarely be considered genuine replicates because the above criteria are unlikely to be met, whereas fish in a tank, rats in a cage, or pigs in a pen could be genuine replicates in some cases but not in others. If the criteria are not met, the solution is to replicate one level up in the biological or technical hierarchy. For example, if you’re interested in the effect of a drug on cells in an in vitro experiment, but cannot use cells as genuine replicates, then the wells can be the replicates. The measurements on cells within a well can be averaged so that the number of data points equals the number of wells, that is, the sample size. Hierarchical or multi-level models can also be used and don’t require values to be averaged because they take the structure of the data into account, but they are harder to implement and interpret than averaging followed by simpler statistical methods. Similarly, if rats in a cage cannot be considered genuine replicates, then calculating a cage-averaged value and using cages as genuine replicates is an appropriate solution, as is a multi-level model.
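The averaging step described above can be sketched in a few lines of Python. The measurements, well labels, and the helper name `average_within_wells` are hypothetical; the point is only that the analysis proceeds on one value per well, so the sample size in each group is the number of wells.

```python
from statistics import mean

# Hypothetical per-cell measurements, grouped by well. Each well received
# either drug or control, so the well (not the cell) is the replicate.
cells = [
    # (well_id, treatment, measurement)
    ("w1", "drug", 1.0), ("w1", "drug", 1.4), ("w1", "drug", 1.2),
    ("w2", "drug", 0.9), ("w2", "drug", 1.1),
    ("w3", "control", 2.0), ("w3", "control", 2.2), ("w3", "control", 2.4),
    ("w4", "control", 1.8), ("w4", "control", 2.0),
]


def average_within_wells(cells):
    """Collapse cell-level values to one mean per (well, treatment)."""
    by_well = {}
    for well, treatment, value in cells:
        by_well.setdefault((well, treatment), []).append(value)
    return {key: mean(values) for key, values in by_well.items()}


well_means = average_within_wells(cells)

# Count wells per treatment group: this is the sample size for the test.
n_per_group = {}
for _, treatment in well_means:
    n_per_group[treatment] = n_per_group.get(treatment, 0) + 1

print(f"{len(cells)} cell measurements -> {len(well_means)} well means")
print(f"sample size per group: {n_per_group}")
```

The well means, not the ten cell-level values, would then be passed to a t-test or similar method.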
If genuine replication is too low, the experiment may be unable to answer any scientific questions of interest. Therefore, issues about replication must be resolved when designing an experiment, not after the data have been collected. For example, if cages are the genuine replicates and not the rats, then putting fewer rats in a cage and having more cages will increase power. Power is maximised with one rat per cage, but this may be undesirable for other reasons.
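To illustrate why more cages with fewer rats each increases power, the sketch below computes the variance of a treatment-group mean for a fixed number of rats split across cages in different ways. It assumes a simple balanced variance-components model (a random cage effect plus a random rat effect), which is an assumption made for this example rather than something specified in the text. The function name `variance_of_group_mean` and the variance values are hypothetical.

```python
def variance_of_group_mean(n_rats, rats_per_cage, var_cage, var_rat):
    """Variance of a treatment-group mean when cages are the replicates.

    Assumes a balanced design and a simple model in which each outcome is
    a group mean plus a random cage effect plus a random rat effect.
    """
    if n_rats % rats_per_cage != 0:
        raise ValueError("n_rats must be divisible by rats_per_cage")
    n_cages = n_rats // rats_per_cage
    # Variance of one cage mean, divided by the number of cages.
    return (var_cage + var_rat / rats_per_cage) / n_cages


# Hypothetical numbers: 12 rats per group, equal variance components.
for m in (6, 4, 3, 2, 1):
    v = variance_of_group_mean(12, m, var_cage=1.0, var_rat=1.0)
    print(f"{m} rats/cage, {12 // m} cages: variance of mean = {v:.3f}")
```

Under this model the variance falls as the same 12 rats are spread over more cages, and it is smallest with one rat per cage. A smaller variance of the group mean corresponds to higher power for a given effect size.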
Mistaking pseudoreplication for genuine replication reduces our ability to learn from experiments, understand nature, and develop treatments for diseases. It is also easily fixed. The requirements for genuine replication, like the definition of a p-value, are often misunderstood by researchers, despite many papers on the topic. An open-access overview is provided in reference [1], and reference [3] has a detailed discussion along with analysis options for many experimental designs.
[1] Lazic SE, Clarke-Williams CJ, Munafo MR (2018). What exactly is “N” in cell culture and animal experiments? PLoS Biol 16(4):e2005282. https://doi.org/10.1371/journal.pbio.2005282
[2] Lazic SE (2010). The problem of pseudoreplication in neuroscientific studies: is it affecting your analysis? BMC Neuroscience 11:5. https://doi.org/10.1186/1471-2202-11-5
[3] Lazic SE (2016). Experimental Design for Laboratory Biologists: Maximising Information and Improving Reproducibility. Cambridge University Press, Cambridge, UK. https://www.cambridge.org/Lazic
Stanley E. Lazic is Co-founder and Chief Scientific Officer at Prioris.ai Inc.
Suite 459, 207 Bank Street,
Ottawa ON, K2P 2N2,