By Malcolm Macleod (@Maclomaclee)
As researchers, we hope that our research findings are useful – that they inform future research, or lead to changes in policy or practice. Different research designs provide different strengths of evidence, with experimental studies generally providing stronger evidence than observational studies.
Even within a given research design, there are factors which might give one greater or lesser confidence in the reported findings. For animal research, I look for factors such as: was the study randomised? Were the conduct of the experiment and the assessment of outcome performed without knowledge of experimental group allocation? Did the authors set out their hypothesis, their primary outcome measure and their statistical analysis plan in advance, somewhere I can check? Or might the hypothesis have followed data collection, the outcome measure of interest have changed once the data were in, or the statistical test presented be the product of multiple approaches in search of significance?
In the past we have assumed that if a paper was published following peer review, we could have confidence in the reported findings. Unfortunately, across diverse fields, systematic reviews find that literatures are at risk of bias – of overstating effects – by virtue of aspects of study design. A further problem is that findings which are novel or surprising are more likely to be published but, given the same statistical power and p-value threshold, are less likely to be true.
All of this means that we cannot assume that publication, even (perhaps especially) in high impact factor journals, means that the reported findings are secure. As research users, we need to be able to do our own “due diligence” on what has been reported, asking ourselves whether the evidence is good enough for us to use and move forward, or whether we need more data.
And here, we have a problem. Until recently, very few publications describing animal research contained the information needed for that due diligence. It’s not that papers said they didn’t randomise; it’s that they didn’t mention randomisation at all. We looked at over 1,000 papers describing animal research from leading UK institutions, and 68% of them did not report even one of four key elements of study design (the Landis criteria). Others have shown that, even in journals endorsing the ARRIVE checklist, reporting was poor.
These concerns led to two studies from our group with different designs: an observational study of the impact of a change in editorial policy at Nature Publishing Group (NPQIP); and a randomised controlled study of the impact of requested completion of the ARRIVE checklist at submission to PLoS One (IICARUS).
The findings of these studies show, in essence, that with considerable investment of editorial time, reporting in NPG journals improved to a much greater extent than it did in non-NPG journals over the same period. In contrast, at PLoS One, requested completion of an ARRIVE checklist, without any other editorial intervention, had little if any effect.
Given the sheer volume of submissions to journals such as PLoS One, it is inconceivable that the degree of editorial input enjoyed in lower volume titles could be achieved. Indeed, it may be too much to sustain such efforts even in lower volume titles.
We therefore need to consider other approaches to improving the completeness and transparency of reporting. In doing so we should also learn the other lessons from IICARUS and NPQIP. Both studies assessed outcomes using reviewers trained to a performance threshold and supported by extensive explanatory materials.
Of course, it may be that the training was not very good, or that the assessors were not very good (I don’t think it’s either of these, but I’m biased!); or it could be that the checklist criteria were not articulated with sufficient clarity, or that the underlying constructs being assessed were not widely or easily understood.
Two initiatives should help improve reporting. Firstly, the NC3Rs are facilitating an update to the ARRIVE checklist for in vivo research, drawing on experience to date and a larger core team, with built-in consultation and road testing.
Secondly, a group of publishers and others is developing a minimum standards framework for reports of biomedical research, drawing on the TOP guidelines. The intention is to present a tiered framework, so that journals can target a level of performance which would represent an improvement on what they currently achieve but which would also be feasible and within their reach. The framework will be organised in four domains: Materials, Design, Analysis and Reporting (MDAR). A recent series of blogs (see the eLife and Nature blogs) discussed the approach and invited input from the community. Again, there will be appropriate testing before the framework is finalised, along with the development of supporting materials, including self-assessment tools and guidance. While ARRIVE 2 deals exclusively with in vivo research and MDAR has a broader canvas, both teams are keen to ensure that the language used will, as far as possible, be aligned.
Improving the usefulness of research is an important objective. I believe everyone – journals, funders, institutions and scientists – has an important role to play. The fact that change may be only incremental should not diminish the scale of the challenge or the urgency with which we seek improvement. But we should not allow that urgency to persuade us to jump to possible solutions without careful evaluation of whether what we propose actually works.
Malcolm Macleod is a Professor of Neurology and Translational Neuroscience at the University of Edinburgh. He led the NPQIP study, was involved with IICARUS, and is a member of the groups revising the ARRIVE guidelines and developing the MDAR checklist.