
Liz Wager: Show us the data (part 2)

12 Aug, 13 | by BMJ Group

My last blog started with the observation that it’s impossible to investigate research fraud unless you have the raw data. While that may seem obvious, it leads logically on to another, subtly different, point which often seems to be missed: that it’s impossible to spot many types of research fraud unless you have seen the raw data. Some problems, such as plagiarism or blatant image manipulation, can be picked up by keen-eyed reviewers or editors, especially if they use screening tools such as CrossCheck. But fabricated or falsified data usually cannot be spotted from the aggregate data reported in journal articles.

For example, imagine I report that I had studied 100 patients (or rats), given half of them one treatment and half another, and then measured their blood pressure after one and three months. In the publication, these findings would probably be reported as an average with a measure of dispersion, such as the standard deviation (SD). The report might also include the average age and weight (± SD) of the populations and other key characteristics.

But what if, instead of measuring 50 patients (or rats) in each group, I had measured only five? Or, even worse, none at all? This deception almost certainly would not be apparent from the average figures. Patient (or animal) characteristics could easily be adapted convincingly from other publications. Even implausible data distributions are unlikely to be apparent in a single study—remember that Carlisle analysed 169 publications to show that Yoshitaka Fujii’s data were suspect.
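To see how easy this would be, here is a minimal sketch (my own illustration, with invented numbers, assuming Python and NumPy are to hand; nothing here comes from any real study). A genuine arm of 50 measured blood pressures and a lazily fabricated arm of 50 invented values, evenly spaced and all ending in 0 or 5, yield almost identical mean and SD lines; only the raw values would betray the suspicious digit preference.

    # Hypothetical illustration: aggregate statistics cannot distinguish
    # measured data from invented data. All numbers are made up.
    import numpy as np

    rng = np.random.default_rng(1)

    # Genuine arm: 50 systolic blood pressures, roughly N(140, 15)
    genuine = rng.normal(140, 15, size=50)

    # Fabricated arm: 50 values typed in to hit mean ~140 and SD ~15,
    # evenly spaced and rounded to multiples of 5 (classic digit preference)
    fabricated = np.round(np.linspace(115, 165, 50) / 5) * 5

    for label, arm in [("genuine", genuine), ("fabricated", fabricated)]:
        print(f"{label:10s} n={len(arm)}  mean={arm.mean():.1f}  SD={arm.std(ddof=1):.1f}")

    # Both lines read roughly "mean 140, SD 14-15": identical at the level
    # a journal article reports, though the fabricated raw values are absurd.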

Or suppose I had measured 70 in each group (rather than 50) but discarded inconvenient results I regarded as outliers? Or what if I had planned to measure blood pressure at six months, but half the patients had disappeared (or the rats had escaped, or worse, died)? Or what if I thought the six-month data were less impressive than the three-month findings and therefore failed to mention them? None of these problems could possibly be apparent from the aggregate (i.e. analysed) data.
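Again, purely as a hypothetical sketch (the scenario and numbers are mine, assuming NumPy and SciPy are available): the snippet below draws 70 "patients" per arm from the same distribution, so the true treatment effect is zero, then quietly discards the 20 most inconvenient values from each arm before reporting the tidy 50. The trimmed means separate convincingly, and nothing in the published summary table would reveal it.

    # Hypothetical illustration: selectively discarding "outliers"
    # manufactures a treatment effect where none exists.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(7)
    control = rng.normal(140, 15, size=70)
    treated = rng.normal(140, 15, size=70)   # same distribution: no real effect

    # Report only 50 per arm: drop the 20 lowest controls and the
    # 20 highest treated values, calling them "outliers"
    control_kept = np.sort(control)[20:]
    treated_kept = np.sort(treated)[:50]

    t, p = stats.ttest_ind(control_kept, treated_kept)
    print(f"honest means : {control.mean():.1f} vs {treated.mean():.1f}")
    print(f"trimmed means: {control_kept.mean():.1f} vs {treated_kept.mean():.1f}  (p = {p:.2g})")

    # The trimmed comparison looks like a striking treatment benefit,
    # yet it was produced entirely by deleting inconvenient raw values.

Seeing the raw file (all 70 values per arm) would expose the trimming immediately; seeing only the published means and SDs never would.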

So, while more stringent peer review may pick up arithmetical errors, while other initiatives, such as the use of reporting guidelines or checklists, can undoubtedly improve the reporting of research methods (an area ripe for improvement), and while the publication of study protocols or the recently proposed transparency declarations may reduce selective reporting (such as the missing six-month end-point, or the missing rats), it’s unrealistic to expect any of these to detect or prevent deliberate data fabrication.

That’s one reason why The BMJ’s new policy of requiring raw data makes sense. If peer reviewers, editors, and readers can see the raw data, there’s more chance that both fraud and honest errors will be detected. That’s clearly a benefit, the size of which depends on how often fraud and error occur, which, to be honest, we really don’t know at the moment. And the other thing we don’t know yet is how much it costs to format, archive, and curate data, and therefore whether the benefits exceed the costs—but we’re working on this and we won’t know until we’ve tried.

Liz Wager PhD is a freelance medical writer, editor, and trainer. She was the chair of the Committee on Publication Ethics (COPE) from 2009 to 2012.

  • Philip Jones

    I agree that having free access to raw data is one essential component to reducing research fraud. However, not everyone agrees, or is fully committed to publishing de-identified raw data – even advocates of “open data”.

    I recently contacted an author of several very large randomized controlled trials. This person is also an advocate of “open data” and has published the raw data from a couple of large trials on his website. I downloaded the raw data and quickly discovered that the treatment allocation variable was missing, and was only available to those who submit a protocol (presumably approved by the author himself) for “new” research. Although I totally agree that “new” research should have a protocol, I objected to the notions that (a) the researcher would decide by fiat whether or not a particular protocol was “worthy”, and (b) new research was the only acceptable usage of the full raw dataset (since I think there are many other legitimate uses of the full dataset which do not involve new research, such as teaching, reproducing the original analyses as presented in the paper, fraud reduction, etc.). I mentioned this to him and have not received a reply.

    How to deal with these situations? Do the authors “own” the raw data? Is it ethical to withhold vital variables, such as treatment allocation, which prevents users from using the data for non-research purposes, such as fraud detection, pedagogy, etc?

    Disclosure: I try to practice what I preach and have posted all raw data (including treatment allocations) of my most recent RCTs at FigShare. I plan to continue doing this going forward.

  • Elizabeth (Liz) Wager

    You raise an important point (and something I’d like to discuss in a future blog). Sharing data is something new and we haven’t yet worked out the best ways to do it. Inevitably there will be some experiments, and not all of these will succeed, and there may also be some odd compromises as we stumble towards finding the optimum methods, which may not always be the best for everybody. My initial reaction to the situation you describe is that it seems rather pointless to share data without the labels needed to understand it, so I understand your frustration … but I’d be interested to hear the researcher’s reasons for doing it this way. I hope others will contribute to this discussion!

  • Stephen John Senn

    All very reasonable but would the BMJ please tell us for what proportion of the papers they publish the statistician reviewer is provided with the raw data? I suspect that the only people who check raw data as a matter of course are statisticians at the FDA.

  • Professor Karen Woolley

    Hi Liz,

    Great post and responses to the comments received thus far…

    I appreciate that the reasons for data sharing go well beyond detecting fabricated or fraudulent data. I am curious though as to how the investments (human, financial, technological) being made to develop robust systems to share raw data would compare with the investments that could be made to conduct more audits? Would those willing to fabricate summarised data be willing to fabricate raw data? Yes, this would take some effort, but one wonders where the limits are for those intent on undermining the integrity of the literature and threatening patient care. As tedious as some days in the life of a CRA or auditor might be, onsite visits to match raw data with real (and verifiable) data have caught out those trying to do the wrong thing.

    Cheers, Karen
