The opportunities (and challenges) of using shared clinical trial data to better understand drug safety

Joshua D Wallach, Harlan M Krumholz, and Joseph S Ross share the lessons they’ve learnt from using clinical trial data shared through secure platforms for their own research

Over the past decade, sharing clinical trial data has increasingly been seen as an important step towards open, reproducible science. Research sponsors and investigators have shared clinical trial data through initiatives launched by industry, government agencies, and other organizations.1-3 These efforts have provided independent investigators access to de-identified individual patient level data (IPD) and study documentation from thousands of clinical studies for their own research. These data sharing initiatives foster research transparency,3 4 and have generated opportunities for partnership and collaboration. They have also facilitated meta-analyses and secondary analyses to examine population subgroups or previously unreported study endpoints, along with independent validation of previously published research findings.5-7 However, our recent experience has highlighted challenges related to using shared IPD that are worthy of attention.8

In February 2020,8 we published a study in The BMJ in which we used data made available to external researchers by one of these initiatives, (CSDR), including IPD from 34 clinical trials sponsored by GlaxoSmithKline (GSK), along with summary level data from 103 trials that we identified in the medical literature.8 We conducted a series of meta-analyses in pursuit of three aims: to estimate the cardiovascular and mortality risk associated with rosiglitazone; to determine whether different analytical approaches were likely to alter the conclusions of adverse event (AE) meta-analyses; and to inform efforts to promote clinical trial transparency and data sharing.8 Although we concluded that rosiglitazone was associated with increased cardiovascular risk, specifically risk of heart failure, we explained that uncertainties regarding the precision of the safety estimates might not be fully resolved, because we observed different magnitudes of risk across the various data sources. Given that we identified more myocardial infarctions and fewer cardiovascular deaths in the IPD than had previously been reported in summary data sources,9 10 we suggested that IPD might be necessary to accurately classify all AEs when performing meta-analyses focused on safety.

After our study was published, we received questions about the number of AEs we reported for two of the trials included in our analyses (ADOPT, referred to internally as trial 49654/048, and the RECORD study), including whether the number of AEs that we had identified was an undercount, resulting either from our error or from the data we used.11 One of the challenges of working with shared data is that answering such questions is surprisingly difficult.

First, nearly all clinical trial data sharing platforms provide access to IPD through a secure, virtual data server on which researchers can manage data and conduct statistical analyses. When access is granted, the original data files are made available in this environment for a specific period agreed upon in a data use agreement (DUA), such as 12 months. Although all data files remain available in the secure environment for the duration of the DUA, investigators can only save statistical code and summary results to their own computers; the data themselves are never downloaded or stored on a personal computer.

For our study, we analyzed the data shared by GSK for 34 trials over several years through the secure data sharing and analysis environment maintained by CSDR. Around the time we submitted our article for publication, our DUA ended. Complicating matters, at this same time GSK decided to move its shared data off of CSDR to a new platform. Thus, after we submitted our article for publication, but before The BMJ had accepted it, all the IPD and analysis files we had worked with and modified were no longer available to us. Although we could have re-requested access to these data on the new platform, when questions were raised about the number of events reported in the published manuscript, we no longer had access to the originally shared data files and so were unable to determine whether there was an error in our analysis or in the files we received as part of our original data request. Nor could we rerun our original analyses to determine why we did not capture certain fatal AEs in our original assessment. Our experience demonstrates how important it is for investigators using data sharing platforms to retain access to the original files for some reasonable period; otherwise, it is difficult to respond to enquiries that arise after the results are published.

Second, clinical trial data are shared not as a single data file, but as a series of related data files, each of which provides different information: baseline characteristics, information collected at specific follow-up visits (such as efficacy endpoints), AEs, and sometimes even files specific to patient deaths. Even among trials from a single sponsor for a single product, different data standards and terminology can complicate secondary analyses, which may be one reason why different groups estimate different results using the same database. For our study, we used the AE files to generate frequency tables summarizing all of the individual events. AE files are typically categorized using the Medical Dictionary for Regulatory Activities (MedDRA), a subscription-based standardized medical terminology designed to harmonize the sharing of regulatory information for medical product evaluations. MedDRA terms are organized in a five-level hierarchy: 27 System Organ Classes, 337 High-Level Group Terms, 1737 High-Level Terms, 24 289 Preferred Terms, and 81 812 Low-Level Terms, the last of which offer the most granularity.12 Based on the clinical judgment of our study team, we systematically identified all “high-level” and “preferred” MedDRA terms for myocardial infarction and heart failure events, using two different levels of the hierarchy because the 34 clinical trials differed in which terms were used for AE reporting. We then reviewed the AE files for cardiovascular and non-cardiovascular related deaths and, when available, used the death files provided to us to confirm these fatal events.
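The term-matching step described above can be sketched in code. This is a minimal, hypothetical illustration: the field names and the term sets are illustrative assumptions, not the actual GSK data standard or the full term lists the study team used.

```python
# Hypothetical sketch: flagging myocardial infarction (MI) adverse events in a
# shared AE file by matching MedDRA terms at two hierarchy levels, because
# trials differed in which level they used for AE reporting.
# Field names ("preferred_term", "high_level_term", "subject_id") and the
# term sets below are illustrative assumptions only.

MI_PREFERRED_TERMS = {"myocardial infarction", "acute myocardial infarction"}
MI_HIGH_LEVEL_TERMS = {"ischaemic coronary artery disorders"}

def is_mi_event(ae_record):
    """Return True if an AE record matches either MedDRA level."""
    pt = ae_record.get("preferred_term", "").strip().lower()
    hlt = ae_record.get("high_level_term", "").strip().lower()
    return pt in MI_PREFERRED_TERMS or hlt in MI_HIGH_LEVEL_TERMS

# Toy AE file: one record per reported event
ae_file = [
    {"subject_id": "001", "preferred_term": "Myocardial infarction", "high_level_term": ""},
    {"subject_id": "002", "preferred_term": "Headache", "high_level_term": ""},
    {"subject_id": "003", "preferred_term": "", "high_level_term": "Ischaemic coronary artery disorders"},
]

# Count unique patients rather than events, so repeat events are not double counted
mi_patients = {r["subject_id"] for r in ae_file if is_mi_event(r)}
print(len(mi_patients))  # 2
```

Matching on more than one hierarchy level trades precision for sensitivity: it catches trials that coded events only at the high-level term, at the cost of requiring careful, clinically informed curation of both term sets.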

After we received questions about our study, we re-reviewed our fatal adverse event counts and discovered four trials (ADOPT, AVA102670, AVA102672, and AVD100521) for which we had undercounted the total number of deaths. For the ADOPT trial, we believe it is most likely that we did not receive a complete death file as part of the originally shared IPD, which can happen because of the anonymization techniques used to remove personally identifiable information. For the other three trials, no death files were ever available in the IPD, so the AE files had to be used to identify deaths. While deaths can be ascertained from the AE files, reporting is not always consistent across trials: different variables may be used to classify events as “deaths,” as “fatal” AEs, or as “fatal outcomes,” which can require different approaches to identification. Investigators should be aware of these challenges when using AE files to identify deaths. Ideally, death files would be uniformly prepared for every clinical trial, even when there are few deaths. And investigators should compare the number of deaths reported in trial publications or on trial registries (e.g., with the numbers they determine using AE files from shared IPD.
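The ascertainment problem described above can be illustrated with a short sketch. The variable names and value codings here are hypothetical assumptions, not the actual shared data standard; the point is only that a death-detection rule must check every variable a trial might have used.

```python
# Hypothetical sketch: identifying fatal AEs when trials record deaths under
# different variables. Field names ("outcome", "is_fatal") and value codings
# are illustrative assumptions only.

def is_death(ae_record):
    """Check the several variables different trials may use to mark a fatal AE."""
    if str(ae_record.get("outcome", "")).strip().lower() in {"fatal", "death"}:
        return True
    if str(ae_record.get("is_fatal", "")).strip().upper() in {"Y", "YES", "TRUE", "1"}:
        return True
    return False

# Two toy trials using different conventions for the same information
trial_a = [{"subject_id": "A1", "outcome": "Fatal"},
           {"subject_id": "A2", "outcome": "Recovered"}]
trial_b = [{"subject_id": "B1", "is_fatal": "Y"},
           {"subject_id": "B2", "is_fatal": "N"}]

deaths = {r["subject_id"] for r in trial_a + trial_b if is_death(r)}
print(sorted(deaths))  # ['A1', 'B1']
```

A rule that checked only one of these variables would silently undercount deaths in trials using the other convention, which is exactly the failure mode worth cross-checking against publications and registry records.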

Third, there are numerous challenges to determining “which data source is right” when cross checking findings generated using shared IPD. Data from clinical studies may be reported in multiple sources, including trial registries, publications, and clinical study reports (CSRs) or scientific review summaries (SRSs) prepared by trial sponsors. Ideally, the results reported across these sources should be consistent. However, studies comparing these sources have found inconsistencies.13 14 Moreover, publicly available sources, such as publications and trial registrations, are more likely to be incomplete,15 16 and concerns have been raised about the inadequate reporting of AEs in the medical literature.17-19 In our study, for trial AVD100521, the results reported on and in the trial CSR indicated that there were 15 all-cause deaths among patients enrolled in the trial. However, according to the new IPD AE file, 14 unique individuals had events classified as “fatal.” While a difference of one death may not seem to be a problem, the discrepancies can be much greater. For the ADOPT trial, the SRS described 48 fatal serious AEs reported during treatment and up to 30 days after the last dose of study medication. In contrast, the CSR, which redacted all information on causes of death, reported 96 total deaths, 48 of which were noted during the “On-Therapy” study period.

Determining “which data source is right” can be even more challenging for non-death endpoints. There have been numerous evaluations of the RECORD study, using different approaches and data sources, with contrasting estimates of the number of patients who experienced a myocardial infarction, heart failure, or cardiovascular related death.20-23 In fact, this controversy is one reason why our original analyses were conducted both excluding (primary analyses) and including (secondary analyses) the RECORD study. We also encountered difficulties in determining the number of unique myocardial infarction or heart failure events when re-reviewing the CSRs, which did not always differentiate whether individuals experienced multiple event types, such as both a myocardial infarction and heart failure. Our experience highlights the difficulty of classifying and comparing all AEs across different data sources, and suggests that it may be difficult for investigators evaluating and synthesizing evidence to determine which data sources are correct, indicating the need to present results using different approaches.

We are reassured that our overarching conclusions regarding the cardiovascular and mortality risk associated with rosiglitazone remain consistent. However, we had not anticipated the number of important considerations and lessons our study would come to illustrate, which can inform any investigators planning to use clinical trial data shared through secure platforms for their own research. For ongoing and future data requests, the original data files and the statistical code used on data sharing platforms to analyze shared data should remain available to investigators for some reasonable period (e.g. five years after publication). Although IPD should be used to carefully reproduce the counts (or previous findings) reported in CSRs and publications, it may be difficult to establish which findings are correct. In these situations, additional conversations with the data providers can help clarify uncertainties, and sensitivity analyses can be used to determine the consistency of results using different estimates. Lastly, opportunities exist for more consistent data standards, especially among trials from a single sponsor for a single product. Ultimately, clinical trial data sharing initiatives promote clinical trial transparency and foster secondary research, maximizing the potential knowledge that can be learnt from any given study. But they are not without challenges.

Joshua D Wallach, assistant professor, Department of Environmental Health Sciences, Yale School of Public Health. Twitter @JoshuaDWallach

Harlan M Krumholz, professor, Section of Cardiovascular Medicine and the National Clinician Scholars Program, Department of Internal Medicine, Yale School of Medicine; Department of Health Policy and Management, Yale School of Public Health; and Center for Outcomes Research and Evaluation, Yale-New Haven Health System. Twitter @hmkyale

Joseph S Ross, professor, Section of General Medicine and the National Clinician Scholars Program, Department of Internal Medicine, Yale School of Medicine; Department of Health Policy and Management, Yale School of Public Health; and Center for Outcomes Research and Evaluation, Yale-New Haven Health System. Twitter @jsross119

Competing interests: In the past 36 months, JDW received research support through the Collaboration for Research Integrity and Transparency at Yale, funded by the Laura and John Arnold Foundation, the Yale-Mayo Clinic Center for Excellence in Regulatory Science and Innovation (CERSI; U01FD005938), and the National Institutes of Health (NIH)/National Institute on Alcohol Abuse and Alcoholism (K01AA028258). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

HMK received research support through Yale from Johnson & Johnson to develop methods of clinical trial data sharing, from Medtronic and the Food and Drug Administration (FDA) to develop methods for postmarket surveillance of medical devices (U01FD004585), from the Centers of Medicare and Medicaid Services to develop and maintain performance measures that are used for public reporting, received payment from the Arnold & Porter Law Firm for work related to the Sanofi clopidogrel litigation and from the Ben C. Martin Law Firm for work related to the Cook IVC filter litigation, chairs a Cardiac Scientific Advisory Board for UnitedHealth, is a participant/participant representative of the IBM Watson Health Life Sciences Board, is a member of the Advisory Board for Element Science and the Physician Advisory Board for Aetna, and is the founder of Hugo, a personal health information platform.

JSR received research support through Yale from Johnson & Johnson to develop methods of clinical trial data sharing, from Medtronic and the FDA to develop methods for postmarket surveillance of medical devices (U01FD004585), from the Centers of Medicare and Medicaid Services to develop and maintain performance measures that are used for public reporting, from the FDA to establish a CERSI at Yale University and the Mayo Clinic (U01FD005938), from the Blue Cross Blue Shield Association to better understand medical technology evaluation, and from the Agency for Healthcare Research and Quality (R01HS022882).


  1. Ross JS, Waldstreicher J, Bamford S, Berlin JA, Childers K, Desai NR, et al. Overview and experience of the YODA Project with clinical trial data sharing after 5 years. Sci Data 2018;5:180268.
  2. Coady SA, Mensah GA, Wagner EL, Goldfarb ME, Hitchcock DM, Giffen CA. Use of the National Heart, Lung, and Blood Institute Data Repository. N Engl J Med 2017;376(19):1849-58.
  3. National Academies of Sciences, Engineering, and Medicine. 2020. Reflections on Sharing Clinical Trial Data: Challenges and a Way Forward: Proceedings of a Workshop. Washington, DC: The National Academies Press.
  4. Ross JS, Krumholz HM. Ushering in a new era of open science through data sharing: the wall must come down. JAMA 2013;309(13):1355-6.
  5. Angraal S, Ross JS, Dhruva SS, Desai NR, Welsh JW, Krumholz HM. Merits of Data Sharing: The Digitalis Investigation Group Trial. J Am Coll Cardiol 2017;70(14):1825-7.
  6. Jackevicius CA, An J, Ko DT, Ross JS, Angraal S, Wallach JD, et al. Submissions from the SPRINT Data Analysis Challenge on clinical risk prediction: a cross-sectional evaluation. BMJ Open 2019;9(3):e025936.
  7. Navar AM, Pencina MJ, Rymer JA, Louzao DM, Peterson ED. Use of Open Access Platforms for Clinical Trial Data. JAMA 2016;315(12):1283-4.
  8. Wallach JD, Wang K, Zhang AD, Cheng D, Grossetta Nardini HK, Lin H, et al. Updating insights into rosiglitazone and cardiovascular risk through shared data: individual patient and summary level meta-analyses. BMJ 2020;368:l7078.
  9. Nissen SE, Wolski K. Effect of rosiglitazone on the risk of myocardial infarction and death from cardiovascular causes. N Engl J Med 2007;356(24):2457-71.
  10. Nissen SE, Wolski K. Rosiglitazone revisited: an updated meta-analysis of risk for myocardial infarction and cardiovascular mortality. Arch Intern Med 2010;170(14):1191-201.
  11. Marciniak T. Meta-analysis is flawed. BMJ Rapid Response. 2021.
  12. MedDRA. Medical Dictionary for Regulatory Activities.
  13. Becker JE, Ross JS. Reporting discrepancies between the results database and peer-reviewed publications. Ann Intern Med 2014;161(10):760.
  14. Schwartz LM, Woloshin S, Zheng E, Tse T, Zarin DA. and Drugs@FDA: A Comparison of Results Reporting for New Drug Approval Trials. Ann Intern Med 2016;165(6):421-30.
  15. Mayo-Wilson E, Hutfless S, Li T, Gresham G, Fusco N, Ehmsen J, et al. Integrating multiple data sources (MUDS) for meta-analysis to improve patient-centered outcomes research: a protocol for a systematic review. Syst Rev 2015;4:143.
  16. Mayo-Wilson E, Li T, Fusco N, Dickersin K; MUDS investigators. Practical guidance for using multiple data sources in systematic reviews and meta-analyses (with examples from the MUDS study). Res Synth Methods 2018;9(1):2-12.
  17. Ioannidis JP. Adverse events in randomized trials: neglected, restricted, distorted, and silenced. Arch Intern Med 2009;169(19):1737-9.
  18. Ioannidis JP, Lau J. Completeness of safety reporting in randomized trials: an evaluation of 7 medical areas. JAMA 2001;285(4):437-43.
  19. Mayo-Wilson E, Fusco N, Li T, Hong H, Canner JK, Dickersin K, et al. Harms are assessed inconsistently and reported inadequately part 1: systematic adverse events. J Clin Epidemiol 2019;113:20-7.
  20. Home PD, Pocock SJ, Beck-Nielsen H, Gomis R, Hanefeld M, Jones NP, et al. Rosiglitazone evaluated for cardiovascular outcomes–an interim analysis. N Engl J Med 2007;357(1):28-38.
  21. Home PD, Pocock SJ, Beck-Nielsen H, Curtis PS, Gomis R, Hanefeld M, et al. Rosiglitazone evaluated for cardiovascular outcomes in oral agent combination therapy for type 2 diabetes (RECORD): a multicentre, randomised, open-label trial. Lancet 2009;373(9681):2125-35.
  22. Staff Report on GlaxoSmithKline and the Diabetes Drug Avandia. Prepared by the staff of the Committee on Finance, United States Senate, Max Baucus, Chairman, Chuck Grassley, Ranking Member. Accessed January 1, 2021.
  23. Mahaffey KW, Hafley G, Dickerson S, Burns S, Tourt-Uhlig S, White J, et al. Results of a reevaluation of cardiovascular outcomes in the RECORD trial. Am Heart J 2013;166(2):240-9.e1.