In January 2012, just before starting my master's training in epidemiology, my supervisor, Brett Thombs, called me to his office to show me a meta-analysis on the accuracy of the Patient Health Questionnaire-9 (PHQ-9) for detecting major depression (Manea et al., CMAJ, 2012). In that meta-analysis, at each possible cutoff, sensitivity and specificity were estimated by analyzing data from all primary studies that reported results for that cutoff. The approach seemed reasonable, but the results seemed strange: sensitivity appeared to improve as the cutoff increased from 8 (less severe symptoms) to 11 (more severe symptoms). We realized this must have been due to incomplete reporting in primary studies, because most meta-analyzed cutoffs included data from fewer than half of the included primary studies. We discussed how, if we could obtain the primary data from each study and combine accuracy results from all cutoffs from all studies, we might be able to disentangle this.
We were not sure this plan was realistic, but I was keen to give it a try. We partnered with Andrea Benedetti, an expert in individual participant data meta-analysis (IPDMA) at McGill, and Brett Thombs brought together some of the world's leaders in depression screening, diagnostic test accuracy, and IPDMA to form a steering committee.
With this team’s backing, I reached out to the authors of each study in the meta-analysis and requested their primary data. We succeeded in obtaining 13 of 16 eligible datasets.
My master's thesis (Levis et al., Am J Epidemiol, 2017) used the 13 datasets to compare results of conventional meta-analysis of accuracy estimates from published cutoffs to results of IPDMA of accuracy estimates from all cutoffs from all studies. This was the first publication on any diagnostic test to describe patterns of selective cutoff reporting and how they predictably influence meta-analytic accuracy estimates at different points on the cutoff spectrum. Results for the recommended cutoff of 10 were reported in almost all studies. Results for other cutoffs, however, were reported in fewer than half of the studies, typically only when a cutoff was close to the study's optimal cutoff. Consequently, sensitivity was underestimated for cutoffs below 10 and overestimated for cutoffs above 10. This highlighted that conventional meta-analyses may produce biased accuracy estimates and underlined the need for larger IPDMAs.
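The mechanism behind this bias can be illustrated with a small simulation. This is a hypothetical sketch, not our actual data or analysis: the score distributions, study sizes, and reporting rule below are all invented for illustration. Each simulated study "reports" a cutoff only if it is near that study's own optimal cutoff, and pooling only the reported results tends to distort sensitivity estimates away from the commonly reported cutoff, compared with pooling complete data from every study:

```python
import random
import statistics

random.seed(1)

def simulate_study(n_cases=30, n_controls=120):
    # Invented score distributions on the 0-27 PHQ-9 scale (illustrative only)
    cases = [min(27, max(0, round(random.gauss(15, 5)))) for _ in range(n_cases)]
    controls = [min(27, max(0, round(random.gauss(6, 4)))) for _ in range(n_controls)]
    return cases, controls

def sens_spec(cases, controls, cutoff):
    # Scores at or above the cutoff screen positive
    sens = sum(s >= cutoff for s in cases) / len(cases)
    spec = sum(s < cutoff for s in controls) / len(controls)
    return sens, spec

studies = [simulate_study() for _ in range(16)]
cutoffs = range(8, 13)

for cutoff in cutoffs:
    # "Complete" pooling: every study contributes at every cutoff (IPDMA-style)
    full = [sens_spec(c, k, cutoff)[0] for c, k in studies]
    # "Selective" pooling: a study reports a cutoff only if it lies within
    # 1 point of the study's optimal cutoff (maximizing Youden's J)
    selective = []
    for cases, controls in studies:
        best = max(range(5, 16), key=lambda t: sum(sens_spec(cases, controls, t)))
        if abs(cutoff - best) <= 1:
            selective.append(sens_spec(cases, controls, cutoff)[0])
    if selective:
        print(f"cutoff {cutoff}: sensitivity from all studies = "
              f"{statistics.mean(full):.2f}, from reporting studies only = "
              f"{statistics.mean(selective):.2f} ({len(selective)}/16 studies)")
```

Because studies that report a given non-standard cutoff are disproportionately those where that cutoff happened to perform well, the selectively pooled estimates diverge from the complete-data estimates, mirroring the pattern we observed in the published literature.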
Based on this project, the team obtained funding from the Canadian Institutes of Health Research to conduct a full IPDMA on PHQ-9 diagnostic accuracy, including an updated search.
Our new research paper is the result of this endeavour. We obtained and synthesized 58 of 72 eligible datasets, for a total of 17,357 participants (2,312 major depression cases). Not only were we able to overcome bias due to selective cutoff reporting, but we were also able to address research questions that could not previously be investigated due to the lack of necessary data.
Our IPDMA dataset provided enough data to compare accuracy estimates across studies that used different types of diagnostic interviews as the reference standard. We found that previous meta-analyses may have underestimated accuracy by combining differently performing reference standards. In addition, although some reports based on small datasets have argued for population-specific cutoffs and accuracy results, our IPDMA, the first study to compare subgroups with large participant samples, found no substantive differences across participant subgroups.
This study provides more robust estimates of diagnostic accuracy than have previously been reported and confirms that the standard cutoff of 10 maximizes combined sensitivity and specificity overall and for subgroups. To facilitate understanding for clinicians considering use of the PHQ-9 to screen for depression, we have developed a web tool which can be found at depressionscreening100.com/phq.
Competing interests statement: None declared