Assessing bias in studies of harms: a case study of Primodos and congenital malformations


In a recent systematic review, we assessed the use of Primodos, an oral hormone pregnancy test (HPT) marketed between 1958 and 1978, and the associated risk of congenital malformations. This post discusses how to assess study quality when evaluating associations with harms.

Carl Heneghan

We found that the use of oral HPTs in pregnancy was associated with an increased chance of all congenital malformations; congenital heart, nervous system, and musculoskeletal malformations; and congenital VACTERL syndrome.

To obtain these results we performed a systematic review and meta-analysis of case-control and cohort studies that included pregnant women exposed to oral HPTs within the first three months of pregnancy.

We assessed quality using the Newcastle-Ottawa Scale (NOS) for non-randomized studies. The NOS assesses quality and risk of bias in observational studies, and it has been validated for case-control and longitudinal studies. Whether it was appropriate to use such a method is questionable, because scoring systems to assess quality have previously been criticized.

We chose the NOS as it has been used widely. In our systematic review, we cite five instances, covering a wide range of exposures and outcomes (second malignancies after radiotherapy for prostate cancer, pregnancy, thrombophilia, the risk of a first venous thrombosis, and the relation between low cigarette consumption and the risk of coronary heart disease and stroke).

The use of scales or scores in assessing bias fell out of favour, largely because of the Cochrane Handbook. Cochrane states that it is preferable to use simple approaches for assessing validity that can be fully reported, and it explicitly discourages scores because they place a strong emphasis on the reporting of the research rather than its conduct. Cochrane further dismisses scales because the approach is not supported by empirical evidence, because scales are unreliable (Jüni 1999), and because it is often difficult to justify the weights assigned to different items.

Jüni et al assessed three key domains (concealment of treatment allocation, blinding of outcome assessment, and handling of withdrawals) in a meta-analysis of 17 comparisons of low-molecular-weight heparin with standard heparin for prevention of postoperative thrombosis. They used 25 different scales to identify high-quality trials and found that the scales all differed in their assessments of quality.

Jüni et al state that relevant methodological biases should be assessed individually and their influence on effect sizes explored. However, this is problematic because, at some point, someone has to assign weights subjectively to determine the influence of each bias. They say this because many biases are not supported by empirical evidence, which is true; see the Catalogue of Bias.

Part of the problem with early scales was that they related to the reporting quality, ethics, or interpretation of the results, rather than the internal validity of the trial. Some scales, for example, asked whether the rationale for conducting the study was clearly stated, whether the trialists’ conclusions were compatible with the results, or whether participants had provided written informed consent. Jüni et al also criticised one scale that was widely used at the time, the Jadad scale, because it gives more weight to the quality of reporting than to actual methodological quality.

Reporting bias, however, is a significant problem in assessing quality. In our assessment of the risk of bias of industry-funded oseltamivir trials, the use of more detailed information, included in clinical study reports, showed that over half (55%, 34/62) of the previous assessments of ‘low’ risk of bias were reclassified as ‘high’.

Strengths and Limitations of the Newcastle-Ottawa Scale

The NOS has been recommended by the Cochrane Collaboration. One weakness of the scale, though, is the possibility of low agreement between assessors.

Hartling et al showed low agreement between two independent reviewers in scoring the NOS. This was particularly the case when authors had limited experience in doing systematic reviews. Training, even of novices, can improve agreement, and test-retest reliability of the NOS has been shown to be fair to excellent. The developers have also examined the scale’s face validity and criterion validity, inter-rater reliability, and evaluator burden.
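
Agreement between two reviewers is commonly summarised with Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below is a minimal illustration of that calculation and is not the analysis used by Hartling et al; the ratings are hypothetical and stand for whether each of ten studies earned the star for a single NOS item.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who scored the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Proportion of items on which the two raters gave the same rating.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's marginal frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1:
        return 1.0  # both raters used one identical category throughout
    return (observed - expected) / (1 - expected)


# Hypothetical ratings (1 = star awarded, 0 = not awarded); not data from any study.
reviewer_1 = [1, 1, 0, 1, 0, 1, 1, 0, 1, 0]
reviewer_2 = [1, 0, 0, 1, 1, 1, 1, 0, 0, 0]
kappa = cohen_kappa(reviewer_1, reviewer_2)
print(f"kappa = {kappa:.2f}")
```

Here the reviewers agree on 7 of 10 studies, but half of that agreement would be expected by chance, so kappa is well below the raw 70% figure.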

In terms of the individual elements, the NOS evaluates the selection of study groups, their comparability, and ascertainment of either the exposure for case-control studies or the outcome of interest for cohort studies.

When the Journal of Evidence-Based Medicine assessed methodological quality assessment tools for systematic review and meta‐analysis, it recommended using the NOS for cohort and case‐control studies. The US Preventive Services Task Force also uses the scale (Annals of Internal Medicine), and the NOS has recently been used in an IPD meta-analysis published in PLOS.

Applying the NOS to our systematic review

In our systematic review of oral hormone pregnancy tests and the risks of congenital malformations, confounding factors are covered in detail in items 3, 4, 5a, and 5b of the NOS:

  • Item 3 Selection of controls adequate.
  • Item 4 Definition of controls adequate.
  • Item 5 Comparability of cases and controls on the basis of the design or analysis.
  • 5a) Study controls for the most important factor.
  • 5b) Study controls for important additional factors.

Item 5 of the NOS score is particularly important, as it addresses comparability of cases and controls based on design or analysis. Of the 16 case-control studies in our systematic review, 12 controlled for the most important factor (item 5a) and nine controlled for important additional factors (item 5b). Of the ten cohort studies, six controlled for item 5a and four controlled for item 5b. The mean NOS score was 6.1, indicating an overall moderate risk of bias. Table 2 in the systematic review also shows that seven studies did not report the confounding variables collected.
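
The counts above are easier to compare as proportions. The short sketch below converts them to percentages by study design; the counts come from the paragraph above, while the dictionary layout and names are only illustrative.

```python
# (studies meeting the item, total studies), as reported in the text above.
controlled = {
    ("case-control", "5a"): (12, 16),
    ("case-control", "5b"): (9, 16),
    ("cohort", "5a"): (6, 10),
    ("cohort", "5b"): (4, 10),
}


def percent(met, total):
    """Percentage of studies that met an item."""
    return 100 * met / total


summary = {key: percent(*counts) for key, counts in controlled.items()}
for (design, item), pct in summary.items():
    print(f"{design:12s} item {item}: {pct:.0f}%")
```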

Ascertainment of exposure and outcomes, as identified by the reviewer, is also captured in items 6, 7, and 8:

  • Item 6. Ascertainment of exposure adequate.
  • Item 7. The same method of ascertainment for cases and controls.
  • Item 8. Non-response rate adequate.
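
As a rough sketch of how these items combine, the case-control version of the NOS awards up to four stars for selection (items 1 to 4), two for comparability (items 5a and 5b), and three for exposure (items 6 to 8), giving a maximum of nine. The item labels, function name, and example study below are hypothetical and are not taken from the review.

```python
# One star per item in the case-control version of the NOS (nine in total).
NOS_ITEMS = {
    "selection": ["1", "2", "3", "4"],
    "comparability": ["5a", "5b"],
    "exposure": ["6", "7", "8"],
}


def nos_score(stars):
    """Return per-domain and total star counts for one study.

    `stars` maps an item label to True when the study earned that star;
    items that are missing are counted as not earned.
    """
    by_domain = {
        domain: sum(bool(stars.get(item, False)) for item in items)
        for domain, items in NOS_ITEMS.items()
    }
    by_domain["total"] = sum(by_domain.values())
    return by_domain


# Hypothetical study, not one of the studies in the review.
example = {"1": True, "2": True, "3": False, "4": True,
           "5a": True, "5b": False, "6": True, "7": True, "8": False}
result = nos_score(example)
print(result)
```

Keeping the per-domain counts alongside the total makes it possible to see where a study lost stars, rather than relying on a single summary number.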

What scale should we use?

Most other tools for assessing harms focus on reporting.

The Cochrane Methods group recommends ROBINS-I as the preferred tool for assessing non-randomized studies of interventions in Cochrane Reviews. However, it is not mandatory, and they further state that the NOS is an alternative option.

In their full statement, they report that the currently published ROBINS-I is designed for cohort-like designs and that, although it may be applicable to case-control studies, further developments to the signalling questions for this design are underway but have not yet been published. New guidance to specify the competence level is also in development, along with separate software for integration with GRADE; this is not yet available. Until all of this is achieved and tested, we followed the Cochrane recommendation that it is ‘appropriate to use another tool, such as the currently recommended Newcastle Ottawa Scale.’

A review of quality assessment tools for the evaluation of pharmacoepidemiological safety studies examined 61 tools. Most were not designed to evaluate pharmacoepidemiological safety studies, and no specific tool was found that ‘is adequately designed for the robust evaluation of pharmacoepidemiological studies of drug safety.’

The Journal of Evidence-Based Medicine also recommended the Methodological Index for Non-Randomized Studies (MINORS) as an excellent tool for assessing non-randomized interventional studies in surgery.

What study design is appropriate for assessing harms?

Our systematic review included case-control and cohort studies with data from pregnant women exposed to oral HPTs within the estimated first three months of pregnancy. The CEBM levels of evidence place systematic reviews of randomized trials, systematic reviews of nested case-control studies, n-of-1 trials with the patient you are raising the question about, and observational studies with dramatic effects as the highest level of evidence for determining an association with harms.

Establishing causal associations in the absence of randomization can be difficult. However, there are situations where randomization is not feasible or ethical. In the case of Primodos, there were already concerns over harms, making it unjustifiable to randomise individuals to such a potentially harmful exposure, particularly when there are no expected therapeutic benefits: Primodos is a test.

Furthermore, as a test, Primodos does not raise the four concerns about observational evidence, set out by Dave Sackett, that require randomised trials to negate them. These concerns are:

  1. Clinicians Might Preferentially Give New Treatments to Patients with Better Prognoses
  2. Compliant Patients Might Have Better Prognoses, Regardless of Their Treatment
  3. Patients Who Liked Their Rx Might Report Better Outcomes Unrelated to the True Efficacy of Their Treatments
  4. Clinicians Who Liked Their Rx Might Report Spuriously Better Outcomes Among Patients Who Received Them

It is important to recognise that observational studies can demonstrate associations of harms. As an example, the association of maternal stilbestrol therapy with vaginal adenocarcinoma was established from a small case-control study in which 7 of 8 mothers with carcinoma were treated with diethylstilbestrol during the first trimester. None were treated in the control group.
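
To see why such a small study can be persuasive, the sketch below computes a one-sided Fisher exact p-value for a 2×2 table using only the standard library. This is an illustration rather than the analysis in the original study. The post gives 7 of 8 exposed cases and no exposed controls but does not state the size of the control group; the figure of 32 controls (four per case) used here is an assumption for illustration.

```python
from math import comb


def fisher_one_sided(exposed_cases, cases, exposed_controls, controls):
    """One-sided Fisher exact p-value for a 2x2 case-control table.

    Probability, with the table margins fixed, of seeing at least
    `exposed_cases` exposed individuals among the cases.
    """
    total = cases + controls
    exposed = exposed_cases + exposed_controls
    upper = min(cases, exposed)
    favourable = sum(
        comb(cases, k) * comb(controls, exposed - k)
        for k in range(exposed_cases, upper + 1)
    )
    return favourable / comb(total, exposed)


# 7 of 8 cases exposed (from the text); 0 of 32 controls exposed,
# where 32 is an ASSUMED control-group size, not a figure from this post.
p_value = fisher_one_sided(exposed_cases=7, cases=8, exposed_controls=0, controls=32)
print(f"one-sided p = {p_value:.2e}")
```

With no exposed controls the sample odds ratio is undefined (division by zero), which is why an exact test is used here instead of an odds ratio with a confidence interval.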

Concern about observational studies in assessing harms is due to the introduction of bias from uncontrolled confounding. Such confounding by indication, set out in the Catalogue of Bias, is a distortion that modifies an association between an exposure and an outcome. It is caused by the presence of an indication for the exposure that is the real cause of the outcome.

Even after careful matching and adjustment for known risk factors, residual confounding may persist. The paper Assessment and control for confounding by indication in observational studies highlights that the impact of confounding depends on the prevalence of the confounder and on the strength of its association with both the disease and the exposure. “A confounding factor with a prevalence of 20% would have to increase the relative odds of both outcome and exposure by factors of 4 to 5 before the relative risk of 1.57 would be reduced to 1.00.”
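
The arithmetic behind a statement like this can be sketched with a standard external-adjustment formula for a single binary confounder. The code below is an illustration under stated assumptions, not a reproduction of the cited calculation: it treats the 20% as the confounder's prevalence among the unexposed, treats the confounder's effect on the outcome as a risk ratio, and assumes no effect modification. The function names are hypothetical.

```python
def exposed_prevalence(p_unexposed, exposure_or):
    """Confounder prevalence among the exposed, from its prevalence among
    the unexposed and the confounder-exposure odds ratio."""
    odds = exposure_or * p_unexposed / (1 - p_unexposed)
    return odds / (1 + odds)


def bias_factor(p_unexposed, exposure_or, outcome_rr):
    """Factor by which a single binary confounder inflates the observed
    relative risk when the true relative risk is 1 (ASSUMED simple model)."""
    p_exposed = exposed_prevalence(p_unexposed, exposure_or)
    return (p_exposed * (outcome_rr - 1) + 1) / (p_unexposed * (outcome_rr - 1) + 1)


observed_rr = 1.57  # relative risk quoted in the text above
for strength in (2, 3, 4, 5):
    # The same strength is applied to both confounder associations.
    adjusted = observed_rr / bias_factor(0.20, strength, strength)
    print(f"confounder strength {strength}: adjusted RR = {adjusted:.2f}")
```

Under these assumptions, a confounder that multiplies both the odds of exposure and the risk of the outcome by 4 inflates the relative risk by a factor of about 1.56, so the adjusted relative risk reaches 1.00 only when the strength lies between 4 and 5, in line with the quoted statement.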


Establishing a clear association between the use of oral HPTs in pregnancy and increased risks of congenital malformations requires consideration of a number of factors beyond quality assessment alone.

Using a systematic review of case-control and cohort studies to answer the question of whether Primodos is associated with harms is appropriate: there were ethical issues over the exposure, no therapeutic benefits, and no prognostic implications of the exposure, meaning randomization was not appropriate. The use of scoring systems has weaknesses; therefore, it is important to report the individual biases and consider their impact on the effect. Many pooled analyses had zero heterogeneity, and the direction of effect favoured the controls in 30 of the 32 analyses undertaken. This consistency further strengthens the association.
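
One crude way to gauge how unusual 30 of 32 analyses pointing in the same direction would be is a sign test, which asks how often at least that many would agree if each direction were equally likely. The sketch below is illustrative only and was not part of the review: it assumes the 32 analyses are independent, which they are not, since many pool overlapping sets of studies, so the result overstates the strength of the evidence.

```python
from math import comb


def sign_test_one_sided(successes, n):
    """Probability of at least `successes` out of `n` results in one
    direction when each direction has probability 0.5."""
    return sum(comb(n, k) for k in range(successes, n + 1)) / 2 ** n


# 30 of 32 analyses favoured the controls (from the text above).
# ASSUMPTION: analyses are treated as independent, which is not true here.
p_sign = sign_test_one_sided(30, 32)
print(f"one-sided sign-test p = {p_sign:.2e}")
```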

Carl Heneghan
Editor in Chief BMJ EBM,
Professor of EBM, University of Oxford

Heneghan C, Aronson JK, Spencer E et al. Oral hormone pregnancy tests and the risks of congenital malformations: a systematic review and meta-analysis [version 2; referees: 3 approved]. F1000Research 2019, 7:1725

Competing interests

CH is an advisor to the All-Party Parliamentary Group on Hormone Pregnancy Tests, and has presented the results of the systematic review at the UK Houses of Parliament and to the Independent Medicines and Medical Devices Safety Review. CH has received expenses and fees for his media work, including BBC Inside Health. He holds grant funding from the NIHR, the NIHR School of Primary Care Research, the NIHR Oxford BRC, and the WHO. CEBM jointly runs the EvidenceLive Conference with the BMJ and the Overdiagnosis Conference with some international partners, both of which are based on a non-profit model.