As AI technology advances rapidly, can it fill a burgeoning gap in clinical care? A shortage of clinicians has left patients languishing on waiting lists and filling waiting rooms. Can AI replace clinicians to meet this need? Recent claims in the media suggest it can. Microsoft announced that its AI Diagnostic Orchestrator (MAI-DxO) achieved 85.5% diagnostic accuracy on 304 New England Journal of Medicine (NEJM) clinical vignettes, compared with 20% accuracy among 21 experienced physicians.¹ This non-peer-reviewed research underpins bold claims that the technology is better than human doctors.
While the results appear striking, they emerge from artificial test conditions that strip away the complexity of real-world clinical practice. For example, the study restricted physicians’ access to textbooks, imaging resources, colleague collaboration and real-time feedback: the core resources that define contemporary medical practice.² The sample of 21 generalist physicians is also too small to support broad conclusions about human diagnostic capability. In addition, the study used the most diagnostically challenging and unusual presentations from the NEJM Case Records, selected precisely because they are atypical and difficult, not because they represent routine clinical practice.¹ These retrospective case studies with known outcomes bear little resemblance to the uncertainty and incomplete information that characterise real clinical encounters.
Importantly, Microsoft’s findings have not yet undergone peer review.³ Peer-reviewed research, by contrast, offers a more nuanced picture of AI performance across medical specialties. A 2019 meta-analysis found AI systems on a par with clinicians and superior to less experienced practitioners, but unable to consistently surpass experts.⁴ A BMJ systematic review of AI in breast cancer screening showed AI reached diagnoses faster than consultants (median 2.8 vs 8.5 minutes), but human oversight remained essential.⁵ Meta-analyses of generative AI in diagnosis show similar results: a 2024 review reported pooled accuracy of around 57% for AI models, with clinicians typically outperforming them in practical settings.⁶ A 2025 analysis of 83 studies found generative AI achieved 52.1% accuracy, on a par with non-experts but significantly below expert physicians.⁷ Evidence consistently demonstrates that hybrid models outperform either human or machine working alone.⁸
The development of multi-agent systems, such as MAI-DxO¹ and other published models,⁸ ⁹ nonetheless demonstrates the pace of progress and the potential of medical AI. These collaborative AI ecosystems, in which multiple AI agents simulate team-based clinical reasoning through an iterative process, outperform single-model LLM systems and provide more interpretable, structured outputs.
Another publicised benefit of MAI-DxO is its economic potential. Microsoft claims the system reduces testing costs by approximately 20%, potentially addressing the estimated 25% of healthcare spending that provides little value to patients.¹⁰ Such tools could democratise access to high-quality diagnostic reasoning, particularly in underserved regions lacking specialist expertise. Again, these claims remain preliminary and unverified in clinical environments, underscoring the need for independent evaluation and regulatory oversight.
Despite these advances, human-in-the-loop systems remain imperative, and early frameworks show that interpretable AI systems improve decision-making quality and clinician trust⁹ while maintaining the semantic qualities of the doctor-patient encounter. Microsoft acknowledges this reality, stating that its technology represents “a complement to doctors and other health professionals” and emphasising that doctors’ ability “to navigate ambiguity and build trust with patients and their families” cannot be replicated by AI. It also acknowledges that this is an early-stage innovation and that real-world clinical validation, strong regulatory oversight and partnerships with healthcare organisations are essential before broader deployment.¹
The release of Microsoft’s preliminary claims reinforces the need for cohesive regulatory standards for reporting AI in healthcare, analogous to the CONSORT¹¹ and Good Clinical Practice (GCP)¹² frameworks in pharmaceutical research, to prevent overinflated claims based on non-representative benchmarks. Such standards should also cover the marketing of AI claims. Furthermore, medical education must evolve rapidly to prepare clinicians for AI-augmented practice, including training in evaluating AI tools, understanding algorithmic limitations, and awareness of regulatory standards. This is echoed in the NHS 10-Year Plan, which explicitly calls for clinicians to lead AI adoption, supported by trust-by-design principles, mandated transparency, and regulated, explainable systems.¹³ The onus must therefore remain on clinician input, not industry-dominated deployment.
The future lies not in replacing clinicians with AI but in thoughtful integration of AI systems co-designed with clinicians. While AI excels at pattern recognition, rapid data processing, and evaluating multiple hypotheses simultaneously, humans bring irreplaceable strengths in communication, empathy, ethical judgment, and adaptability. To fully realise the benefits of AI in healthcare, a structured framework is needed to enable clinicians to meaningfully engage with industry partners.
We believe it is not a question of AI versus doctors. The goal should not be to create AI systems that replace doctors, but to develop technologies that make good doctors even better.
Authors
Clarissa Carvalho
Mary Madden
Remi Paramsothy
- Guy’s and St Thomas’ NHS Foundation Trust
- Perioperative Research & Innovation in Medical AI (PRIMA)
- Responsible AI, UK
References
1. King D, Nori H. The Path to Medical Superintelligence. Microsoft AI; June 30, 2025.
2. Business Insider. Microsoft AI diagnosed cases more accurately than doctors in study. July 2025.
3. Limb M. Microsoft claims AI tool can outperform doctors in diagnostic accuracy. BMJ. 2025;390:r1385.
4. Liu X, Faes L, Kale AU, et al. A comparison of deep learning performance against health-care professionals in detecting diseases from medical imaging: a systematic review and meta-analysis. Lancet Digit Health. 2019;1(6):e271-e297.
5. Freeman K, Geppert J, Stinton C, et al. Use of artificial intelligence for image analysis in breast cancer screening programmes: systematic review of test accuracy. BMJ. 2020;368:m689.
6. Sorin V, Ueno T, Müller S, et al. Large language model (ChatGPT) for medical diagnosis: diagnostic accuracy and utility. medRxiv; 2024.
7. Takita H, Kabata D, Walston SL, et al. A systematic review and meta-analysis of diagnostic performance comparison between generative AI and physicians. NPJ Digit Med. 2025;8:175.
8. Chen X, Yi H, You M, et al. Enhancing diagnostic capability with multi-agents conversational large language models. NPJ Digit Med. 2025;8:159.
9. Chen C, Zhang J, Liu Y, et al. Toward interpretable clinical diagnosis with Bayesian network ensembles. arXiv; 2023.
10. Shrank WH, Rogstad TL, Parekh N. Waste in the US health care system: estimated costs and potential for savings. JAMA. 2019;322(15):1501-1509. doi:10.1001/jama.2019.13978
11. Hopewell S, Boutron I, Altman DG, et al. CONSORT 2025 statement: updated guideline for reporting randomised trials. Lancet. 2025;405(10489):1633-1640.
12. International Council for Harmonisation (ICH). ICH Harmonised Guideline: Guideline for Good Clinical Practice E6(R3). ICH; January 2025.
13. BCS. Fit for the Future: A 10-Year Health Plan for England. British Computer Society; July 3, 2025.
Declaration of interests
We have read and understood the BMJ Group policy on declaration of interests and declare the following interests: none.