The main things we look for when examining a new diagnostic test are “Is it as good as, or better than, our usual one?”, “Is it quicker?”, “Is it cheaper?” and “Is it easier for patients/less dangerous?”
While the latter three questions can be assessed by asking the folk who do the test, asking the managers who pay for the test, and undertaking an adverse effects systematic review, it’s the first of these that we tend to call “diagnostic test accuracy”, and as clinicians we want to look for “phase III” studies.
The premise of such studies is that we can evaluate how accurate a test is by comparing its results with those of a ‘reference standard’ – a thing by which we will judge if the patient really does, or really doesn’t, have the diagnosis in question* – in a group of patients in whom we want to know the answer.
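To make that premise concrete, here’s a minimal sketch (in Python, with counts invented purely for illustration) of the 2×2 table that sits underneath every diagnostic accuracy study: the new test cross-tabulated against the reference standard, from which sensitivity, specificity and the predictive values drop out.

```python
# A minimal sketch of the 2x2 table behind a diagnostic accuracy study:
# the new (index) test cross-tabulated against the reference standard.
# All counts below are made up purely for illustration.

tp, fp = 90, 30    # test positive: reference says disease / no disease
fn, tn = 10, 270   # test negative: reference says disease / no disease

sensitivity = tp / (tp + fn)   # how many of those with the disease the test picks up
specificity = tn / (tn + fp)   # how many of those without it the test correctly clears
ppv = tp / (tp + fp)           # chance a positive result is a true positive
npv = tn / (tn + fn)           # chance a negative result is a true negative

print(f"Sensitivity {sensitivity:.2f}, specificity {specificity:.2f}, "
      f"PPV {ppv:.2f}, NPV {npv:.2f}")
```

(The predictive values shift with how common the disease is in the group you study, which is part of why the spectrum question below matters.)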
Like all studies, these things can be subject to biases and errors, so you need to ask:
Appropriate spectrum of patients being studied?
If no, we can run into all sorts of problems (as described in the differentials blog, and illustrated with wee-based challenges).
Verification issues avoided?
Now this is a simple idea. If you look at your test result, see that it’s positive, and then call it a true diagnosis without doing the reference test, you have, somewhat, proven that what you call a squirrel is a squirrel because you say it’s a squirrel. The same issue arises with negative tests, where the negative never gets ‘proven’ either.
You might also have a reference standard that includes the thing you are testing: so if ‘pneumonia’ is defined as X-ray changes plus a high resp rate, and you then ask ‘how good is a low resp rate at ruling out pneumonia?’ … you get the idea…
(Technically, the first of these is partial or differential verification bias; the second, where the index test forms part of the reference standard, is incorporation bias.)
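To see why it matters numerically, here’s a toy sketch (again with made-up numbers, assuming only one in ten test-negatives ever gets the reference standard and the unverified patients simply vanish from the 2×2 table): the apparent sensitivity looks flattering and the specificity collapses, even though the test itself hasn’t changed at all.

```python
# A toy illustration (invented numbers) of partial verification bias:
# everyone who tests positive gets the reference standard, but only a
# fraction of test-negatives do, and unverified patients are dropped.

tp, fp = 80, 120     # the fully verified 'true' 2x2 table ...
fn, tn = 20, 780     # ... sensitivity 0.80, specificity 0.87

verify_negatives = 0.10          # only 10% of test-negatives get the reference test
obs_fn = fn * verify_negatives   # the disease we happen to catch among negatives
obs_tn = tn * verify_negatives

true_sens = tp / (tp + fn)
true_spec = tn / (tn + fp)
obs_sens = tp / (tp + obs_fn)    # looks far better than it really is
obs_spec = obs_tn / (obs_tn + fp)

print(f"True sens/spec:     {true_sens:.2f} / {true_spec:.2f}")
print(f"Observed sens/spec: {obs_sens:.2f} / {obs_spec:.2f}")
```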
Interpretation of test and reference standard blind to each other or objective?
Along the lines of the need for blinded outcomes in therapeutic trials, if you know what the result of one diagnostic bit is, you might be prompted to an answer. (Have you ever seen some subtle collapse/consolidation on a CXR after hearing right basal creps, that others just weren’t skilled enough to notice?)
Diagnostic thresholds reproducible, pre-defined or derived?
Marginally more tricky, but not really a complex idea … if you don’t know what made someone call the test ‘positive’ (e.g. how squelchy is the squelch sign for non-appendicitis?), then the test becomes unusable.
On more statistical lines, if the threshold for ‘positive’ on the test (e.g. d-dimers >10,000) is set before doing the study, you can believe it more than a study that collected all the data and then drew the ‘cut-off’ wherever it made the test look best.
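Here’s a toy sketch of that optimism (every number invented, simulating a vaguely d-dimer-like marker): pick the cut-off that maximises sensitivity plus specificity on one sample, then apply it to a fresh sample from the same population, and the accuracy usually comes out a shade worse. That gap is exactly why a pre-defined threshold is more believable than one drawn after peeking at the data.

```python
# A toy sketch of why a data-derived cut-off flatters the test: simulate a
# continuous marker in diseased and healthy patients, pick the cut-off that
# looks best on one (derivation) sample, then check it on a fresh
# (validation) sample drawn from the same population.
import numpy as np

rng = np.random.default_rng(0)

def sample(n_disease=60, n_healthy=60):
    """Marker values (arbitrary units) for diseased and healthy patients."""
    diseased = rng.normal(loc=11000, scale=3000, size=n_disease)
    healthy = rng.normal(loc=8000, scale=3000, size=n_healthy)
    return diseased, healthy

def sens_spec(diseased, healthy, cutoff):
    """Sensitivity and specificity when 'marker > cutoff' is called positive."""
    return float(np.mean(diseased > cutoff)), float(np.mean(healthy <= cutoff))

# Derivation sample: try every observed value as a cut-off and keep the 'best'.
d_dis, d_hea = sample()
candidates = np.concatenate([d_dis, d_hea])
best_cutoff = max(candidates, key=lambda c: sum(sens_spec(d_dis, d_hea, c)))

# Validation sample: the same cut-off, applied to patients it wasn't tuned on.
v_dis, v_hea = sample()
d_sens, d_spec = sens_spec(d_dis, d_hea, best_cutoff)
v_sens, v_spec = sens_spec(v_dis, v_hea, best_cutoff)
print(f"Derivation sample: sens {d_sens:.2f}, spec {d_spec:.2f}")
print(f"Validation sample: sens {v_sens:.2f}, spec {v_spec:.2f}")
```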
This then becomes the AVID way of appraising diagnostic test accuracy papers … like RAMBo and FAST.
– Archi