22 Sep, 16 | by BMJ
I’ve been running the clinical search engine the Trip Database for nearly twenty years and, as it has evolved, opportunities have arisen to work with other sectors and with individuals who have different perspectives. Recently this has involved the academic information retrieval world: a chance conversation with academics at the University of Glasgow changed the way I looked at search. One really important notion I learnt about was clickstream data – the data websites collect on users’ interactions with the site. In the case of Trip this data equates to the search terms used and the articles users click on.
One thing’s for sure, Trip has lots of it. With a million searches per month (the vast majority by health professionals) we’ve amassed hundreds of millions of data points in the years since we started collecting it. This qualifies as big data. As with all big data projects the trick appears to be making sense of it all, and that is a journey I feel we’ve only just started.
Let’s take an example where a user searches for ‘acne and minocycline’; we could infer that the user was interested in the effectiveness of minocycline in the treatment of acne. The user may well then click on the Cochrane systematic review ‘Minocycline for acne vulgaris: efficacy and safety’, which would reinforce that inference. In isolation a single search and click may have very limited value, but aggregate the search behaviour of hundreds of thousands of people and the results will reveal patterns of behaviour that offer an insight into the uncertainties of clinicians. If we saw just a single search for ‘minocycline and acne’ we might conclude that it was of limited interest to health professionals, perhaps just a random search. But if it was searched a hundred times in a month, it would probably suggest an area of significant interest and uncertainty.
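To make the aggregation idea concrete, here is a minimal sketch in Python. It is not Trip's actual pipeline: the function names, the sample log and the rule of treating a query as an unordered set of terms (so that ‘acne and minocycline’ and ‘minocycline and acne’ are counted together) are all assumptions made for illustration.

```python
from collections import Counter

def normalise_query(query):
    """Reduce a query to an order-independent set of lower-cased terms.

    Assumption for illustration: the connective 'and' carries no meaning
    and is dropped, so term order and case do not affect the count.
    """
    terms = query.lower().split()
    return frozenset(t for t in terms if t != "and")

def count_queries(queries):
    """Count how often each normalised query appears in a search log."""
    return Counter(normalise_query(q) for q in queries)

# Hypothetical one-month search log.
log = [
    "acne and minocycline",
    "minocycline and acne",
    "Acne AND Minocycline",
    "psoriasis",
]
counts = count_queries(log)
```

A query that appears once in such a count could be noise; one that appears a hundred times in a month is a candidate area of clinical uncertainty.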
Trip has been capturing the usage data (also known as clickstream data) since 2010 and has accumulated hundreds of millions of data points. But is this data useful? The first major analysis Trip undertook of our data was to map the articles users looked at in the same search session. The image below shows one example:
In the image we have selected searches for urinary tract infection and mapped the connections (a connection being made when a user clicks on two articles in the same session – the user links them based on their intention). As you can see, the articles form distinct topic clusters. In the bottom left there is a clear cluster of articles on UTI and cranberry. It seems reasonable to suggest that these 19 articles form the core articles on the topic, all selected by Trip users. Equally interesting would be the ones they never clicked on: which articles don’t appeal to our users? As ‘impact’ becomes an increasingly important concept, articles that people don’t look at should be as interesting as those that are clicked on.
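The session mapping described above can be sketched as a small graph exercise. The following Python is an illustration under stated assumptions, not the method used to produce Trip's image: the article identifiers are invented, and connected components stand in as the simplest possible notion of a ‘cluster’ (a real analysis would more likely weight the links and use a proper community detection method).

```python
from collections import defaultdict
from itertools import combinations

def build_coclick_graph(sessions):
    """Link every pair of articles clicked within the same session."""
    graph = defaultdict(set)
    for clicked in sessions:
        unique = set(clicked)
        for article in unique:
            graph[article]  # register the article even if it has no links
        for a, b in combinations(sorted(unique), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph

def find_clusters(graph):
    """Return connected components: a simplistic stand-in for topic clusters."""
    seen = set()
    clusters = []
    for start in list(graph):
        if start in seen:
            continue
        component = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        clusters.append(component)
    return clusters

# Hypothetical sessions; each inner list is the articles one user clicked.
sessions = [
    ["cranberry-rct", "cranberry-review"],
    ["cranberry-review", "cranberry-guideline"],
    ["uti-antibiotics", "uti-resistance"],
    ["cranberry-rct"],
]
clusters = find_clusters(build_coclick_graph(sessions))
```

In this toy data the three cranberry articles end up in one cluster and the two antibiotic articles in another, even though no single user clicked on all three cranberry articles.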
More recently our journey/analysis has been boosted by our work with the Technical University of Vienna (TUW) as part of the Horizon 2020-funded KConnect project. The tools TUW have given us have allowed us to better understand the data. While a Twitter quiz might seem a superficial use of the data it proved insightful. The first quiz asked which skin condition do users search for most, given the options of acne, eczema and psoriasis. Over half thought the most popular search was acne, when in reality it was psoriasis. Similarly, when we asked which was most searched for out of influenza, measles, Zika and malaria the majority of users (80%) said Zika, when it was actually influenza. While the numbers were relatively small it shows that it’s not easy for individuals to guess where the uncertainties are. So, data-driven (‘evidence based’) analysis seems to be a useful tool in unearthing clinical uncertainties.
The data can be further analysed to show that when people search for influenza they most often also look for oseltamivir and the influenza vaccine. For Zika it was vaccine, embryopathy and imaging!
The data even allows us to explore topic areas to look for patterns and a recent sample of 4 weeks of data reveals that autism is the most popular topic looked for in the area of child health:
The data can be examined on a weekly basis, be it search terms or articles viewed. This allows us to see topical trends. Something might have been a source of great uncertainty, but subsequent new evidence or guidance might give health professionals the certainty they need and therefore reduce the need to search for answers. We have shown, for instance, trends similar to those shown by the Google Flu Trends service, in that the number of queries we received closely matched the reported incidence of influenza.
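A weekly view of this kind might be sketched as follows. The event format (a date paired with a query string), the function name and the simple substring match are assumptions for illustration; they say nothing about how Trip stores or matches its data.

```python
from collections import Counter
from datetime import date

def weekly_counts(events, term):
    """Count searches mentioning `term`, grouped by (ISO year, ISO week).

    `events` is an iterable of (date, query) pairs; matching is a plain
    case-insensitive substring test, which is an assumption of this sketch.
    """
    counts = Counter()
    for day, query in events:
        if term in query.lower():
            iso = day.isocalendar()
            counts[(iso[0], iso[1])] += 1
    return counts

# Hypothetical search events.
events = [
    (date(2016, 1, 4), "influenza vaccine"),
    (date(2016, 1, 6), "Influenza and oseltamivir"),
    (date(2016, 1, 12), "influenza"),
    (date(2016, 1, 13), "acne"),
]
flu = weekly_counts(events, "influenza")
```

Plotting counts like these week by week against reported incidence is the sort of comparison that would show whether search volume tracks an outbreak.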
Our initial interest was one of curiosity: does the way our users interact with the site show anything useful? We think it’s definitely interesting and potentially useful. For instance, we’re already discussing using this data with a large research funder to help in the prioritisation of new research. While neither side feels it gives a definitive answer, it’s probably another piece of the jigsaw in understanding uncertainties. As such it will complement the more labour-intensive systems that are currently used.
Jon Brassey is the founder and director of the EBM search engine the Trip Database. In addition to this he works as lead for knowledge mobilisation at Public Health Wales and is an honorary fellow at the Centre for Evidence-Based Medicine, Oxford. Clickstream analysis is a developing area for Trip and in the future this may have some commercial interest.