Can we use online search query data to predict the next syphilis outbreak?

Considerable stir has been generated by Young & Aral (Y&A) in a recent modelling study claiming to have demonstrated the potential of Google online search data to identify and predict syphilis outbreaks.

The application of digital technologies to the epidemiology of STIs, and of syphilis in particular, is nothing new (Simms & Petersen (STI)). The modern resurgence of syphilis in the late 1990s could be considered a child of the digital age, especially since the more recent arrival of GIS apps for MSM dating. The question has always been whether a technology that has contributed so much to the ill could also contribute to the remedy. As early as 1997 (when, in the UK, syphilis was just re-emerging) Patrick & Brunham (STI) offered the interesting case of a heterosexual outbreak in Vancouver, where mapping the outbreak, and characterizing the population and relational network structure, led to the formulation of a tailored intervention strategy. Since then the development of GIS software and geo-spatial analytical techniques has better equipped epidemiologists to predict the trajectory of an outbreak and fine-tune their interventions (Simms & Petersen (STI)).

Regarding recent syphilis outbreaks, Petersen & Simms (STI) distinguish between endemic infection (such as we find, in today’s UK, in cities like London, Manchester and Brighton) which, once established, seems remarkably resistant to interventions, and time-space clusters sometimes resembling point source outbreaks. Though, in the UK, these clusters constitute only a tenth of all diagnoses, they offer ‘unique intervention opportunities’. It is here that a timely analysis may enable us to predict the trajectory and respond in the most effective manner.

No surprise, then, that Young & Aral have developed their monitoring and prediction tool with syphilis outbreaks in mind.

Over 2012-2014 they collected Google search query data around 25 keywords across 50 US states and related it to CDC surveillance data on primary/secondary syphilis, with a time-lag of one week. In each year, they trained their models over the first 10 weeks, then validated them over the remaining weeks. The models accurately predicted 144 weeks of syphilis counts for each state, with an overall average R squared of 0.9. (The coefficient of determination (R squared) is the proportion of the variance of a dependent variable that is predictable from the independent variable or variables, and will ordinarily range from 0 to 1.)
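For readers who want to see what a lagged prediction and an R squared amount to in practice, here is a minimal sketch in Python. It is not Y&A's method: a simple straight-line fit stands in for their models, the weekly figures are invented purely for illustration, and the function names (`lagged_pairs`, `fit_line`, `r_squared`) are hypothetical. It pairs each week's search volume with the following week's case count, fits on the first 10 pairs, and scores the remaining weeks.

```python
def lagged_pairs(searches, cases, lag=1):
    """Pair search volume in week t with case counts in week t + lag."""
    n = len(searches) - lag
    return searches[:n], cases[lag:]


def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - (residual SS / total SS)."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot


# Invented weekly search volumes for one hypothetical state (20 weeks).
searches = [10, 12, 15, 11, 14, 18, 20, 17, 22, 25,
            24, 28, 30, 27, 33, 35, 34, 38, 40, 42]

# Invented case counts: roughly 3 + 2 * (previous week's searches),
# nudged up or down by one case so the fit is not perfect.
cases = [5] + [3 + 2 * searches[t - 1] + (1 if t % 2 == 0 else -1)
               for t in range(1, len(searches))]

x, y = lagged_pairs(searches, cases, lag=1)

# Train on the first 10 lagged pairs, validate on the rest.
slope, intercept = fit_line(x[:10], y[:10])
predicted = [intercept + slope * xi for xi in x[10:]]
validation_r2 = r_squared(y[10:], predicted)
print(f"validation R squared: {validation_r2:.3f}")
```

One caveat the sketch makes visible: the 0 to 1 range is guaranteed only when R squared is computed on the same data used to fit a least-squares line with an intercept. On held-out weeks, as here, a poor model can score below zero.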

These findings would suggest that internet search data from Google Trends can indeed be used to predict syphilis outbreaks. Given the importance and difficulty of intervening in a manner that is timely as well as appropriate, the use of search query data would seem a very promising avenue. Amidst the present climate of hand-wringing over data misuse, this reminder of the potential benefits of online connectivity is itself timely and encouraging!

BMJ Blogs