Predicting HIV

We collected over 550 million geolocated tweets and created an algorithm to find words and phrases suggesting HIV-related risk behaviors, such as “sex” or “get high.” The algorithm captured 8,538 tweets indicating sexual risk behaviors and 1,342 tweets suggesting stimulant drug use. Using the geolocation information, we plotted the origin of these tweets on a U.S. map to identify where these behaviors were occurring.
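
As a rough illustration of the keyword-matching step described above, here is a minimal Python sketch. It is not the study's algorithm: the category names, the `classify_tweet` helper, and the sample tweets are hypothetical, and only the two phrases quoted in the text ("sex" and "get high") are used as patterns.

```python
import re

# Hypothetical pattern lists. Only "sex" and "get high" come from the text;
# a real keyword list would be much longer.
RISK_PATTERNS = {
    "sexual_risk": [r"\bsex\b"],
    "drug_use": [r"\bget high\b"],
}

# Compile once, case-insensitively, with word boundaries so that
# place names such as "Essex" do not match "sex".
COMPILED = {
    label: [re.compile(pattern, re.IGNORECASE) for pattern in patterns]
    for label, patterns in RISK_PATTERNS.items()
}


def classify_tweet(text):
    """Return the set of risk categories whose patterns appear in the text."""
    return {
        label
        for label, patterns in COMPILED.items()
        if any(pattern.search(text) for pattern in patterns)
    }


# Made-up sample tweets, for illustration only.
tweets = [
    "Going to get high tonight",
    "Watching the game with friends",
    "Driving through Essex county",
]
labels = [classify_tweet(tweet) for tweet in tweets]
```

Plain keyword matching like this will produce false positives and misses, so any real pipeline would likely need a curated phrase list and some manual validation.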
We then merged these tweets with county-level data on HIV cases to run statistical prediction models. We found a significant positive correlation between county-level HIV prevalence and real-time communication about HIV-risk behaviors and drug use.
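
The merge-and-correlate idea can be sketched as follows. This is a simplified stand-in, not the statistical models used in the study: it counts flagged tweets per county, joins them to county-level prevalence, and computes a plain Pearson correlation. The county codes, prevalence values, and function names are all made up for illustration.

```python
from collections import Counter
from math import sqrt

# Hypothetical inputs (not data from the study): one county code per flagged
# tweet, and an invented HIV prevalence figure per county.
flagged_tweet_counties = ["06037", "06037", "06037", "36061", "36061", "17031"]
hiv_prevalence = {"06037": 30.0, "36061": 20.0, "17031": 10.0, "48201": 15.0}


def merge_counts(tweet_counties, prevalence):
    """Pair each county's flagged-tweet count with its prevalence value."""
    counts = Counter(tweet_counties)
    # Counties with no flagged tweets get a count of 0.
    return [
        (county, counts.get(county, 0), value)
        for county, value in sorted(prevalence.items())
    ]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


merged = merge_counts(flagged_tweet_counties, hiv_prevalence)
tweet_counts = [row[1] for row in merged]
prevalence_values = [row[2] for row in merged]
r = pearson(tweet_counts, prevalence_values)
```

A raw correlation on counts ignores differences in county population and Twitter usage, so a realistic analysis would need to adjust for such factors.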


Figure: HIV-risk tweets, with an overlay of county-level data.

This study provides the first evidence for how real-time social media data may be used for behavioral health prediction models. Moreover, it provides models for how public health departments and hospitals can use this approach to monitor and prepare for disease outbreaks.



Supported by the National Institute of Mental Health (K01 MH09884).