Research Projects

Who shares personal health information on the Internet?

Representation & Quality of Digital Data for Health Research

There are a number of systems that use data from the Internet (such as news, social media, and crowd-sourced reports) and other digital sources (e.g., cell phones and wearable devices) to monitor disease spread, assess population attitudes towards vaccines, and improve understanding of the interaction between population behavioral changes and health. In addition to the challenge of extracting public health signals from the noise inherent in these data sources, there are significant biases due to differences in the representation of individuals from different locations, age groups, and racial/ethnic backgrounds. Although several publications have discussed the limitations of these data sources, no project has developed a rigorous and comprehensive approach to systematically investigate these limitations and explore mitigation strategies.

Understanding the strengths and limitations of these data sources and systems would enable a rigorous assessment of their usefulness for public health research and potentially increase their acceptability among public health practitioners. Additionally, assessing the representativeness and quality of these data at the United States county level would be especially valuable. Information on health outcomes is typically available at the state or country level, and insufficient sample sizes at finer geographical resolutions make it difficult to assess health needs such as chronic disease prevalence, quality-of-life measures, and important determinants of health.

To this end, we will develop a process to assess the comprehensiveness and quality of data for public health research at the US county level. To illustrate this approach, we will focus on social media data, which are widely used to measure various health outcomes and to monitor disease. In order to improve population health, we need to understand current health and disease trends. This can be partially achieved by evaluating the quality of the data used for public health research.

Collaborators: Christan Grant at the University of Oklahoma, Kadija Ferryman at Data and Society, and Quynh Nguyen at the University of Maryland

Funding: Robert Wood Johnson Foundation

Data Science for Social Good – Unsafe Foods & FDA Recall

The goal of this project is to use product reviews from Amazon.com to identify potentially unsafe food products. Foods that are mislabeled, contaminated, or spoiled get recalled through a time-consuming process that can leave consumers at risk of allergic reactions, injury, and illness for months. Our goal is to use reviews that consumers post online to predict whether a product will be recalled.

Project details and code on GitHub

Foodborne Illness Surveillance

The aim of this project is to develop a framework for monitoring foodborne illness reports using data from sources such as Twitter and Yelp. We work with public health departments to develop tools for monitoring local reports of foodborne illness, both to target restaurant inspections and to support surveillance of foodborne disease outbreaks. This project is run in collaboration with the Computational Epidemiology Group at Boston Children’s Hospital.

Funding: R01 grant from the National Library of Medicine, National Institutes of Health.

Integrating Digital Data with Other Data Sources for Infectious Disease Surveillance and Forecasting

Researchers have developed computational and mathematical models to capture determinants of infectious disease dynamics, identify factors that support prediction of these dynamics, provide estimates of disease risk, and evaluate various intervention scenarios. While these studies have been extremely useful for understanding infectious disease transmission and control, most have been disease-specific and have used data solely from traditional disease surveillance systems. In contrast, a large amount of Internet-based data has been extensively assessed and validated for public health surveillance in the last decade, but it has scarcely been used in conjunction with other data sources for modeling to predict disease spread. Using these novel digital event-based data sources in combination with climate and case data from traditional disease surveillance systems, we will establish a much-needed framework for integrating these disparate data sources to estimate disease risk and forecast the temporal dynamics of infectious diseases.

We will pursue this through three aims. First, we will develop an automated process for acquiring, processing, and filtering data for modeling (Aim 1). Once we gather these data, we will develop temporal models for dynamically assessing the relationship between the various data variables and infectious disease incidence (Aim 2). Finally, we will assess the utility of the modeling approaches developed under Aim 2 for forecasting temporal trends of infectious diseases (Aim 3). Through data acquisition, thorough processing, and statistical and epidemiological modeling, guided by advisers with expertise in biomedical informatics, computer science, and statistics, we plan to achieve a comprehensive approach to integrating multiple data streams for forecasting infectious diseases.

Mentors: John Brownstein at Harvard Medical School, Al Ozonoff at Harvard Medical School, and Madhav Marathe at Virginia Tech

Funding: BD2K K01 from the National Institutes of Health

Previous Research Projects

Influenza Forecasting

My PhD dissertation focused on developing a method for forecasting influenza epidemics using network epidemic models. I also worked on influenza surveillance using novel data sources during my postdoc. See below for a list of relevant publications.

Computational approaches to influenza surveillance: beyond timeliness

A Dirichlet process model for classifying and forecasting epidemic curves

A systematic review of studies on forecasting the dynamics of influenza outbreaks

Forecasting Peaks of Seasonal Influenza Epidemics

Using Clinicians’ Search Query Data to Monitor Influenza Epidemics

Monitoring Influenza Epidemics in China with Search Query from Baidu

A Simulation Optimization Approach to Epidemic Forecasting