Big Brother? Using Twitter to gauge the health of a city
08 February 2013
Researchers have demonstrated how Twitter can be used to predict how likely it is for a Twitter user to become sick.
Twitter was used to model how other factors – social status, exposure to pollution, interpersonal interaction and others – influence health. Researcher Adam Sadilek of the University of Rochester in New York State points out that a straight population survey to determine these factors would be expensive and time consuming. Twitter and the technology his team has developed do it passively, quickly and inexpensively. "We can listen in to what people are saying and mine this data to make predictions," he claims.
Moreover, many tweets are geo-tagged, which means they carry GPS information that shows exactly where the user was when he or she tweeted. Collating all this information allows the researchers to map out, in space and in time, what people said in their tweets, but also where they were and when they were there. By following thousands of users as they tweet and go about their lives, the researchers say they can also estimate interactions between two users and between users and their environment.
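The article does not say how the team decides that two users have interacted. As a rough sketch only, one simple approach is to treat two users as co-located when they tweet close together in both space and time. The 100-metre and one-hour thresholds, the field names and the sample tweets below are all assumptions made for illustration, not details from the research.

```python
import math
from itertools import combinations

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two latitude/longitude points."""
    earth_radius_m = 6371000.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * earth_radius_m * math.asin(math.sqrt(a))

def co_located_pairs(tweets, max_metres=100.0, max_seconds=3600):
    """Return pairs of users who tweeted near each other in space and time.

    The thresholds are illustrative assumptions, not values from the paper.
    """
    pairs = set()
    for a, b in combinations(tweets, 2):
        if a["user"] == b["user"]:
            continue
        if abs(a["time"] - b["time"]) > max_seconds:
            continue
        if haversine_m(a["lat"], a["lon"], b["lat"], b["lon"]) <= max_metres:
            pairs.add(tuple(sorted((a["user"], b["user"]))))
    return pairs

# Hypothetical geo-tagged tweets; "time" is seconds since an arbitrary start.
TWEETS = [
    {"user": "alice", "lat": 40.7580, "lon": -73.9855, "time": 0},
    {"user": "bob", "lat": 40.7583, "lon": -73.9857, "time": 600},
    {"user": "carol", "lat": 40.6892, "lon": -74.0445, "time": 300},
    {"user": "dave", "lat": 40.7580, "lon": -73.9855, "time": 10000},
]

print(co_located_pairs(TWEETS))
```

Comparing every pair of tweets is fine for a small sample like this one, but following thousands of users would call for a spatial index or time-bucketed grid instead.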
In a paper presented last Friday (February 8) at the International Conference on Web Search and Data Mining in Rome, Sadilek showed how the new model accounts for many of the factors that affect health and how it can complement traditional studies in life sciences.
Using tweets collected in New York City over a period of a month, the researchers looked at factors such as how often a person takes the subway or goes to the gym or a particular restaurant, how close they are to a pollution source, and their online social status. They studied 70 factors in total, then looked at whether each had a positive, negative or neutral impact on the users' health.
Some of their results are perhaps not surprising; for example, pollution sources seem to have a negative effect on health. However, this is believed to be the first time this impact has been extracted from the online behaviour of a large population. Sadilek's paper also revealed a broader pattern, where virtually any activity that involves human contact leads to significantly increased health risks.
For example, even people who regularly go to the gym get sick marginally more often than less active individuals. However, people who merely talk about going to the gym, but actually never go (verified based on their GPS), get sick significantly more often.
The researchers believe this has revealed interesting confounding factors that can now be studied on a larger scale.
The technology that Sadilek and his colleague Professor Henry Kautz have developed has led to a web application called GermTracker. The application mines tweets from 10 cities worldwide and colour-codes users (from red to green) according to their health. Using the GPS data encoded in the tweets, the app can then place people on a map, which allows anyone using the application to see their distribution.
The app can be used by people to make personal decisions about their health, for example by avoiding places where there is an indication that sickness is prevalent. It might also be used in conjunction with other methods by governments or local authorities to try to understand outbreaks of diseases such as influenza.
The model that Sadilek and his colleagues have developed is based on machine learning. At the heart of their work is an algorithm trained to distinguish between tweets that suggest the person tweeting is sick and those that don't.
Sadilek likens it to teaching a baby a new language. They first generated a training set of 5,000 manually categorised tweets, from which the algorithm can start to learn which words and phrases are associated with someone being sick. For example, the algorithm needs to distinguish between those who tweet: "I'm sick and have been in bed all day" (clearly, they are sick) and those who tweet "I'm sick of driving around in this traffic" (not a reference to state of health).
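To give a feel for how a classifier can learn this distinction from labelled examples, here is a minimal sketch using a naive Bayes model over words and word pairs. The article does not say which learning algorithm the team used, so naive Bayes is a stand-in chosen for brevity. Apart from the two tweets quoted above, the handful of training tweets are invented for illustration and are nothing like the real 5,000-tweet training set.

```python
import math
import re
from collections import Counter

def features(text):
    """Lowercased words plus adjacent word pairs, so 'sick of' differs from 'sick and'."""
    words = re.findall(r"[a-z']+", text.lower())
    return words + [a + " " + b for a, b in zip(words, words[1:])]

def train(labelled_tweets):
    """Count feature occurrences per label (True = sick, False = not sick)."""
    counts = {True: Counter(), False: Counter()}
    docs = Counter()
    for text, is_sick in labelled_tweets:
        counts[is_sick].update(features(text))
        docs[is_sick] += 1
    return counts, docs

def predict_sick(model, text):
    """Return True if the 'sick' label scores higher under naive Bayes."""
    counts, docs = model
    vocab_size = len(set(counts[True]) | set(counts[False]))
    scores = {}
    for label in (True, False):
        total = sum(counts[label].values())
        score = math.log(docs[label] / sum(docs.values()))
        for f in features(text):
            # Add-one smoothing so unseen features do not zero out a label.
            score += math.log((counts[label][f] + 1) / (total + vocab_size))
        scores[label] = score
    return scores[True] > scores[False]

# Tiny hand-labelled set; only the first and fourth tweets come from the article.
LABELLED = [
    ("I'm sick and have been in bed all day", True),
    ("home with the flu and a fever", True),
    ("feeling sick, my throat hurts", True),
    ("I'm sick of driving around in this traffic", False),
    ("so sick of this rain", False),
    ("great day at the gym", False),
]

model = train(LABELLED)
print(predict_sick(model, "in bed with a fever"))
print(predict_sick(model, "sick of this weather"))
```

Including word pairs as features is what lets the model treat "sick of" as evidence against illness even though "sick" on its own appears in both classes.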
The application is also improving the algorithm. Every time someone goes onto the application and clicks on one of the coloured dots that represent the tweeting users, they can see the specific tweet that led someone to be classified in a specific way. The application asks you to assess the tweet yourself and say whether you agree with the classification or not. This gets fed back into the algorithm, which continues to learn from its mistakes.
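How this feedback is folded back into the algorithm is not described. One plausible sketch, under stated assumptions, is to tally the votes for each tweet, keep the predicted label when most visitors agree, flip it when most disagree, and add the result to the training set for the next round of training. The function name, the minimum of three votes and the sample data are hypothetical.

```python
from collections import defaultdict

def examples_from_feedback(predictions, votes, min_votes=3):
    """Turn visitor votes into new labelled examples.

    predictions: {tweet_text: predicted_is_sick}
    votes: iterable of (tweet_text, agrees_with_prediction)
    Tweets with too few votes, or with a tied vote, are skipped.
    """
    tally = defaultdict(lambda: [0, 0])  # tweet -> [agree count, disagree count]
    for text, agrees in votes:
        tally[text][0 if agrees else 1] += 1

    new_examples = []
    for text, (agree, disagree) in tally.items():
        if text not in predictions:
            continue
        if agree + disagree < min_votes or agree == disagree:
            continue
        # Keep the prediction if most voters agree, otherwise flip it.
        label = predictions[text] if agree > disagree else not predictions[text]
        new_examples.append((text, label))
    return new_examples

# Hypothetical classifier output and visitor votes.
PREDICTIONS = {
    "I'm sick of driving around in this traffic": True,  # a misclassification
    "I'm sick and have been in bed all day": True,
    "got my flu shot today": True,
}
VOTES = [
    ("I'm sick of driving around in this traffic", False),
    ("I'm sick of driving around in this traffic", False),
    ("I'm sick of driving around in this traffic", False),
    ("I'm sick and have been in bed all day", True),
    ("I'm sick and have been in bed all day", True),
    ("I'm sick and have been in bed all day", False),
    ("got my flu shot today", False),
]

print(examples_from_feedback(PREDICTIONS, VOTES))
```

Requiring several votes before accepting a correction is one way to keep a single mistaken or mischievous click from polluting the training data.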
The researchers have recently started two collaborations with colleagues at the University of Rochester Medical Centre. In one effort, they plan to link Twitter predictions to clinical influenza studies, and in another they are working with faculty in the Department of Psychiatry and the School of Nursing on extending these techniques to monitor and measure factors impacting depression and other psychological disorders.
Sadilek talks about his work in this YouTube video clip.
Although the work is centred on US cities, if you are interested, the app is hosted here.