Wikipedia Page Visits Predict Disease Outbreaks

Scouring the internet in an attempt to self-diagnose an oncoming illness might not be so ridiculous after all.
That’s because researchers have developed an algorithm that uses Wikipedia page visits to predict disease outbreaks 28 days in advance, by matching fluctuations in an outbreak to the number of visits to that disease’s Wikipedia page.
To do this, the study (conducted by Nicholas Generous, Sara Del Valle and colleagues) looked at 14 different disease-country pairs (influenza in the U.S., dengue in Brazil, and tuberculosis in Thailand, for example) and compared each pair’s outbreak trend against visits to every Wikipedia page in the relevant language. The results showed similar patterns in visits to 10 pages related to the disease.
“The general disease page was generally the one that correlated most strongly,” Generous tells Vox. “Also drugs and treatments, and, for the flu, different strains.”
This relationship was especially strong for dengue outbreaks in Brazil and the flu in the U.S. However, the correlation was insignificant for diseases like HIV/AIDS (likely due to the smaller percentage of the population that suffers from it and the lack of dramatic fluctuation over time).
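The idea of matching page visits to later outbreak numbers can be illustrated with a lagged correlation: shift the case counts back by a number of days and measure how closely they track page views. The following is a minimal sketch under stated assumptions, not the researchers' actual model. The function names (`pearson`, `lagged_correlation`, `best_lag`) and the daily-count input format are hypothetical and chosen only for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences of at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat series has no trend to correlate against
    return cov / sqrt(var_x * var_y)


def lagged_correlation(page_views, cases, lag):
    """Correlate page views on day t with case counts on day t + lag.

    Both inputs are assumed to be daily counts covering the same dates.
    """
    if len(page_views) != len(cases):
        raise ValueError("series must cover the same days")
    if lag < 0 or lag > len(cases) - 2:
        raise ValueError("lag must leave at least 2 overlapping days")
    if lag == 0:
        return pearson(page_views, cases)
    # Drop the last `lag` days of views and the first `lag` days of cases
    # so that each view count is paired with the case count `lag` days later.
    return pearson(page_views[:-lag], cases[lag:])


def best_lag(page_views, cases, max_lag):
    """Return the lag (in days, 0..max_lag) with the highest correlation."""
    return max(
        range(max_lag + 1),
        key=lambda lag: lagged_correlation(page_views, cases, lag),
    )
```

Called with a max_lag of 28, `best_lag` would scan the same 28-day window the article mentions. A high correlation at some lag only suggests that page views tend to move ahead of case counts; it does not by itself show that the visits reflect real infections.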
Google Flu Trends attempts to make these same predictions, but often overestimates flu levels compared with CDC data due to problems in its private search algorithm.
What researchers hope to accomplish with this is not only an accurate forecast for first-world countries like the U.S., but also a system that can be applied to countries with little public health knowledge.
“A global disease-forecasting system will change the way we respond to epidemics,” Del Valle tells Live Science. There are, however, issues with this proposed system.
One is Wikipedia’s method of grouping data by language, not by country. This isn’t overly problematic for countries like Poland or Thailand, where the majority of the language’s speakers live within the country’s borders. But it’s definitely confounding for English- and Spanish-speaking countries.
The other issue? The possibility of ignoring the data’s context. “With modeling, you sometimes see people over-trust the numbers,” Generous says. “It’s very important to understand the nature of the data you’re working with.”
For example, with the recent terror surrounding Ebola, traffic to Wikipedia’s page for the disease has skyrocketed, but this spike needs to be understood as fear of the disease rather than actual occurrences of it.
Generous believes that preliminary questions must be asked when reading and interpreting the data. “What are the biases in the data?” he asks. “What types of people are searching for their disease? What do these searches really mean?”
