Making use of unused data
«No one is using these data – this has to be changed», Atul Butte, Institute for Computational Health Sciences, San Francisco
Science is generating vast amounts of data, and more and more of them are readily available in open access databases. Take for example Pubchem: this database contains information about nearly 10 million chemical compounds, 1.2 million of which have been tested for their activity in bioassays. Or the various repositories for gene expression: they contain data from 2 million DNA microarrays, each of them showing expression profiles for every single gene in the genome. All of this information is publicly accessible to anyone.
There is just one problem: «No one is using these data», says Atul Butte, inaugural Director of the Institute for Computational Health Sciences at the University of California, San Francisco. Most of the time, scientists have a specific research question that they are trying to answer with an experiment. They use the data they generate to answer their question – and that’s it. The data are used just once. «This has to be changed», says Butte.
For him, science has to follow a new approach: «The data are already there, right at our fingertips. Now we can start to ask: What are the questions we can answer with them?»
Butte calls this approach «data recycling», and he is following it in his own research. As a medical doctor and computer scientist, his aim is to use data mining to gain better insights into diseases and to find possible treatments. For example, his lab is developing computational methods to find new uses for existing drugs, an approach known as drug repositioning, by relying on data available from databases. Using this method, he and his fellow researchers found that a drug originally used to kill parasitic worms might be effective against liver cancer, and that tricyclic antidepressants could be used to treat certain forms of lung cancer. Some of the repositioned drugs have only been tested in animal experiments so far, whereas others have made it into clinical trials. Butte emphasizes that the costs of his procedure to identify a drug candidate amount to under $1 million, compared with the $10 million to $1 billion a pharmaceutical company typically invests to bring a new drug to market.
His strategy has already proven successful. A few years ago, Butte combined available microarray data with proteomics to identify new blood biomarkers for preeclampsia. This condition, which is associated with very high blood pressure during pregnancy, is a major cause of death in pregnant women and their unborn babies. Based on his findings, Butte co-founded a spin-off, Carmenta Bioscience, that now offers a diagnostic test for preeclampsia. The company has already been sold, less than 24 months after the initial study started.
Butte and his colleagues are now working on a goal that seems even more ambitious than their previous efforts. They want to evaluate and combine data from electronic health records to predict disease progression, drug response, and the ‘next disease’ in individual patients. The researchers are planning to integrate electronic health records from five medical centers, covering more than 15 million patients. Funding is already secured: a few months ago, Mark Zuckerberg and Priscilla Chan donated $10 million to support Butte’s projects.