The Web is an information source for most of us. But it’s also a dynamic, interactive medium, as fluid in its substance as in its focus. In some ways, the same could be said of scientific data and the trajectory of research, especially in bioinformatics and genomics. As a constantly growing information repository and source, genomic data is continually re-interpreted to deepen our understanding of disease risks, pharmacogenomics, personalized medicine, and much more.
Sepandar Kamvar is no stranger to large amounts of confusing data. After all, he co-authored the book “We Feel Fine: An Almanac of Human Emotion”. Previously head of personalization at Google and currently on the technical advisory boards of organizations as diverse as Etsy and NextBio, the assistant professor of Computational and Mathematical Engineering at Stanford University spoke with NextBio about the future of scientific information exchange.
NB: How did you get involved in designing algorithms that mine the biomedical literature?
SK: As an undergraduate, I was a chemistry major, and for my senior thesis I worked on a computational model for muscle contraction. That made me want to learn more about computer science and math, so I went to Stanford to get my Ph.D. in Scientific Computing and Computational Mathematics. At Stanford, my advisor was Chris Manning, who is an expert in natural language processing, and he got me interested in text mining.
NB: What do you see as unique information retrieval challenges in biological data right now?
SK: There are lots, but one interesting challenge specifically related to biological information retrieval and text mining is that the language of biology is specialized and different from most natural language. One implication of this is that creating training data for machine learning algorithms is difficult and expensive. There’s a lot of space for developing algorithms that involve humans to inexpensively create training data for biomedical literature.
NB: Ranging from human emotions and biomedical literature to personalizing web searches for users, you’ve handled a wide variety of data. Are there similarities to the research questions with each?
SK: One common challenge is scale; a lot of these applications involve warehousing and processing massive amounts of data, much more data than we were able to imagine, say, 15 years ago. This requires large-scale parallel processing frameworks with many commodity machines. A second common challenge is in the algorithms; many of these problems can be represented as machine learning problems, and algorithms for classifying biomedical data can be similar to algorithms for classifying social data. And finally, a lot of the design principles for visualizing the data are the same; for example, giving the user the ability to easily switch between different scales.
All of these systems—the web, Facebook, peer-to-peer music sharing systems, and peer-reviewed journals—are information networks. They have mechanisms for people to create, propagate, review, endorse, and remix information. In the past few years in biology (particularly in genomics), the ability to create information has increased tremendously, faster than the traditional journal system can propagate, review, endorse, and remix it. So tools inspired by (but different from) social networks will be useful for these tasks.
NB: We’re gradually getting more real-time with our social interactions online. How do you see this affecting scientific communication?
SK: On the web, the available technology and the social norms have created a shift in the past few years toward transparency and collaboration. In the scientific domain, we will undoubtedly see a similar shift, though it will take longer. I imagine we’ll see more people-centered exchange of scientific information, alternative modes of scientific communication, and software and networks that make it easy to work with data produced by others.
NB: From a researcher’s standpoint, what interests you about the NextBio platform? Any aspects which are especially appealing to you?
SK: It’s obvious that published data is incredibly valuable for further research, but right now the bar to using it is high. Individual scientists have to find the appropriate data set, warehouse it, write scripts to put it in the appropriate format, clean it of noise, and then repeat all of this for every data set they want to correlate with their own. In practice, this means that many scientists are discouraged from doing data analysis. NextBio’s platform takes care of that grunt work, making data analysis easier, which is very important.