Creating centralized databases for biological information and developing tools for efficient access

In 2013, 1.1 million new scholarly articles in the life sciences were added to the National Library of Medicine's PubMed database. How can scientists assimilate this exploding volume of scientific data and knowledge? Put simply, they can't. To realize the full potential of scientific knowledge, the information from those thousands of traditional publications should be coalesced into computable biological knowledge bases. Dr. Peter Karp of SRI International is working to create a modern digital infrastructure for biological knowledge that captures knowledge both in a computable form and in the form of online scientific encyclopedias. Furthermore, because a computational model of an organism can be developed using a pathway knowledge base for that organism, such knowledge bases not only save scientists large amounts of time by providing concise presentations of information distilled from thousands of publications, but also allow computers to undertake large-scale systematic analyses of these data that are beyond the capacity of the human brain. Dr. Karp is making biological data more centralized, organized, and, most importantly, computationally accessible, to accelerate science.

As the Director of the Bioinformatics Research Group at SRI International, Dr. Peter Karp has established two highly used knowledge bases whose contents were distilled from tens of thousands of scientific articles. For example, his MetaCyc database describes 2,200 metabolic pathways and 12,000 metabolic reactions distilled from 43,000 publications. MetaCyc is part of a larger collection of 5,500 knowledge bases for organisms with sequenced genomes. The knowledge bases use artificial intelligence techniques to capture data that have been distilled from a large number of English-language publications. The distillation is performed manually by biologists, using a process called curation, because it is impossible for computers to accurately assimilate information in English-language form. The distilled information in Dr. Karp's databases is much faster for scientists to digest, and, more critically, a computer can perform large-scale analyses of the data in the distilled knowledge-base form that Dr. Karp's team creates.

Dr. Karp is also developing new computer methods for accessing, visualizing, and analyzing those knowledge bases. For example, one software tool predicts the components of the cellular biochemical factory within an organism based on its sequenced genome. A second tool generates computational metabolic models from pathway knowledge bases. Because biological systems are so complex, Dr. Karp's group develops computer-visualization tools that present large volumes of information to scientists in graphical forms that can be comprehended quickly. In addition to researching novel computational methods in bioinformatics, his team runs a production shop that delivers the software and databases to the scientific community via a heavily used website that operates around the clock.

Some of the specific software and databases developed by Dr. Karp's group include the following:

  • Dr. Karp's EcoCyc database integrates the complete Escherichia coli (E. coli) genome with information from 27,000 scientific publications to form a highly used computable knowledge base and online encyclopedia for the organism best known to science. The EcoCyc website receives 170,000 visitors per year and has been cited in publications more than 2,700 times. The computational metabolic model derived from EcoCyc predicts the lethality of E. coli gene knock-outs with 95% accuracy.

  • In the 1960s, scientists began creating manually drawn metabolic pathway charts that summarize the biochemical reactions occurring within a living cell. However, manually drawn metabolic charts were tedious to produce and did not accurately depict the metabolism of any one organism. As a result, they quickly became outdated. Dr. Karp has developed algorithms that automatically generate metabolic charts for each of the 5,500 knowledge bases in BioCyc. These charts -- a kind of Google Maps for the cellular biochemical factory -- draw on data taken from thousands of sources and reflect up-to-date knowledge about a specific organism. In the chart at the link (http://biocyc.org/overviewsWeb/celOv.shtml), each of the roughly 1,000 lines is a metabolic reaction that interconverts several chemical compounds. The shape of the diagram is not as important as its connections. One can move a cursor over the lines to identify a specific chemical and learn how it is synthesized.

  • Computational metabolic models simulate a cell's biochemical factory. Just as an oil refinery takes in crude oil for processing and produces various output products (such as gasoline, kerosene, and lubricating oils), cellular metabolic pathways take in external chemicals and convert them into a variety of chemicals necessary for cellular processes. Dr. Karp's models predict the relative rates of metabolic pathways -- the chemical reactions within a cell. These rates tell scientists how quickly a cell is producing different chemicals, and at what rate the cell is extracting energy from external components. If, for example, scientists want to add a new pathway to a cell for synthesizing a biofuel from chemicals found in the cell's environment, they need to know critical information about the cell's metabolic pathways: which natural pathways compete against the biofuel pathway, how those pathways can be shut down, and how more biofuel precursors can be directed toward making that fuel. Dr. Karp's metabolic modeling helps guide engineering changes to the cell's biochemical factory.
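
To make the idea of predicting knock-out lethality from a pathway knowledge base concrete, here is a minimal sketch in Python. It is not Dr. Karp's software or method, which this article does not describe in detail. It assumes a tiny, invented network (the names R1 to R4, geneA to geneD, X, Y, and "biomass" are all hypothetical) and uses a simple reachability check: a knock-out is called lethal if the cell can no longer make the target compound from its nutrients. Real metabolic models also predict reaction rates, which this toy example does not attempt.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

# Toy data only: every gene, reaction, and metabolite name here is invented.
# reaction id -> (gene encoding the enzyme, substrates, products)
Reaction = Tuple[str, FrozenSet[str], FrozenSet[str]]

REACTIONS: Dict[str, Reaction] = {
    "R1": ("geneA", frozenset({"glucose"}), frozenset({"X"})),
    "R2": ("geneB", frozenset({"X"}), frozenset({"Y"})),
    "R3": ("geneC", frozenset({"X"}), frozenset({"Y"})),  # backup route for R2
    "R4": ("geneD", frozenset({"Y", "ammonia"}), frozenset({"biomass"})),
}
NUTRIENTS: Set[str] = {"glucose", "ammonia"}  # chemicals supplied by the environment
TARGET = "biomass"                            # stand-in for "the cell can grow"


def producible(knocked_out: Set[str]) -> Set[str]:
    """Return every metabolite the network can make once some genes are removed."""
    available = set(NUTRIENTS)
    changed = True
    while changed:  # repeat until no reaction adds anything new
        changed = False
        for gene, substrates, products in REACTIONS.values():
            if gene in knocked_out:
                continue  # the enzyme is missing, so the reaction cannot run
            if substrates <= available and not products <= available:
                available |= products
                changed = True
    return available


def is_lethal(gene: str) -> bool:
    """A knock-out is called lethal if the target can no longer be produced."""
    return TARGET not in producible({gene})


def lethal_genes() -> List[str]:
    """Test each gene's knock-out in turn and list the lethal ones."""
    genes = sorted({gene for gene, _, _ in REACTIONS.values()})
    return [g for g in genes if is_lethal(g)]


if __name__ == "__main__":
    print(lethal_genes())
```

In this toy network, knocking out geneB or geneC is tolerated because each reaction backs up the other, which is the kind of competing or redundant pathway an engineer would need to know about before rerouting a cell's chemistry.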

For many years, Dr. Karp's two separate passions -- biology and computers -- were isolated from one another, and he could not identify a way to bring them together. He was very excited when he joined Stanford as a graduate student and became part of a team that was applying computer science to molecular-biology problems. Whereas many scientists must choose between two disciplines, or even specialize very narrowly to be successful, Dr. Karp has had the opportunity to take a broad view that allows him to look at how the parts of an entire biological system interact to produce its overall function. He can dissect and analyze many aspects of molecular biological knowledge while also building software and databases that accelerate the work of thousands of scientists.

While in graduate school, Dr. Karp attended a conference in Santa Fe, New Mexico, centered on the newly developed field of bioinformatics. The conference was unusual in many ways: whereas typical conferences last 2-3 days, this one was a 6-week working conference where the attendees listened to presenters during the morning and spent the afternoon working on projects. Because the field was then very young and underdeveloped, Dr. Karp had previously felt isolated, and the conference presented an opportunity for deep immersion while working alongside interesting colleagues. It gave him the opportunity to connect with like-minded people, some of whom he is still in contact with today. Dr. Karp also drew inspiration from Harold Morowitz, a senior researcher who presented at the conference about the origins of life and metabolic pathways. Additionally, Santa Fe was an inspiring setting that combined natural beauty, southwestern food, and a distinctive culture. By broadening his horizons and exposing him to other work under way in bioinformatics, the conference confirmed to Dr. Karp that he was pursuing the right field.

Because he entered the bioinformatics field during its early years, Dr. Karp has had the coveted ability to partially influence its development. While attending the aforementioned conference, Dr. Karp envisioned the creation of a database describing the complete metabolic machinery for one organism. Within 10 years, his group achieved that goal for E. coli. Once genome sequencing became available and grew progressively cheaper, the bioinformatics field took off, and discoveries came at a much more rapid pace. Currently, computer methods are used to predict metabolic machinery based on an organism's genome sequence. Leveraging the E. coli knowledge base, his group has now generated pathway knowledge bases for 5,500 organisms, ranging in complexity from bacteria all the way to humans.

Fellow, American Association for the Advancement of Science

Fellow, International Society for Computational Biology, 2012

Fellow, SRI International, 2008

Phi Beta Kappa, Summa Cum Laude, 1982

University of Pennsylvania