Information and Knowledge Management (KM) are crucial components in any life sciences research and development strategy. Without an effective information management strategy, there is little point in doing biological research. This article focuses on the technology and guidance required to achieve good KM. We contend that the literature represents the largest source of information to be reaped, and also that a variety of software tools are needed to successfully integrate information from all relevant sources.
Christopher Larsen, Ph.D.
Provided here is an outline for establishing good KM practice, with discussions of how to integrate heterogeneous information concerning proteins, complexes, genes, compounds, and their interactions, modifications, mutations, and expression. Potential pitfalls of these strategies are presented, along with companies offering relevant products. Some forward thinking is required, because current software suites cover only a fraction of the total data.
The amount of information available to a life sciences researcher is extremely large, dynamic, and rapidly growing. The scientific literature alone is expanding at an astronomical rate, and advances in high-throughput technologies have significantly increased the amount of information that researchers must assess. They are faced with the daunting challenge of efficiently identifying, managing, and analyzing relevant scientific information from both public and proprietary sources.
Our scientific goal as informaticists is a complete conceptual unification of content: data, information, and knowledge. This content is heterogeneous in both format and origin. It resides in the literature, industrial assay output, research abstracts, conference proceedings, and personal communications. It is disseminated in various formats including raw text, email, web documents, traditional literature, patents, notebooks, spreadsheets, relational databases, marked-up text (XML/HTML), tables, and lists. It may address overlapping subsets of human macromolecules and metabolites and may not exhibit much formatting regularity.
Financially, there is good reason to use effective informatics. A successful informatics approach would keep data from being redundantly generated, make precise content searches possible, and allow rapid exchange of ideas in a research environment. One report from the Boston Consulting Group suggests that $200 million and two years could be shaved off a drug's development time by using informatics effectively.1 Other reports from Acumen mirror this conclusion, suggesting that 20 to 40 percent of total time spent on a typical proteomics project is wasted on searching for appropriate information.2
Given these problems, what must we do to successfully integrate a host of disparate knowledge sources? Three prerequisites (at least) are needed to unite heterogeneous research data: 1) standardized vocabularies, ontologies, and systematic nomenclatures for describing biomedical research knowledge, 2) a relational database platform capable of accepting all the data, and 3) import tools such as scripts, application programming interfaces (APIs), natural language processing (NLP), and data loaders that identify, capture, and integrate the data.
The resulting solution will help researchers share, access, import, standardize, write to, and link to data. Data cannot be well organized, searchable, and manipulable if they exist in layers and formats without a common underlying structure.
Standard vocabularies are needed so the user can search, navigate content, and use high-level functionality or metadata. For example, rough and smooth endoplasmic reticulum (ER) must be separated conceptually, and a parent-child relationship must be built into the storage system for the two subtypes ("ER" is the parent; "rough ER" and "smooth ER" are the children). The biologist understands the large functional difference between the two ER subtypes; the computer does not.
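As a minimal sketch of how such a parent-child relationship could be stored (the term names and hierarchy below are illustrative assumptions, not drawn from any particular ontology file), a query for the parent term can be made to match records annotated with either child:

```python
# Minimal sketch: a controlled vocabulary with parent-child links.
# Term names and relationships here are illustrative assumptions only.
PARENT_OF = {
    "rough endoplasmic reticulum": "endoplasmic reticulum",
    "smooth endoplasmic reticulum": "endoplasmic reticulum",
    "endoplasmic reticulum": "cytoplasm",
}

def ancestors(term):
    """Walk upward through the hierarchy, yielding every parent term."""
    while term in PARENT_OF:
        term = PARENT_OF[term]
        yield term

def matches(query, term):
    """A record annotated with `term` satisfies a query for any of its ancestors."""
    return query == term or query in ancestors(term)

# A search for the parent term now also finds records annotated with either subtype.
print(matches("endoplasmic reticulum", "rough endoplasmic reticulum"))  # True
```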
In relational databases that accept such data, search terms also need to cross-reference molecular synonyms so that a single query is not fragmented across alternative names. SwissProt is an excellent source of protein synonyms.3 By uniting different protein names such as Sentrin, SUMO, SUMO_1, SUMO-1, ySUMO, and others, database searches and content organization can be complete. These vocabularies are also necessary to mine unstructured literature data.
NLP approaches can be created with standard vocabulary syntax parsers that mine the literature with high speed and recall. Also, collection and archiving methods will fail without a strong vocabulary set.
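To make both ideas concrete, here is a toy sketch in Python; the synonym table is a hand-made assumption standing in for a curated source such as SwissProt, and a production literature parser would be far more sophisticated:

```python
import re

# Illustrative synonym table; a production system would load these
# aliases from a curated source such as SwissProt.
SYNONYMS = {
    "sentrin": "SUMO-1",
    "sumo": "SUMO-1",
    "sumo_1": "SUMO-1",
    "sumo-1": "SUMO-1",
    "ysumo": "SUMO-1",
}

def canonical(name):
    """Map any recognized alias to its canonical protein name."""
    return SYNONYMS.get(name.lower(), name)

def tag_text(text):
    """Return (alias, canonical) pairs found in a sentence of free text."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, SYNONYMS)) + r")\b",
                         re.IGNORECASE)
    return [(m.group(0), canonical(m.group(0))) for m in pattern.finditer(text)]

print(tag_text("Sentrin (also called ySUMO) is conjugated to RanGAP1."))
# [('Sentrin', 'SUMO-1'), ('ySUMO', 'SUMO-1')]
```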
A hierarchical and relational organization of vocabulary structures is called an ontology. Ontologies are important elements of knowledge management projects. Many databases are GO-compliant4, that is, they use the Gene Ontology Consortium's treed library of semantic terms that describe cellular phenomena such as molecular function, biological process, and subcellular localization (cellular component). There are also standard nomenclatures or ontologies for cell types, tumor cells, organs, and diseases from Medical Subject Headings (MeSH)5, the Cell Line Database (CLDB), Online Mendelian Inheritance in Man (OMIM), and the American Type Culture Collection (ATCC)6.
In addition to standard nomenclature, a good informatics effort may require a virtual location for data storage. By using the nomenclature terms as selectable menus or table structures in a relational database, the groundwork is laid for efficient warehousing. With standard terms, the table values for the relational database are set, and the data become searchable and related to a parent term such as localization. The terms can be used as search terms for user tools, can be accessed by applications such as graphing or visualization tools, or can be used as selectable menu terms for collection and archiving. Typically, relational databases such as Oracle and MySQL are created and probed using Structured Query Language (SQL) to locate data in their bins.
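A minimal sketch of this warehousing pattern, using Python's built-in sqlite3 module in place of Oracle or MySQL and with invented table and column names, might look like the following:

```python
import sqlite3

# Sketch only: table and column names are invented for illustration.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE localization_terms (term_id INTEGER PRIMARY KEY,
                                     term    TEXT UNIQUE,
                                     parent  INTEGER REFERENCES localization_terms);
    CREATE TABLE proteins (name TEXT, term_id INTEGER REFERENCES localization_terms);
""")
con.executemany("INSERT INTO localization_terms VALUES (?, ?, ?)",
                [(1, "endoplasmic reticulum", None),
                 (2, "rough endoplasmic reticulum", 1),
                 (3, "smooth endoplasmic reticulum", 1)])
con.executemany("INSERT INTO proteins VALUES (?, ?)",
                [("calnexin", 2), ("cytochrome P450", 3)])

# Because the vocabulary itself is a table, a query for the parent term can be
# joined out to its child terms and the proteins annotated with them.
rows = con.execute("""
    SELECT p.name, t.term
    FROM proteins p JOIN localization_terms t ON p.term_id = t.term_id
    WHERE t.term = ? OR t.parent = (SELECT term_id FROM localization_terms
                                    WHERE term = ?)
""", ("endoplasmic reticulum", "endoplasmic reticulum")).fetchall()
print(rows)  # [('calnexin', 'rough endoplasmic reticulum'), ...]
```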
If the target data are to be loaded into the resource at a rate faster than manual collection can supply, then import tools need to be designed. These might include Perl, Java, Python, or Ruby scripts; data loader applications; and conversion or wrapper utilities, such as Extract/Transform/Load (ETL) tools, that migrate data from other resources and convert vocabularies from one database system to another. This strategy is required if the data are broad and deep or occur in volume.
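As a hedged sketch of such an import tool (the file name, column layout, and vocabulary map below are assumptions made for illustration), a small loader can extract rows from a tab-delimited vendor export, transform free-text localization values onto the controlled vocabulary, and load the result into a relational table:

```python
import csv
import sqlite3

# All names below (file, columns, vocabulary) are illustrative assumptions.
VOCABULARY = {"er": "endoplasmic reticulum",
              "rough er": "rough endoplasmic reticulum",
              "smooth er": "smooth endoplasmic reticulum"}

def etl(tsv_path, connection):
    """Extract rows from a tab-delimited file, transform localization values
    onto the controlled vocabulary, and load them into a relational table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS annotations (protein TEXT, localization TEXT)")
    with open(tsv_path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            raw = row["localization"].strip().lower()
            term = VOCABULARY.get(raw)          # transform step
            if term is None:
                continue                        # or route to a curation queue
            connection.execute("INSERT INTO annotations VALUES (?, ?)",
                               (row["protein"], term))
    connection.commit()

# Usage (assumes a file with 'protein' and 'localization' columns):
# etl("vendor_export.tsv", sqlite3.connect("warehouse.db"))
```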
The largest problem with managing information is its scale and heterogeneity. To manage a large and variegated knowledge pool, an extensive integration project must accommodate at least three variables: scientific content, data format, and the software tools needed to bring information into a structured system. An effort to unify them under one vernacular might be termed "regularization" or "relationalization," and it can be the root of a successful project. Unstructured data need to be parsed, annotated, and stored; regularly formatted data need to be extracted, transformed, and loaded (ETL). The overall goal is to bring together unstructured (written text) language with well-structured regular data in tables and relational databases.
Figure 1. Hypothetical Breakdown of an Information Load on a Biological Researcher
There is a multiplicity of content formats. Generally there are structured and unstructured content sets. Structured data are both relational and nonrelational, in the form of databases, spreadsheets, tables, lists, and delimited text. We will define data as completely structured if they reside in relational database tables. Moderately structured sources such as XML have some regularity to them, whether in patent forms, computational linguistics artifacts, grammar, or other traits. Unstructured data are free text, patents, image and audio files, conference abstracts, marked-up text (HTML/XML), and, most importantly, the text and figures of the journal literature.
Data in various formats can only be integrated with a multifunctional suite of software solutions. For instance, data loaders for spreadsheets may not work on anything but regular ("computer format") data, and conversely natural language mining algorithms are best applied only to text strings. The technologies available for creating links between the AI and NI (artificial and natural intelligences) dictate what data can be addressed and what software crosstalk methods can be used.
The scientific guidance of an integration project dictates what can be accomplished. Any project has a human interest, a technological set of tools, a scientific focus, and a managerial mandate. Uniting them is key — efforts using a technology in search of a biological problem usually founder.
As an illustration of the different types and sizes of structured and unstructured data, Figure 1 shows a breakdown of information for a small biotech company. It is assumed that the researcher reads manuscripts, needs data from databases, and creates data in formats such as text, images, and spreadsheets, in addition to purchasing data from other vendors for a total of 1187 source documents.
Knowledge is built from the ground up. Just as a new house designed by an architect has a plan as well as parts, rooms, and an overarching structure, so also does the management of information. Technically, we must build knowledge from bits, then text strings, database IDs, and table structures. Scientifically, we must build up from sequence, structure, localization, modifications, mutation, function, and process, and finally to theory and cellular function.
Table 1 outlines one conceptual arrangement of these ideas. We can subdivide our target content into data (e.g., 2D gel spot 269b, 204.33 kDa), information (e.g., UBL domain from 119-190), knowledge (Cbl is probably a proto-oncogene), and content (all of the above). This organization forms a hierarchy, and different techniques must deal with its different layers of meaning.7
Table 1. A Holarchy of Informatics Problems, Vendors, Formats, and Solutions. Left to right: Reductionistic to Holistic level. Vendors are examples only and do not reflect a personal choice, technical superiority, or endorsement by the author.
Software approaches are quite powerful in dealing with large amounts of data and are becoming mission-critical for the life sciences researcher. The software-based shuffling of data text (as would be needed for sequences and other unidimensional data) is within the realm of the information technician's abilities. These efforts form the body of traditional bioinformatics and will not be dealt with here. Nonetheless, the problems of bioinformatics are many, even though they are perhaps more straightforward to solve. Good solutions have been found with faster computers, better algorithms, more inclusive informatics APIs, and crosstalk between applications that share low-level data such as database IDs, sequences, and transcript GO IDs.
Linguistics is a prime example of this "transcend and include" principle, and can be used as a parallel metaphor. In advancing from infancy to poetry, we must first expose ourselves to noise (awareness of the problem), gibberish (hypotheses), and then words (sequences). Words are strung into phrases (transcription units), which can be classified by their sound or meaning (subcellular distribution). When the vocabulary (tissue ontology) is finally developed and the information is handled correctly, we can begin to develop larger theories about the holistic zeitgeist (disease process) of our topic.
Only with the underpinnings of data and information can anything be generated that aspires to the power of an integrated approach to information management. Human minds do this automatically using interneurons and parallel processing; designing a software system to do so from the ground up, however, is difficult indeed. We have learned as a community to deal with sequence, structure, database management, and biological and molecular function. Now these form the basis of our ability to write some good poetry.
In total these strategies depend on comprehensive, flexible tools that can accept and mine content, and also on well-structured information to start with. Basic research and drug discovery are knowledge intensive, and rely on timely, accurate, detailed information from the primary literature. Unfortunately, the sheer volume of data is enough to discourage a solely human effort at integrating research knowledge. A database system with automated toolsets and useful import and analysis algorithms would help to alleviate this problem.
Who will pick up the torch of this problem? What efforts have already been made in the life sciences industry to acquire informational content and manage it well? Here are some answers.
The informatics efforts at NCBI8, EMBL9, and SwissProt10 are well known. A few companies concentrate on approaches based on high-throughput techniques, genomics, or data in a species or pathway niche. Excellent reviews of those resources are present in the industry or are available through online sites11. Corporations that have used or developed information management solutions extensively include Affinium Pharmaceuticals12, Lifespan Biosciences13, Caprion Pharmaceuticals14, Incyte15, and Celera16. Kinexus Corporation17 has developed a huge phosphoprotein-kinase screening service. Lifespan has a set of protein localization data, while Caprion determines subcellular locations of proteins by using cell fractionation proteomics in normal and diseased cells.
There are many niche databases available. We estimate that there may be as many as 5,000 biological databases. The journal Nucleic Acids Research publishes a listing of the top databases annually18. These databases are typically unidimensional, such as "all the mutations in human muscle proteins" or "SNPs from p53 in every organism." We believe that a complete solution should integrate all of these under one application. Nonetheless, no "out of the box" solution yet exists for the total integration of these data19. Clearly, these data and a good knowledge management strategy would accelerate drug discovery in the very near term. All we need to do is improve the software to identify the most promising candidates.
Visual presentations of the interactions of cellular systems are an area of great interest. Some companies that offer interaction network managers are Cognia, Ariadne, Jubilant, Ingenuity, and Lion (Table 2). Cognia and Jubilant feature manual collection systems. The Ariadne system uses mining algorithms to generate networks from available abstracts and papers. Gene Network Sciences has taken a physics-based approach to visualize cellular interaction networks and has created its own notation for effects such as activate, inhibit, and repress. Ingenuity has also developed its Pathways Knowledge Base to address some of these problems.
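A minimal sketch of how such a network and its effect notation might be represented in software is simply a list of typed edges; the node names and effect vocabulary below are invented for illustration and do not reflect any vendor's actual schema:

```python
# Illustration only: the effect vocabulary and evidence field are assumptions,
# not any vendor's actual notation or schema.
EFFECTS = {"activates", "inhibits", "represses"}

class Interaction:
    """One typed edge in a cellular interaction network."""
    def __init__(self, source, effect, target, evidence=None):
        assert effect in EFFECTS, "unknown effect type"
        self.source, self.effect, self.target = source, effect, target
        self.evidence = evidence      # e.g. a literature citation, left abstract here

network = [
    Interaction("p53", "activates", "p21"),
    Interaction("MDM2", "inhibits", "p53"),
]

def downstream(node):
    """Everything the given node acts on, with the type of effect."""
    return [(i.effect, i.target) for i in network if i.source == node]

print(downstream("p53"))   # [('activates', 'p21')]
```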
Table 2. Information Management Products and Companies
These applications are dependent on a few variables: data availability, data entry, information management, and user interface. Taken together, these present a difficult standard for software developers to agree on.
Let us look at an informatics solution set used by a small biotech startup. The approximate size is a terabyte of information. This is divided among structure, sequence, annotations, expression profiles, and other image banks. We may have twenty software applications to deal with them. There are many data formats, and many applications to access them. Unfortunately, no unifying software is available that can do all the things that are required by researchers. One must rely on a suite of applications as described here.
The National Institutes of Health is responding to these concerns and is currently attempting to implement a system that unifies publishers, scientists, and data archivists. There is a precedent from Los Alamos National Laboratory, where GenBank required scientists to deposit their sequence data prior to publication. This approach overcomes many hurdles and could perhaps only be attempted by an agency as large and respected as the National Center for Biotechnology Information.
Only a few companies have created true KM software. Rather than address a single problem deeply, they attempt a broad approach that compiles all the data components and tries to generate whole bodies of knowledge. These are Lion, Ingenuity, Cognia, and IBM. The products attempt to bridge information management problems. Aside from databases and purchased content, other tools available to informaticians include custom-made scripts, APIs, import applications, batch data loaders, and ETL protocols.
Whole application suites have been attempted as stand-alone versions (Lion Biosciences) or as wrapper approaches (IBM's Discovery Link20). The IBM approach allows for strong sequence retrieval and querying from a variety of sources. However, because it does not integrate or warehouse data, it is somewhat more of a translator. Massive curation (collection and archiving) of literature knowledge has occurred in the yeast field (the Yeast Proteome Database) and at small companies (Proteome, before its merger with Incyte). Larger efforts at data gathering have been reported (Ingenuity). Archival systems that integrate the collection of knowledge are used publicly in MySQL-based constructs, and they are now standardized in the Cognia Molecular21 system. A larger, more profitable company, Invitrogen, has taken up the older efforts of Informax. Accelrys, focusing on computational biology rather than knowledge integration, offers a strong suite of products that touches on many of the same data, but it is not curation driven and is a toolset rather than a content body.
Several companies offer software applications or suites in addition to informational content. One such offering is a database to house that information, which is an excellent destination for such data if it is robust, flexible, and relational. Basic choices for the integration of database-housed knowledge can be divided into federation technologies and warehousing methods. A classic early effort at database KM is the Proteome BioKnowledge Library, which was purchased by Incyte and is now owned by BioBase. Cognia Molecular uses a large, standardized vocabulary and is GO-compliant, to reflect the informational complexity of biology and chemistry. Cognia Molecular provides a scientific collection and archiving interface, whereas others only offer contract-driven literature collection as a service.
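To make the federation-versus-warehousing distinction concrete, the sketch below contrasts the two styles; every class and method name is invented for illustration. A federated system leaves data in their original sources and dispatches each query through per-source wrappers, whereas a warehouse copies everything into one local store ahead of time:

```python
# Every class and method name here is invented to illustrate the contrast;
# real systems would wrap SQL endpoints, flat files, or web services.

class MemorySource:
    """Stand-in for one external resource (a 'wrapped' database)."""
    def __init__(self, records):
        self.records = records
    def search(self, protein):
        return [r for r in self.records if r["protein"] == protein]
    def export_all(self):
        return list(self.records)

class FederatedQuery:
    """Federation: leave data where they live; dispatch each query at run time."""
    def __init__(self, sources):
        self.sources = sources
    def search(self, protein):
        hits = []
        for source in self.sources:
            hits.extend(source.search(protein))
        return hits

class Warehouse:
    """Warehousing: copy and standardize everything into one local store up front."""
    def __init__(self):
        self.records = []
    def load(self, source):
        self.records.extend(source.export_all())   # the ETL step happens here, once
    def search(self, protein):
        return [r for r in self.records if r["protein"] == protein]

literature = MemorySource([{"protein": "SUMO-1", "fact": "localizes to the nucleus"}])
assay_data = MemorySource([{"protein": "SUMO-1", "fact": "also known as Sentrin"}])

federated = FederatedQuery([literature, assay_data])
warehouse = Warehouse()
for src in (literature, assay_data):
    warehouse.load(src)

# Both strategies answer the same question; they differ in where the data live.
assert federated.search("SUMO-1") == warehouse.search("SUMO-1")
```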
The available tools concentrate largely on regularized data, but the amorphous, large, and unstructured primary literature must feature heavily in any integration plan. This is the KM frontier. True KM will involve much more than writing a Perl script to extract annotations from spreadsheet tables or text files. No database alone will suffice; no single agency's work will be adequate.
Achieving real KM will involve a massive reconsideration of the entirety of the human vocabulary of science and our publication methods. It will grapple with epistemological questions about the nature of natural science, and will force us to record (in database form) a mirror image of our perception of life and knowledge. We will need vocabularies, database wrappers, ETL tools, natural language processors to relationalize the literature, and all the abilities of our coders and database administrators to help us store and effectively use it all. Most of all, these tools must output to a single user of knowledge, brimming with content of every sort.
Christopher Larsen, Ph.D., is senior biological programmer at Neuformatics, 215 S. Broadway #341, Salem, NH 03079; phone 347.837.0727; fax 801.459.8850; clarsen@neuinformatics.com.
1. Tollman P, Altshuler J, Flanagan A, Guy P, Steiner M. A revolution in R&D: How genomics and genetics are changing the face of the biopharmaceutical industry. Boston Consulting Group, Boston, MA; 26 Nov 2001. Available at: www.bcg.com/publications/files/eng_genomicsgenetics_rep_11_01.pdf.
2. Acumen Journal of the Sciences. Volume 1. Acumen is now reorganized and unavailable. See www.bio-itworld.com/archive/bases/052903.html for a brief sketch of the company.
3. Boeckmann B, Bairoch A, Apweiler R, Blatter MC, Estreicher A, Gasteiger E, Martin MJ, Michoud K, O'Donovan C, Phan I, Pilbout S, Schneider M. The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res. 2003; 31:365-370.
4. The Gene Ontology Consortium. Pathway databases: a case study in computational symbolic theories. Science 2001; 293(5537):2040-4. Available at: www.geneontology.org.
5. Nelson SJ, Johnston D, Humphreys BL. Relationships in Medical Subject Headings. New York. Kluwer Academic Publishing 2001. p. 171-184.
6. The American Type Culture Collection, ATCC. Available at: www.atcc.org/Home.cfm.
7. Wilber K. Sex, Ecology, Spirituality: The Spirit of Evolution. Boston, MA: Shambhala Press; 2001.
8. NCBI, National Library of Medicine. Available at: www.ncbi.nlm.nih.gov/entrez/query.fcgi.
9. European Molecular Biology Laboratory. Available at www.embl-heidelberg.de/.
10. Swissprot and TrEMBL. Available at: http://us.expasy.org/sprot/.
11. Bioinform URL www.bioinform.com and Bioneq URL www.bioneq.qc.ca/ for example.
12. Affinium Pharmaceuticals. Available at: www.integrativeproteomics.com/html/home/home.shtml.
13. Lifespan Biosciences. Available at www.lsbio.com/.
14. Caprion Pharma. Available at: www.caprion.com/index2.html.
15. Incyte Genomics. Proteome Bioknowledge Library and LifeSeq Foundation Products. Available at: www.incyte.com/login.html.
16. Celera. Available at: www.celera.com/.
17. Kinexus Bioinformatics. Phosphoprotein Screening Service. Available at: www.kinexus.ca/.
18. The Database Issue. Nucleic Acids Research 2005 January; 33(1).
19. Carel R. Practical data integration in biopharmaceutical research. Pharmagenomics 2003; 3(5):22-36.
20. IBM Corporation Discovery Link. Available at: www-306.ibm.com/software/data/dmsolutions/lifesciences/.
21. Cognia. Available at www.cognia.com/.