
Text analytics for life science using the Unstructured Information Management Architecture

IBM Systems Journal,  Sept, 2004  by R. Mack,  S. Mukherjea,  A. Soffer,  N. Uramoto,  E. Brown,  A. Coden,  J. Cooper,  A. Inokuchi,  B. Iyer,  Y. Mass,  H. Matsuzawa,  L.V. Subramaniam

The large-scale sequencing of the human genome has greatly increased our knowledge of the genetic basis of biological processes and accelerated the pace of research and development aimed at treating disease and enhancing the health and well-being of humans. However, these advances have also made biomedical research and data more complex to understand and apply. There is consensus in the life-science (LS) industry and academic laboratories that managing the complexity of biological data and knowledge requires an integrative, information-based systems approach, in which computer technology must play an essential role. For a cogent analysis of this situation and the role of computational methods in life science, see References 1-3.

Key components of computational technology relevant to this effort include analyzing, searching, and mining biomedical text, and correlating the structured data derived from text with data derived from biomedical experiments, transcribed medical records, and so on. This paper describes an IBM Research project to exploit and develop the text-analytical technology needed for managing, analyzing, and using biomedical text to solve problems in life science. We call the system BioTeKS, for "Biological Text Knowledge Services." BioTeKS is also one of the first major systems implemented with the IBM Unstructured Information Management Architecture (UIMA), which is described later in this paper, in other papers in this issue, (4) and elsewhere. (5) This paper begins by describing the role and value of text analysis in LS research and development, and how BioTeKS fits into the broad range of technologies needed to manage text content. It then focuses in detail on the BioTeKS system and on how BioTeKS is being used to explore text analysis, text search, and text mining to support problem solving in life science.

The role of text analysis in life-science research and development

Text analysis is a key component in text-oriented unstructured information management (UIM). The general goal of text analysis in UIM is to transform unstructured text information into structured information, and to use this information to support higher-level processes of text search, mining, and discovery. (For comprehensive reviews of UIM, see References 4, 6, and 7.) Transforming unstructured text into structured information means transforming "chunks" of text into specific, discrete data objects categorized or labeled by one or more attributes, where the data objects are words, phrases, or larger text segments. The essence of what the BioTeKS system does is information extraction (IE) for life-science text. Examples of IE include identifying names of biomedical entities, such as gene, protein, and disease names (which may be expressed in multiword phrases), and identifying more complex facts about and relations between entities, such as interactions between proteins, genes, and the functions associated with them, or the correlations between drug effects and disease indications. Several overviews of text IE exist, both general (8,9) and specific to life science, (10-12) and we assume readers have some familiarity with the basic technical issues in IE.
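
To make the idea of turning text "chunks" into labeled data objects concrete, the following is a minimal sketch of dictionary-based entity extraction in Python. It is not the BioTeKS implementation and does not use UIMA; the lexicon, the Annotation class, and the extract_entities function are hypothetical names introduced only for illustration, and a real system would rely on curated terminologies and more robust matching.

```python
import re
from dataclasses import dataclass

# Hypothetical mini-lexicon mapping surface terms to entity types.
# A real system would draw on curated biomedical terminologies.
LEXICON = {
    "BRCA1": "gene",
    "p53": "protein",
    "breast cancer": "disease",
}


@dataclass(frozen=True)
class Annotation:
    """A discrete data object: a labeled span of the source text."""
    begin: int   # character offset where the span starts
    end: int     # character offset just past the end of the span
    text: str    # the covered text
    label: str   # the attribute assigned to the span (entity type)


def extract_entities(text, lexicon=LEXICON):
    """Return annotations for every lexicon term found in the text."""
    annotations = []
    for term, label in lexicon.items():
        # Match the term only when it is not embedded in a longer word;
        # multiword terms such as "breast cancer" are matched as a whole.
        pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
        # Case-insensitive matching is a simplifying assumption here.
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            annotations.append(
                Annotation(m.start(), m.end(), m.group(0), label)
            )
    # Order annotations by their position in the text.
    return sorted(annotations, key=lambda a: a.begin)


if __name__ == "__main__":
    sentence = "Mutations in BRCA1 are associated with breast cancer."
    for annotation in extract_entities(sentence):
        print(annotation)
```

Each annotation records the offsets of the span, the covered text, and its label, which is the kind of structured record that downstream search and mining components could index. Extracting relations between entities, as described above, would require further analysis beyond this lookup step.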

The business and research value of extracting structured text information is that it can be used to solve problems in key biomedical domains and increase productivity in research and development. Text analysis can enhance general knowledge management practices and tools, for example, by improving the effectiveness (i.e., precision) of searching for documents in large collections, and by organizing these collections into taxonomy groupings or topic clusters for easier browsing. More importantly, text analysis can support knowledge discovery in various domains. Papers on knowledge portals and text-mining research in IBM can be found in recent special issues of the IBM Systems Journal. (6,7,13,14)

Figure 1 provides examples of text analysis phases in life science in relation to four key phases of drug research and development (see Reference 3). In the "Target Selection" phase for a drug (scenario 1), for example, a researcher needs to search scientific and patent literature to find out what drugs or diseases other researchers or institutions are working on, and what is already known and patented in this field. Knowledge discovery can increase the speed (and hence the productivity) of a drug researcher finding a drug target, a competitor's patent activity, or a participant in a clinical trial. In the "Preclinical" phase, researchers may conduct experiments that provide indications of relevant gene responses to drugs or disease agents. Scenarios 2, 3, 4, and 5 all pertain to finding literature describing aspects of genes and proteins that can help researchers investigate hypotheses about the relevance of these genes or gene products to some drug, disease, or biological process of interest. For example, studies have shown that literature references to genes can improve the search for gene homologies that may be relevant to identifying functions of novel target genes (15) (scenario 2), validate molecular pathways, (16,17) and help interpret why a cluster of genes might react together under some experimental conditions (18) (scenario 3).