Text analytics for life science using the Unstructured Information Management Architecture

IBM Systems Journal,  Sept, 2004  by R. Mack,  S. Mukherjea,  A. Soffer,  N. Uramoto,  E. Brown,  A. Coden,  J. Cooper,  A. Inokuchi,  B. Iyer,  Y. Mass,  H. Matsuzawa,  L.V. Subramaniam


Semantic text search. One of the most basic application functions is searching for documents, for example MEDLINE abstracts, using biomedical terms. The focus in BioTeKS on generating rich semantic annotations of terminology in biomedical documents suggests an opportunity for text search to index not only keyword content, but also the semantic annotations associated with those keywords. That is, we want to explore the value of indexing and searching not only keywords corresponding to a specific gene name, such as "LMNA," but also the annotations associated with these keywords, such as the category annotation "gene" associated with "LMNA." BioTeKS uses an IBM Research text search engine called Juru XML (54-56) to explore the indexing and search of annotations generated by BioTeKS annotators.

Semantic search can refer to many things, (8) but for our purposes it simply means indexing semantic information (in our case expressed as annotations) associated with keywords, and using this information in the search process. Juru XML can index and search both keywords (based on text content) and the semantic annotations of those keywords. The annotations are initially stored as CAS annotations and are expressed as XML tags for indexing. (54,56) The results of a Juru XML search are returned as a list of documents or document components, ordered by their relevance to the original query terms. For example, we can index all "sentences," the entity names contained within the span of these sentences, and other syntactic phrases (typically verb phrases) expressing biological functions of interest. This allows us to form queries on structures that more closely approximate certain types of relations, such as protein-protein interactions.

For example, the following query, expressed in XML tags, will find sentences that contain keywords annotated as "proteins" and syntactic phrases that contain keywords annotated as "biological function" (typically verb phrases like "binds to" or "inhibits"):

<Sentence>
  <Protein>SRV2</Protein>
  <Function></Function>
  <Protein></Protein>
</Sentence>

In this query, the user wants to see MEDLINE abstracts that contain sentences with a specific protein "SRV2," any other protein, and terms and phrases that have been annotated by BioTeKS as "biological functions." The specific annotator used for annotating sentences in this way is the Dictionary Lookup annotator (see Table 1), using dictionaries of protein names and biological function terms. This query does not ensure that the abstracts found will contain an actual protein-protein interaction, but pending evaluation, we believe such queries will greatly increase the likelihood of finding this combination of semantically annotated terms, and hence of finding actual interactions of interest.
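The matching logic behind such a query can be illustrated with a minimal Python sketch. It is not the Juru XML engine or its API: the `matches` function, the tuple-based annotation format, and the three sample sentences are all assumptions invented for illustration. An empty tag in the query (such as `<Function></Function>`) is modeled as a constraint whose text is `None`, meaning "any annotation of this type," and each constraint must bind to a different annotation so that the second protein is distinct from the first.

```python
from typing import List, Optional, Tuple

Annotation = Tuple[str, str]            # (annotation type, covered text)
Constraint = Tuple[str, Optional[str]]  # (annotation type, required text; None = any)


def matches(annotations: List[Annotation], query: List[Constraint]) -> bool:
    """Return True if every constraint can be bound to a distinct annotation."""

    def bind(i: int, used: frozenset) -> bool:
        if i == len(query):
            return True  # all constraints satisfied
        wanted_type, wanted_text = query[i]
        for j, (a_type, a_text) in enumerate(annotations):
            if j in used or a_type != wanted_type:
                continue
            if wanted_text is not None and a_text != wanted_text:
                continue
            # Tentatively bind constraint i to annotation j, then try the rest.
            if bind(i + 1, used | {j}):
                return True
        return False

    return bind(0, frozenset())


# The query from the text: protein "SRV2", any function phrase, any other protein.
query = [("Protein", "SRV2"), ("Function", None), ("Protein", None)]

# Hypothetical annotated sentences (toy data, not taken from MEDLINE).
sentences = {
    "s1": [("Protein", "SRV2"), ("Function", "binds to"), ("Protein", "actin")],
    "s2": [("Protein", "SRV2"), ("Function", "inhibits")],
    "s3": [("Protein", "actin"), ("Function", "binds to"), ("Protein", "profilin")],
}

hits = [sid for sid, anns in sentences.items() if matches(anns, query)]
print(hits)
```

Like the query in the text, this sketch only checks that the annotated terms co-occur within a sentence. It says nothing about whether the sentence actually asserts an interaction, and it does not model relevance ranking.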

Document clustering. Document clustering is a way to organize document collections (such as those derived from search results) into topical clusters. Clustering can complement text search or any other function that compiles collections of documents as part of an analytic process. We previously discussed an example of document clustering in the context of the Bio-Dictionary tool (12) (shown in Figure 2). The role of BioTeKS, as we indicated, is to extract text features, such as noun phrases, for input to the clustering-engine algorithm. These noun phrases are extracted using the shallow linguistic parser, an annotator that in turn relies on the tokenization and POS tagging annotators.
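To make the role of noun-phrase features concrete, the following Python sketch groups documents by the overlap of their extracted noun phrases. It is only an illustration under stated assumptions: the text does not describe the clustering-engine algorithm, so a simple single-pass scheme based on Jaccard similarity is swapped in here, and the function names, the 0.3 threshold, and the sample phrase sets are all hypothetical.

```python
from typing import Dict, List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Overlap of two phrase sets: |intersection| / |union|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(docs: Dict[str, Set[str]], threshold: float = 0.3) -> List[List[str]]:
    """Single-pass clustering: add each document to the most similar
    existing cluster, or start a new cluster if none reaches the threshold."""
    clusters: List[List[str]] = []
    features: List[Set[str]] = []  # union of noun phrases seen in each cluster
    for doc_id, phrases in docs.items():
        best, best_sim = None, threshold
        for k, feats in enumerate(features):
            sim = jaccard(phrases, feats)
            if sim >= best_sim:
                best, best_sim = k, sim
        if best is None:
            clusters.append([doc_id])
            features.append(set(phrases))
        else:
            clusters[best].append(doc_id)
            features[best] |= phrases
    return clusters


# Hypothetical noun phrases, standing in for the output of a shallow parser.
docs = {
    "d1": {"lamin a", "nuclear envelope", "muscular dystrophy"},
    "d2": {"lamin a", "nuclear envelope", "gene mutation"},
    "d3": {"actin cytoskeleton", "adenylyl cyclase"},
    "d4": {"actin cytoskeleton", "adenylyl cyclase", "yeast cell"},
}

result = cluster(docs)
print(result)
```

A single-pass scheme is sensitive to document order and to the chosen threshold. It is used here only because it is short enough to show how phrase features drive the grouping.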