Text analytics for life science using the Unstructured Information Management Architecture
IBM Systems Journal, Sept, 2004 by R. Mack, S. Mukherjea, A. Soffer, N. Uramoto, E. Brown, A. Coden, J. Cooper, A. Inokuchi, B. Iyer, Y. Mass, H. Matsuzawa, L.V. Subramaniam
Semantic text search. One of the most basic application functions is searching for documents, for example MEDLINE abstracts, using biomedical terms. The focus in BioTeKS on generating rich semantic annotations of terminology in biomedical documents suggests the opportunity for text search to index not only keyword content, but also the semantic annotations associated with those keywords. That is, we want to explore the value of indexing and searching not only keywords corresponding to a specific gene name, such as "LMNA," but also the annotations associated with these keywords, such as the category annotation "gene" associated with "LMNA." BioTeKS uses an IBM Research text search engine called Juru XML (54-56) to explore the indexing and search of annotations generated by BioTeKS annotators.
Semantic search can refer to many things, (8) but for our purposes it means simply indexing semantic information (in our case expressed as annotations) associated with keywords, and using this information in the search process. Juru XML can index and search both keywords (based on text content) and the semantic annotations of those keywords, which are initially stored as CAS annotations and are expressed as XML tags for indexing. (54,56) The results of a Juru XML search are returned as a list of documents or document components, ordered by their relevance to the original query terms. For example, we can index all "sentences," the entity names contained within the span of these sentences, and other syntactic phrases (typically verb phrases) expressing biological functions of interest. This allows us to form queries on structures that more closely approximate certain types of relations, such as protein-protein interactions.
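The idea of indexing annotations alongside keywords can be illustrated with a minimal sketch. It is not the Juru XML implementation: the sample sentences, the key layout, and the `build_index` and `search` helpers are assumptions made for illustration, and the sketch returns unranked matches rather than relevance-ordered results.

```python
from collections import defaultdict

# Hypothetical annotated sentences: each span is (text, annotation label or None).
SENTENCES = {
    "s1": [("SRV2", "Protein"), ("binds to", "Function"), ("actin", "Protein")],
    "s2": [("LMNA", "Gene"), ("is", None), ("mutated", None)],
}

def build_index(sentences):
    """Map keywords, annotation labels, and label+keyword pairs to sentence ids."""
    index = defaultdict(set)
    for sid, spans in sentences.items():
        for text, label in spans:
            index[("kw", text.lower())].add(sid)
            if label is not None:
                index[("ann", label)].add(sid)
                index[("ann+kw", label, text.lower())].add(sid)
    return index

def search(index, *keys):
    """Return the sorted ids of sentences that contain every requested key."""
    sets = [index.get(key, set()) for key in keys]
    return sorted(set.intersection(*sets)) if sets else []

index = build_index(SENTENCES)
print(search(index, ("ann", "Gene")))
print(search(index, ("ann+kw", "Protein", "srv2"), ("ann", "Function")))
```

Keeping separate keys for a bare keyword, a bare annotation, and an annotated keyword lets a query ask for "LMNA" as text, for any span annotated as a gene, or specifically for "LMNA" annotated as a gene.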
For example, the following query, expressed in XML tags, will find sentences that contain keywords annotated as "proteins" and syntactic phrases that contain keywords annotated as "biological function" (typically verb phrases like "binds to" or "inhibits"):
<Sentence> <Protein>SRV2</Protein> <Function></Function> <Protein></Protein> </Sentence>
In this query, the user wants to see MEDLINE abstracts that contain sentences with a specific protein "SRV2," any other protein, and terms and phrases that have been annotated by BioTeKS as "biological functions." The specific annotator used for annotating sentences in this way is the Dictionary Lookup annotator (see Table 1), using dictionaries of protein names and biological function terms. This query does not guarantee that the retrieved abstracts contain an actual protein-protein interaction, but, pending evaluation, we believe such queries will greatly increase the likelihood of finding this combination of semantically annotated terms, and hence of finding actual interactions of interest.
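To make the query semantics concrete, the following sketch parses the XML query and tests it against annotated sentences. It makes no claim about how Juru XML evaluates queries. The `(text, label)` span representation and the `parse_query` and `matches` helpers are hypothetical, and two readings are assumed: the order of elements within the sentence is ignored, and each query element must be satisfied by a distinct annotated span.

```python
import xml.etree.ElementTree as ET

QUERY = ("<Sentence> <Protein>SRV2</Protein> <Function></Function> "
         "<Protein></Protein> </Sentence>")

def parse_query(xml_text):
    """Turn the query into (annotation label, keyword or None) constraints."""
    root = ET.fromstring(xml_text)
    return [(child.tag, (child.text or "").strip() or None) for child in root]

def matches(sentence, constraints):
    """True if every constraint is satisfied by a distinct annotated span."""
    remaining = list(sentence)
    # Handle keyword-specific constraints first so that empty (wildcard)
    # elements cannot consume the only span carrying a required keyword.
    for label, keyword in sorted(constraints, key=lambda c: c[1] is None):
        for span in remaining:
            text, span_label = span
            if span_label == label and (keyword is None or text == keyword):
                remaining.remove(span)
                break
        else:
            return False
    return True

constraints = parse_query(QUERY)
interaction = [("SRV2", "Protein"), ("binds to", "Function"), ("actin", "Protein")]
one_protein = [("SRV2", "Protein"), ("binds to", "Function")]
print(matches(interaction, constraints))
print(matches(one_protein, constraints))
```

As in the text, a match only indicates that the annotated terms co-occur in one sentence; it does not establish that the sentence describes an interaction.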
Document clustering. Document clustering is a way to organize document collections (such as those derived from search results) in topical clusters. Clustering can complement text search or any other function that compiles collections of documents as part of an analytic process. We previously discussed an example of document clustering in the context of the Bio-Dictionary tool (12) (shown in Figure 2). The role of BioTeKS, as we indicated, is to extract text features, such as noun phrases, for input to the clustering-engine algorithm. These noun phrases are extracted using the shallow linguistic parser, and this annotator in turn uses tokenization and POS tagging annotators.
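The source does not say which clustering algorithm the engine uses, so the following is only a sketch of how extracted noun phrases could serve as clustering features. The greedy single-pass scheme, the Jaccard similarity measure, the threshold value, and the sample phrase sets are all assumptions chosen for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of noun phrases."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster(docs, threshold=0.3):
    """Greedy single-pass clustering over noun-phrase feature sets.

    Each document joins the first cluster whose seed document is similar
    enough; otherwise it starts a new cluster of its own.
    """
    clusters = []  # list of (seed feature set, [document ids])
    for doc_id, phrases in docs.items():
        for seed, members in clusters:
            if jaccard(seed, phrases) >= threshold:
                members.append(doc_id)
                break
        else:
            clusters.append((phrases, [doc_id]))
    return [members for _, members in clusters]

# Hypothetical noun phrases, standing in for the shallow parser's output.
DOCS = {
    "d1": {"protein interaction", "actin filament", "yeast"},
    "d2": {"protein interaction", "actin filament", "cofilin"},
    "d3": {"gene mutation", "muscular dystrophy"},
}
print(cluster(DOCS))
```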
