Text analytics for life science using the Unstructured Information Management Architecture
IBM Systems Journal, Sept, 2004 by R. Mack, S. Mukherjea, A. Soffer, N. Uramoto, E. Brown, A. Coden, J. Cooper, A. Inokuchi, B. Iyer, Y. Mass, H. Matsuzawa, L.V. Subramaniam
UIMA also specifies various services and methods for collection-processing management, resource management for linguistic and other resources needed by annotators (e.g., dictionary or ontology resources), and CAS consumer methods for translating CAS annotations to other forms of data that can be indexed and stored (see Figure 4). Neither UIMA nor BioTeKS has functions for accessing document collections in the sense of crawling. (Figure 4 shows document access and management as external to the text-analysis component.) However, BioTeKS does adopt the UIMA collection-processing scheme for implementing "collection reader" functions, which feed documents from a compiled collection into a text-analysis engine. A collection reader parses each input document and initializes a new CAS structure containing the initial flat text on which annotators will operate, as well as optional document meta-data (e.g., title and author). BioTeKS has a collection reader specialized for MEDLINE abstracts, and a reader for aspects of patent documents. In both cases, the documents are initially available as XML documents with labeled fields for document-level meta-data, such as "title," "author," and "date," as well as fields for extended segments of text containing the contents of the documents (e.g., MEDLINE abstract, patent abstract, or claims).
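As a minimal sketch of the collection-reader step described above (not the BioTeKS or UIMA implementation), the following Python parses one XML document and initializes a simple CAS-like structure holding the flat text and the document-level meta-data. The `Cas` class, the `read_document` function, and the XML element names (`title`, `author`, `date`, `abstract`) are assumptions made for illustration; the actual MEDLINE and patent schemas are not given here.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class Cas:
    """Hypothetical stand-in for a CAS: flat text, meta-data, annotations."""
    text: str
    metadata: dict = field(default_factory=dict)
    annotations: list = field(default_factory=list)  # filled in later by annotators


def read_document(xml_string, text_fields=("abstract",),
                  meta_fields=("title", "author", "date")):
    """Parse one XML document and return a freshly initialized Cas."""
    root = ET.fromstring(xml_string)
    # Document-level meta-data: keep only the labeled fields that are present.
    metadata = {}
    for name in meta_fields:
        value = root.findtext(name)
        if value is not None:
            metadata[name] = value.strip()
    # Flat text: concatenate the extended text segments in the order requested.
    segments = []
    for name in text_fields:
        for element in root.iter(name):
            segments.append("".join(element.itertext()).strip())
    return Cas(text="\n".join(segments), metadata=metadata)


# Assumed sample document; the element names are illustrative only.
SAMPLE = """<doc>
  <title>Example title</title>
  <author>A. Author</author>
  <date>2004-09-01</date>
  <abstract>Protein X binds protein Y. The binding is weak.</abstract>
</doc>"""

cas = read_document(SAMPLE)
```

A reader for patent documents could, under the same assumptions, be obtained by passing different field names, for example `text_fields=("abstract", "claims")`.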
NLP annotators. BioTeKS includes the following generic NLP text annotation methods:
* LanguageWare* linguistic engine
* Part-of-speech (POS) tagger
* Finite state transducer (FST) with shallow parsing syntax rules
The LanguageWare linguistic engine (24) segments text into tokens and sentences, using a specific text annotation model. The LanguageWare tokenizer combines dictionary lookup with algorithmic processing to segment input text into distinct lexical units. (25) LanguageWare dictionaries also contain additional lexical information that can be associated with the lexical items identified as part of segmentation, such as a word's lemma or part of speech. This lexical information is useful for the subsequent annotation processes, including disambiguation of POS tags.
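The idea of combining dictionary lookup with algorithmic segmentation can be illustrated with a small Python sketch. This is not the LanguageWare algorithm: the toy `LEXICON`, the regular expression used as the algorithmic fallback, the longest-match strategy, and the punctuation-based sentence splitter are all assumptions chosen to keep the example short.

```python
import re

# Toy lexicon (hypothetical): surface form -> lemma and part of speech.
LEXICON = {
    "binds": {"lemma": "bind", "pos": "VERB"},
    "proteins": {"lemma": "protein", "pos": "NOUN"},
    "in vitro": {"lemma": "in vitro", "pos": "ADV"},
}
MAX_ENTRY_WORDS = max(len(key.split()) for key in LEXICON)

# Algorithmic fallback: runs of word characters, or single punctuation marks.
WORD_OR_PUNCT = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Segment text into tokens, preferring the longest dictionary match."""
    pieces = [(m.start(), m.end()) for m in WORD_OR_PUNCT.finditer(text)]
    tokens = []
    i = 0
    while i < len(pieces):
        # Dictionary lookup: try the longest multiword entry first.
        for n in range(min(MAX_ENTRY_WORDS, len(pieces) - i), 0, -1):
            begin, end = pieces[i][0], pieces[i + n - 1][1]
            surface = text[begin:end]
            entry = LEXICON.get(surface.lower())
            if entry is not None or n == 1:
                # n == 1 without an entry is the purely algorithmic case.
                tokens.append({"text": surface, "begin": begin, "end": end,
                               **(entry or {})})
                i += n
                break
    return tokens


def split_sentences(tokens):
    """Group tokens into sentences, closing one at each '.', '!' or '?'."""
    sentences, current = [], []
    for token in tokens:
        current.append(token)
        if token["text"] in {".", "!", "?"}:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


tokens = tokenize("The proteins bind in vitro. It binds.")
sentences = split_sentences(tokens)
```

In this sketch, tokens found in the lexicon carry `lemma` and `pos` entries that a later annotator, such as a POS tagger, could use; tokens produced by the fallback carry only their text and character offsets.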
The POS tagger and FST component annotators are research-enhanced annotators based on an earlier text analysis engine developed by IBM Research, called Textract, which was available in the IBM Intelligent Miner* for Text (IM4T) software product. (26) (Textract components are described in more detail in Reference 27.) These two annotators, along with the LanguageWare tokenizer, use a common "text annotation framework" (TAF), also described in Reference 27. TAF specifies a set of CAS annotation types appropriate for multiple levels of linguistic processing (for example, tokens (strings), terms (including multiword phrases), sentences, and clauses) and properties of these linguistic objects, such as the character location of the span of annotated text in a document and the part of speech for the span. Most BioTeKS annotators interoperate through this annotation type system, or through annotation types derived from this set.
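To make the notion of a shared annotation type system concrete, here is a minimal Python sketch of span-based annotation types with derived types. It is not the TAF definition from Reference 27: the class names, the `pos` property, and the `select` helper are hypothetical and only illustrate how annotators could interoperate through common types that record character offsets.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Annotation:
    """Base type: a span of the document text, given as character offsets."""
    begin: int
    end: int

    def covered_text(self, document):
        return document[self.begin:self.end]


@dataclass
class Token(Annotation):
    pos: Optional[str] = None  # part of speech, set by a POS tagger


@dataclass
class Term(Annotation):
    """A term, possibly a multiword phrase."""
    pos: Optional[str] = None


@dataclass
class Sentence(Annotation):
    pass


def select(annotations, annotation_type, within=None):
    """Return annotations of a type, optionally only those inside a span."""
    return [a for a in annotations
            if isinstance(a, annotation_type)
            and (within is None
                 or (within.begin <= a.begin and a.end <= within.end))]


document = "Protein kinase C binds calcium."
annotations = [
    Sentence(0, 31),
    Token(0, 7, pos="NOUN"), Token(8, 14, pos="NOUN"), Token(15, 16, pos="NOUN"),
    Token(17, 22, pos="VERB"), Token(23, 30, pos="NOUN"), Token(30, 31),
    Term(0, 16, pos="NOUN"),  # multiword term built over three tokens
]

term = select(annotations, Term)[0]
tokens_in_term = select(annotations, Token, within=term)
```

Because every type derives from `Annotation`, an annotator written against the base type (for instance, one that only needs spans) would also accept any derived type, which is one plausible reading of how annotators share a common type system.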
