Text analytics for life science using the Unstructured Information Management Architecture
IBM Systems Journal, Sept, 2004 by R. Mack, S. Mukherjea, A. Soffer, N. Uramoto, E. Brown, A. Coden, J. Cooper, A. Inokuchi, B. Iyer, Y. Mass, H. Matsuzawa, L.V. Subramaniam
These annotators can be modified to work effectively on biomedical text. For example, standard POS taggers and shallow-parser rules are developed for sentence structures in generic text (e.g., news articles) and need to be modified for the sentence structures and word-to-word statistical patterns characteristic of narrated physician reports or other kinds of biomedical text. Additional NLP annotators also exist, including deep syntactic parsers. (30) However, we have not yet exploited these for life-science tasks.
Annotators for biomedical entity extraction. Entity extractors are annotators that identify the location of an entity name in a text and categorize the name relative to one or more knowledge resources such as MeSH and UMLS (Unified Medical Language System), both developed by the National Library of Medicine. (31) Examples shown in Table 1 include entity extractors for identifying genes, MeSH terms (also used for manual annotation of MEDLINE abstracts), drug names, and chemical-compound names.
Figure 6 shows a debugging tool used in BioTeKS to annotate in color specific categories of entities in a MEDLINE abstract. The legend at the bottom identifies semantic categories of words (e.g., "genes," "diseases," "chemicals and drugs"), and the data structure in the right frame is a CAS annotation for the selected term in the document. In this case, the annotation frame shows the CAS annotation for lamin A/C, which is a variant name of the gene LMNA that appears in the title line (TI). Note also that the gene string is annotated relative to LocusLink, (32) a publicly available gene description database.
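To make the idea of an annotation concrete, the following is a minimal Python sketch of the kind of record such a tool might display for a selected term. It is not the UIMA CAS API; the `Annotation` class, its field names, the sample text, and the placeholder identifier are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """Hypothetical, simplified stand-in for a CAS annotation (not the UIMA API)."""
    begin: int          # character offset where the entity name starts
    end: int            # character offset just past the entity name
    category: str       # semantic category, e.g. "gene"
    resource: str       # knowledge resource the name is categorized against
    resource_id: str    # identifier in that resource (placeholder here)

    def covered_text(self, document: str) -> str:
        # Recover the annotated string from the document by offset.
        return document[self.begin:self.end]


# Invented sample text loosely modeled on a MEDLINE-style record.
document = "TI - LMNA mutations in cardiomyopathy. AB - Defects in lamin A/C were observed."

start = document.index("lamin A/C")
annotation = Annotation(
    begin=start,
    end=start + len("lamin A/C"),
    category="gene",
    resource="LocusLink",
    resource_id="<id-placeholder>",  # assumption: the real identifier is not given here
)

print(annotation.covered_text(document), annotation.category, annotation.resource)
```

Storing offsets rather than copied strings keeps the annotation tied to its location in the text, which is what allows a viewer to color the exact span.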
[FIGURE 6 OMITTED]
Entity extraction is a broad domain (see Reference 8), and there are multiple techniques available. BioTeKS is exploring three approaches to entity extraction, and Table 1 identifies the general technique used to implement each annotator, namely:
* Pattern matching of terms (roughly string lookup) using a dictionary or database of known terms in some target category (e.g., "MeSH" terms or LocusLink-derived "gene" names)
* Rules defined over a set of features or annotations characteristic of a category of terms
* Machine learning, based on human-created training documents containing correct examples of some target category of terms and also based on features or annotations associated with the target category of terms
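The first of these techniques, dictionary-based pattern matching, can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the BioTeKS implementation: the toy dictionary, the function names, and the case-insensitive, longest-match-first policy are all choices made here for the example.

```python
import re

# Hypothetical toy dictionary: lowercased surface form -> (category, canonical name).
GENE_DICTIONARY = {
    "lmna": ("gene", "LMNA"),
    "lamin a/c": ("gene", "LMNA"),
}


def build_pattern(terms):
    """Compile one regular expression that matches any dictionary term."""
    # Longer terms first, so a longer name wins over any shorter name it contains.
    ordered = sorted(terms, key=len, reverse=True)
    alternatives = "|".join(re.escape(term) for term in ordered)
    # Lookarounds keep a term from matching inside a longer word.
    return re.compile(r"(?<!\w)(?:" + alternatives + r")(?!\w)", re.IGNORECASE)


def extract_entities(text, dictionary):
    """Return (begin, end, category, canonical name) for each dictionary hit."""
    pattern = build_pattern(dictionary.keys())
    hits = []
    for match in pattern.finditer(text):
        category, canonical = dictionary[match.group(0).lower()]
        hits.append((match.start(), match.end(), category, canonical))
    return hits


sample = "Mutations in LMNA alter lamin A/C expression."
for begin, end, category, canonical in extract_entities(sample, GENE_DICTIONARY):
    print(begin, end, sample[begin:end], category, canonical)
```

Mapping both surface forms to the same canonical name is one simple way to treat a variant such as lamin A/C as a mention of LMNA. Rule-based and machine-learning extractors would replace the lookup step with rules over features or with a trained model.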
The strategy in BioTeKS is not to build annotators for every possible biomedical entity. Doing so would be an open-ended task that typically requires access to specialized text sources (e.g., medical records) and domain expertise. Rather, we have developed examples of general techniques for a set of representative entities typical of bioinformatics (e.g., genes and proteins), medical informatics (e.g., drugs and disease indicators), and patent mining (e.g., drug and chemical names, disease indicators). New annotators can be built by extending this set with new dictionaries, FST rules, or, in the case of machine learning, training documents. This process is best done with domain experts, who can provide the in-depth expertise necessary for evaluating the quality of entity identification and for inferring how to iteratively improve the quality of annotators by adding terms and synonyms to dictionaries, improving rules, or developing more accurate training documents.
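One common way to quantify the quality of entity identification during such an iterative process is to compare an annotator's output with spans marked by a domain expert and compute precision and recall. The text does not say which measures BioTeKS uses, so the sketch below is an assumption; the function name and the example spans are invented for illustration.

```python
def precision_recall(predicted, gold):
    """Compare predicted entity spans with expert-marked spans (exact match)."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # Guard against empty sets to avoid division by zero.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Invented example: spans are (begin, end, category) tuples.
annotator_output = {(13, 17, "gene"), (24, 33, "gene"), (40, 44, "gene")}
expert_spans = {(13, 17, "gene"), (24, 33, "gene"), (50, 55, "gene")}

precision, recall = precision_recall(annotator_output, expert_spans)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Under this reading, low recall would suggest missing dictionary terms or synonyms, while low precision would point to rules or entries that match too broadly.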