Text analytics for life science using the Unstructured Information Management Architecture
IBM Systems Journal, Sept, 2004 by R. Mack, S. Mukherjea, A. Soffer, N. Uramoto, E. Brown, A. Coden, J. Cooper, A. Inokuchi, B. Iyer, Y. Mass, H. Matsuzawa, L.V. Subramaniam
The Dictionary Lookup annotator uses a "dictionary" of entity names, each categorized in relation to a knowledge resource such as MeSH. For example, in addition to a dictionary of MeSH terms, we developed a dictionary for gene names, compiled from the public LocusLink database. (32) These dictionaries are stored as XML files that include the canonical form of each biomedical entity name, its lexical variants, and other information, such as a reference to the source (database, authority, or "ontology") and the identifier of the entity name in that source.
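To make the shape of such a dictionary entry concrete, the following is a minimal sketch in Python. The article does not give the actual XML schema, so the element and attribute names (`dictionary`, `entry`, `canonical`, `variant`, `source`, `id`), the `DictionaryEntry` class, and the sample identifier are all hypothetical placeholders chosen for illustration.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import List

# Hypothetical XML layout: the real dictionary schema is not given in the article.
# The entry and its identifier are made-up placeholders, not LocusLink data.
SAMPLE_XML = """
<dictionary source="LocusLink">
  <entry id="EXAMPLE-0001">
    <canonical>TrkA</canonical>
    <variant>Trk A</variant>
    <variant>Trk-A</variant>
  </entry>
</dictionary>
"""


@dataclass
class DictionaryEntry:
    canonical: str            # canonical form of the entity name
    source: str               # database, authority, or ontology
    source_id: str            # identifier of the name in that source
    variants: List[str] = field(default_factory=list)  # lexical variants


def load_entries(xml_text: str) -> List[DictionaryEntry]:
    """Parse the assumed XML layout into DictionaryEntry records."""
    root = ET.fromstring(xml_text.strip())
    source = root.get("source", "")
    entries = []
    for node in root.findall("entry"):
        entries.append(
            DictionaryEntry(
                canonical=node.findtext("canonical", default="").strip(),
                source=source,
                source_id=node.get("id", ""),
                variants=[v.text.strip() for v in node.findall("variant") if v.text],
            )
        )
    return entries


if __name__ == "__main__":
    for entry in load_entries(SAMPLE_XML):
        print(entry)
```

Keeping the source and identifier next to each name is what would let a later processing step link a matched name back to its originating knowledge resource.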
The annotator applies pattern matching to match tokens in the text (potential names) to each item in the dictionary. The matching process is more than simple lookup because it also handles variations in case, in spacing and punctuation (e.g., "Trk A" or "Trk-A"), and in morphology, including stemming (e.g., the common "stem" underlying plural vs. singular forms of a word).
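The kind of variation-tolerant matching described above can be sketched as follows. This is an illustration under stated assumptions, not the annotator's actual algorithm: the functions `naive_stem`, `normalize`, `build_index`, and `find_matches` are hypothetical names, the plural stripping is a crude stand-in for a real stemmer, and the two dictionary names are sample values.

```python
import re
from typing import Dict, List, Tuple


def naive_stem(word: str) -> str:
    """Crude plural stripping; a stand-in for a real stemmer."""
    if len(word) > 3 and word.endswith("ies"):
        return word[:-3] + "y"
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def normalize(name: str) -> str:
    """Lowercase, drop spaces and hyphens, and stem each word."""
    words = re.split(r"[\s\-]+", name.lower().strip())
    return "".join(naive_stem(w) for w in words if w)


def build_index(names: List[str]) -> Dict[str, str]:
    """Map each normalized key back to its canonical dictionary name."""
    return {normalize(n): n for n in names}


def find_matches(text: str, index: Dict[str, str], max_tokens: int = 3) -> List[Tuple[str, str]]:
    """Greedy longest-first scan over token windows of up to max_tokens."""
    tokens = re.findall(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*", text)
    matches = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_tokens, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            key = normalize(candidate)
            if key in index:
                matches.append((candidate, index[key]))
                i += n
                break
        else:
            i += 1  # no window starting at this token matched
    return matches


if __name__ == "__main__":
    index = build_index(["TrkA", "receptor kinase"])
    print(find_matches("Trk-A and Trk A are receptor kinases.", index))
```

Collapsing separators and stemming before comparison lets one dictionary key cover several surface forms, at the cost of occasional false collisions between names that differ only in spacing or number.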
The Dictionary Lookup annotator is actually several annotators, each specialized for specific dictionaries. Dictionaries can be built from MeSH and UMLS resources available from the National Library of Medicine (NLM), (33) as well as from publicly available databases specialized for specific biomedical entities such as proteins (Swiss-Prot (21)) or genes (LocusLink (32)). Note that many pharmaceutical and biotechnology companies have also developed internal and proprietary dictionaries of terms, including variants and synonyms. In general, dictionary tools need to be able to incorporate new dictionary resources, and the Dictionary Lookup tool can do so, using other dictionaries when they are properly formatted.
The BioAnnotator (see Table 1) categorizes noun phrases provided by a shallow parser as biomedical phrases when they match terms in UMLS, either as complete matches or as partial (substring) matches. BioAnnotator uses the LanguageWare linguistic engine described earlier to identify UMLS terms. This is done by replacing the default English language dictionary in the engine with a dictionary based on UMLS. BioAnnotator also has a rule-based component to identify biological terms not present in UMLS and to resolve certain types of ambiguity in extracted gene names (e.g., some gene names are also used to name proteins or nonbiological entities, such as "BIKE"). (34) Dictionary Lookup and BioAnnotator are annotator options that have overlapping functions, but also explore different entity identification methods.
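The distinction between complete and partial matches can be illustrated with a short Python sketch. It assumes that noun phrases have already been extracted by a parser and that a "partial (substring) match" means a term appearing as a contiguous run of words inside the phrase; that reading, the function name `classify_phrase`, and the tiny term set (a stand-in, not an actual UMLS extract) are assumptions. It does not use or model the LanguageWare engine.

```python
from typing import Optional, Set

# Stand-in term set for illustration only; not an actual UMLS extract.
TERMS: Set[str] = {"protein kinase", "insulin"}


def classify_phrase(phrase: str, terms: Set[str]) -> Optional[str]:
    """Return "complete", "partial", or None for a noun phrase.

    complete: the whole phrase is a known term
    partial:  some contiguous run of words in the phrase is a known term
    """
    words = phrase.lower().split()
    if " ".join(words) in terms:
        return "complete"
    for start in range(len(words)):
        for end in range(start + 1, len(words) + 1):
            if " ".join(words[start:end]) in terms:
                return "partial"
    return None


if __name__ == "__main__":
    for np in ["protein kinase", "novel protein kinase inhibitor", "the weather"]:
        print(np, "->", classify_phrase(np, TERMS))
```

Under this reading, a phrase with a partial match would still be categorized as biomedical, since part of it corresponds to a known term.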
The Term Categorizer annotator elaborates on the semantic context of identified terms. For example, MeSH terms are categorized in a hierarchical taxonomy. This annotator optionally associates identified terms with additional information such as synonym and cross-reference information. Downstream applications can use this information in various ways, for example, to create a navigation function for browsing and selecting terms. This function could be bundled with Dictionary Lookup, but because it is an optional level of annotation, it is appropriate to keep it separate and invoke it only when needed.
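As a rough illustration of this kind of optional enrichment, the Python sketch below attaches a taxonomy path and synonyms to an already identified term. The toy hierarchy and synonym table are simplified, made-up stand-ins rather than actual MeSH content, and the names `PARENT`, `SYNONYMS`, and `categorize` are hypothetical.

```python
from typing import Dict, List

# Toy, simplified hierarchy (child -> parent); not the actual MeSH tree.
PARENT: Dict[str, str] = {
    "insulin": "hormones",
    "hormones": "chemicals and drugs",
}

# Illustrative synonym table; entries are placeholders.
SYNONYMS: Dict[str, List[str]] = {
    "insulin": ["insulin hormone"],
}


def categorize(term: str) -> Dict[str, object]:
    """Return the term with its root-to-term taxonomy path and synonyms."""
    key = term.lower()
    path = [key]
    # Walk up the hierarchy; stop if a node repeats (guards against cycles).
    while path[-1] in PARENT and PARENT[path[-1]] not in path:
        path.append(PARENT[path[-1]])
    path.reverse()
    return {"term": key, "path": path, "synonyms": SYNONYMS.get(key, [])}


if __name__ == "__main__":
    print(categorize("Insulin"))
```

A browsing interface could use the `path` list to place each term in a navigable tree, which is one way an application might build the navigation function mentioned above.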