Text analytics for life science using the Unstructured Information Management Architecture
IBM Systems Journal, Sept, 2004 by R. Mack, S. Mukherjea, A. Soffer, N. Uramoto, E. Brown, A. Coden, J. Cooper, A. Inokuchi, B. Iyer, Y. Mass, H. Matsuzawa, L.V. Subramaniam
Entity extraction using dictionary lookup works well when domain experts can enumerate the names of entities of interest (as well as variants of these names). However, this is not always possible. The second approach to entity extraction is based on rules, not enumeration, where the rules can be developed by domain experts, either directly or by using machine-learning methods. The context in which a name occurs can sometimes provide feature cues for the semantic type of a term. Where this is the case, it is sometimes possible to write rules based on these features. The ChemFrag and DrugDosage annotators are examples of annotators based on rules.
The ChemFrag annotator combines regular expression rules that recognize fragments of organic chemical names with rules that assemble these fragments into larger descriptions. A small dictionary of prefixes and suffixes is used in some of the rules. An example of a recognized chemical name is Pivaloyloxymethyl 1-ethyl-1,4-dihydro-4-oxo-7-(4-pyridyl)-3-quinolinecarboxylate. Note that identification in this case only means categorizing chemical names as "chemical names," and does not mean identifying the specific chemical name in a standard resource (e.g., Reference 35). Rules are formal expressions such as "A fragment contains balanced parentheses or brackets and, possibly, numbers and hyphens." ChemFrag is a hybrid consisting mainly of rule classification based on features, augmented with a small dictionary of known prefixes and suffixes.
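The kind of rule quoted above can be illustrated with a minimal Python sketch. It is not the ChemFrag implementation: the affix lists, the function names, and the rule that adjacent fragments are simply joined are all assumptions made for illustration, and the heuristics will overgenerate on ordinary text.

```python
import re

# Hypothetical affix dictionary; the article does not list ChemFrag's entries.
CHEM_SUFFIXES = ("yl", "ylate", "oate", "azole")
CHEM_PREFIXES = ("methyl", "ethyl", "chloro", "hydroxy")

# A fragment may contain only letters, digits, commas, hyphens, and brackets.
FRAGMENT_CHARS = re.compile(r"^[A-Za-z0-9,\-()\[\]]+$")
# A locant is a digit next to a hyphen, as in "1-ethyl" or "oxo-7".
LOCANT = re.compile(r"\d-|-\d")


def has_balanced_brackets(token):
    """Return True if parentheses and square brackets are properly nested."""
    closers = {")": "(", "]": "["}
    stack = []
    for ch in token:
        if ch in "([":
            stack.append(ch)
        elif ch in closers:
            if not stack or stack.pop() != closers[ch]:
                return False
    return not stack


def is_chem_fragment(token):
    """Rule: well-formed token that has a locant or a dictionary affix."""
    if not FRAGMENT_CHARS.match(token) or not has_balanced_brackets(token):
        return False
    lower = token.lower()
    has_affix = lower.endswith(CHEM_SUFFIXES) or lower.startswith(CHEM_PREFIXES)
    return bool(LOCANT.search(token)) or has_affix


def find_chemical_names(text):
    """Assemble runs of adjacent fragments into larger descriptions."""
    names, current = [], []
    for raw in text.split():
        token = raw.strip(".,;:")
        if token and is_chem_fragment(token):
            current.append(token)
        elif current:
            names.append(" ".join(current))
            current = []
    if current:
        names.append(" ".join(current))
    return names
```

As in the article, a match here only labels a span as a chemical name; nothing is looked up in a reference resource.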
The DrugDosage annotator (see Table 1) is also a hybrid annotator. (36) In DrugDosage, dictionaries are used to identify known drug names (e.g., "ibuprofen") and strings associated with quantities and dosages (e.g., the quantity "20" and the abbreviation "mg" for milligrams). Rules classify co-occurrences of drug names, quantities, and dosage abbreviations (e.g., "ibuprofen, 20 mg") as "drug and dosage" concepts. The rules for identifying these concepts use a modified version of the FST engine used in the TAF shallow parser. However, instead of compiling and using English language syntax rules, the rules for drugs and dosages are modified to apply to noun phrases that contain patterns of drug names, quantities, and dosage modifiers. Writing rules manually requires expertise and iterative refinement to achieve satisfactory levels of accuracy.
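The combination of dictionaries and co-occurrence rules can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a regular expression stands in for the FST engine the article describes, the two small dictionaries are hypothetical, and the single pattern (drug name, then quantity, then unit) is far simpler than noun-phrase rules would be.

```python
import re

# Hypothetical dictionaries; the real annotator's contents are not given.
DRUG_NAMES = {"ibuprofen", "aspirin", "acetaminophen"}
DOSAGE_UNITS = {"mg", "g", "ml", "mcg"}


def _alternation(words):
    """Build a regex alternation, longest entries first."""
    return "|".join(sorted(map(re.escape, words), key=len, reverse=True))


# Rule: a known drug name, optional comma/space, a number, then a known unit.
DRUG_DOSAGE = re.compile(
    rf"\b(?P<drug>{_alternation(DRUG_NAMES)})\b[,\s]+"
    rf"(?P<quantity>\d+(?:\.\d+)?)\s*"
    rf"(?P<unit>{_alternation(DOSAGE_UNITS)})\b",
    re.IGNORECASE,
)


def find_drug_dosages(text):
    """Return (drug, quantity, unit) tuples for each co-occurrence found."""
    return [
        (m.group("drug").lower(), float(m.group("quantity")), m.group("unit").lower())
        for m in DRUG_DOSAGE.finditer(text)
    ]


if __name__ == "__main__":
    print(find_drug_dosages("Patient took ibuprofen, 20 mg twice daily."))
```

A drug name that appears without an adjacent quantity and unit produces no match, which mirrors the idea that the rule classifies the co-occurrence rather than the individual dictionary hits.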
The BioTeKS team is also exploring machine learning (ML) approaches to entity extraction. ML approaches are especially useful when neither dictionaries nor explicit rules are easy or possible to build. ML approaches begin with a training phase in which humans create a training corpus that provides true examples of the category of entity to be learned (e.g., drug or gene names). The ML process then automatically builds a classification model for the target category of term, based on features associated with terms in the document context. These may be features of the term itself (e.g., distinctive characters, substrings, prefixes, or suffixes) or features of the linguistic and semantic context in which entity names occur.
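The two kinds of features mentioned above, those of the term itself and those of its context, might be extracted along the following lines. This is a hedged sketch rather than the BioTeKS feature set: the feature names, the three-character affix length, the one-word context window, and the "DRUG"/"O" label scheme are all assumptions, and the resulting feature dictionaries would still need to be passed to a classifier of one's choice.

```python
def term_features(tokens, index):
    """Describe the token at `index` by its own form and its neighbors."""
    term = tokens[index]
    return {
        # Features of the term itself.
        "lower": term.lower(),
        "prefix3": term[:3].lower(),
        "suffix3": term[-3:].lower(),
        "has_digit": any(ch.isdigit() for ch in term),
        "has_hyphen": "-" in term,
        "is_capitalized": term[:1].isupper(),
        # Features of the surrounding context (one word on each side).
        "prev_word": tokens[index - 1].lower() if index > 0 else "<START>",
        "next_word": (
            tokens[index + 1].lower() if index + 1 < len(tokens) else "<END>"
        ),
    }


def build_training_examples(labeled_sentences):
    """Turn (tokens, labels) pairs from a hand-labeled corpus into
    (features, label) pairs suitable for training a classifier."""
    examples = []
    for tokens, labels in labeled_sentences:
        for i, label in enumerate(labels):
            examples.append((term_features(tokens, i), label))
    return examples


if __name__ == "__main__":
    # Hypothetical one-sentence corpus; "O" marks tokens outside any entity.
    corpus = [(["Patients", "received", "ibuprofen", "daily"],
               ["O", "O", "DRUG", "O"])]
    for features, label in build_training_examples(corpus):
        print(label, features)
```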
