Text analytics for life science using the Unstructured Information Management Architecture
IBM Systems Journal, Sept, 2004 by R. Mack, S. Mukherjea, A. Soffer, N. Uramoto, E. Brown, A. Coden, J. Cooper, A. Inokuchi, B. Iyer, Y. Mass, H. Matsuzawa, L.V. Subramaniam
* In the first run, it took 40 hours to annotate the full MEDLINE corpus, using selected annotators and annotation translation applied to each MEDLINE abstract. Each MEDLINE abstract in XML format was parsed, terms and sentences were tokenized, and Dictionary Lookup was used three times to identify MeSH terms, expanded gene names, and gene function verbs. CAS consumers were used to translate annotations to an indexable format. An additional 120 hours were needed to create a Juru XML index (see "Semantic text search") from the MEDLINE abstract content, metadata, and extracted annotations.
* In the second run, 360 hours were needed to annotate the full MEDLINE corpus, using selected annotators and annotation translation. Each MEDLINE abstract in XML format was parsed, terms and sentences were tokenized, and POS tagging, shallow parsing, and Dictionary Lookup for MeSH terms were applied. CAS consumers were used to create indexing input for text search (without actually creating a Juru XML index), and for creating noun-phrase feature files for on-demand clustering of document search results.
The difference between these two cases is the use of POS tagging and shallow linguistic parsing. Such processes are generally far more computationally intensive than the others: for example, the current shallow parser can parse 14 documents per second, whereas Dictionary Lookup can process between 200 and 1558 documents per second, depending on the size and complexity of the dictionary entries. However, indexing the entire MEDLINE corpus (or any other corpus) is likely to be done very infrequently, and once it is done, the indexes can be incrementally updated as new documents are added to the source collection. There are numerous parameters influencing processing throughput, and several optimization strategies are being explored.
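To make the throughput gap concrete, the following minimal sketch converts a documents-per-second rate into wall-clock hours for a whole corpus. The rates are the ones quoted above; the corpus size is a placeholder assumption (the text does not state how many abstracts were processed), and the function name is hypothetical. The estimate covers a single processing stage running alone, so it is not expected to reproduce the 40-hour and 360-hour figures, which include other stages and depend on hardware not described here.

```python
def hours_to_process(num_docs: int, docs_per_second: float) -> float:
    """Estimate wall-clock hours for one stage to process num_docs documents."""
    if docs_per_second <= 0:
        raise ValueError("docs_per_second must be positive")
    return num_docs / docs_per_second / 3600.0


# ASSUMPTION: illustrative corpus size only; not a figure from the article.
ASSUMED_CORPUS_SIZE = 12_000_000

# Throughput figures quoted in the text (documents per second).
RATES = {
    "shallow parser": 14,
    "Dictionary Lookup (large, complex dictionary)": 200,
    "Dictionary Lookup (small, simple dictionary)": 1558,
}

if __name__ == "__main__":
    for stage, rate in RATES.items():
        hours = hours_to_process(ASSUMED_CORPUS_SIZE, rate)
        print(f"{stage}: about {hours:.1f} hours")
```

Whatever corpus size is assumed, the ratio between stages is fixed by the rates: the shallow parser takes roughly 14 to 111 times longer than Dictionary Lookup over the same documents.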
Quality of BioTeKS entity and relation annotators. It is important to evaluate the quality of entity and relation extraction, but it is also quite difficult, given the state of the art in these technologies. Evaluation requires the manual definition of a test bed of accurately categorized entities or relations, against which the results of text annotations can be compared. This is hard for two reasons: meaning cannot always be assigned unambiguously to language expressions (humans do not always agree on how to categorize a term or phrase), and there are virtually no comprehensive test beds against which to compare performance (see Reference 10 for a discussion of this state of affairs). Nonetheless, the BioTeKS project is pursuing the evaluation of selected annotators. The following are some examples.
The BioAnnotator tool described earlier identifies "biomedical concepts" corresponding to noun phrases that contain one or more keywords in one or more UMLS thesauruses (including MeSH keywords). An evaluation against the GENIA test bed of 670 MEDLINE abstracts (44) produced standard precision, recall, and F-values (45) of 0.87, 0.94 and 0.90, respectively, for approximate matching (i.e., finding phrases that have any GENIA term in them; for a full report, see Reference 34). These are reasonable figures, although identifying more specific categories of terms, such as specific MeSH or UMLS terms, may be more difficult and is the subject of ongoing investigation.
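The reported figures can be checked against the standard definition of the F-value as the harmonic mean of precision and recall. The sketch below is only an illustration: the function names are hypothetical, and the substring test is one simple reading of "approximate matching" as described above, not the procedure actually used in the evaluation.

```python
from typing import Iterable


def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (beta=1 gives the usual F1)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def is_approximate_match(phrase: str, gold_terms: Iterable[str]) -> bool:
    """ASSUMPTION: a phrase counts as a hit if it contains any gold-standard term."""
    lowered = phrase.lower()
    return any(term.lower() in lowered for term in gold_terms)


if __name__ == "__main__":
    # Precision and recall as reported in the text.
    print(f"F = {f_measure(0.87, 0.94):.2f}")
    # Hypothetical phrase and gold term, for illustration only.
    print(is_approximate_match("human IL-2 gene expression", ["IL-2 gene"]))
```

With precision 0.87 and recall 0.94, the harmonic mean rounds to 0.90, which agrees with the F-value reported above.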