Glossary extraction and utilization in the information search and delivery system for IBM Technical Support
IBM Systems Journal, Sept, 2004 by L. Kozakov, Y. Park, T. Fin, Y. Drissi, Y. Doganata, T. Cofino
A more important filtering step is pre-modifier filtering (a pre-modifier precedes the term it describes). Many pre-modifiers in noun phrases, even in domain-specific noun phrases, act as general-purpose modifiers rather than carrying domain-specific information. For instance, the pre-modifier "remote" in the glossary item "remote server" is domain-specific, but "new" in "new server" is not.
The easiest way to filter non-domain pre-modifiers might be to keep a "stop-word" list and remove every pre-modifier on that list from candidate glossary items. However, some modifiers are domain-specific in one domain but general in others. We therefore decide automatically whether a pre-modifier should be filtered, based on the domain-specificity (D) of the pre-modifier and the association (A) of the pre-modifier with the noun it modifies. The domain-specificity of a pre-modifier a is computed as the ratio of the probability of the word in a domain corpus d to its probability in a general corpus g; that is, D(a) = p_d(a) / p_g(a). The association of the pre-modifier with the head noun n is the conditional probability of the head noun given the modifier; that is, A(n, a) = p(n|a).
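The two measures can be illustrated with a small Python sketch. It estimates D(a) and A(n, a) from raw counts. The text does not say how the two scores are combined into a filtering decision, so the rule used here (keep a pre-modifier if either score clears a threshold), the threshold values, the add-one smoothing, and all function names are assumptions made purely for illustration.

```python
from collections import Counter


def domain_specificity(word, domain_counts, general_counts, smoothing=1.0):
    """D(a) = p_d(a) / p_g(a), estimated from word counts.

    Add-one smoothing is an assumption of this sketch; it avoids
    division by zero for words absent from the general corpus.
    """
    vocab = len(set(domain_counts) | set(general_counts))
    p_d = (domain_counts[word] + smoothing) / (
        sum(domain_counts.values()) + smoothing * vocab
    )
    p_g = (general_counts[word] + smoothing) / (
        sum(general_counts.values()) + smoothing * vocab
    )
    return p_d / p_g


def association(noun, modifier, pair_counts, modifier_counts):
    """A(n, a) = p(n | a) = count(a followed by n) / count(a)."""
    total = modifier_counts[modifier]
    return pair_counts[(modifier, noun)] / total if total else 0.0


def keep_premodifier(modifier, noun, domain_counts, general_counts,
                     pair_counts, modifier_counts, min_d=2.0, min_a=0.5):
    """Hypothetical decision rule: keep if either score is high enough."""
    d = domain_specificity(modifier, domain_counts, general_counts)
    a = association(noun, modifier, pair_counts, modifier_counts)
    return d >= min_d or a >= min_a


# Toy counts, invented for the example.
domain_counts = Counter({"remote": 30, "new": 10, "server": 60})
general_counts = Counter({"remote": 5, "new": 80, "server": 15})
pair_counts = Counter({("remote", "server"): 20, ("new", "server"): 2})
modifier_counts = Counter({"remote": 30, "new": 10})

keep_remote = keep_premodifier("remote", "server", domain_counts,
                               general_counts, pair_counts, modifier_counts)
keep_new = keep_premodifier("new", "server", domain_counts,
                            general_counts, pair_counts, modifier_counts)
```

With these toy counts, "remote" is much more frequent in the domain corpus than in the general corpus and is kept, while "new" scores low on both measures and is filtered.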
Glossary-item aggregation. The same concept may appear in text in a number of different variations, such as misspellings or abbreviations. We attempt to identify all conceptually identical expressions of a candidate glossary item and aggregate them into one glossary item, so that they can be treated by applications as one.
GlossEx currently identifies and aggregates inflectional variants, orthographic variants, compounding variants, misspellings, and abbreviations. We select one of the forms as the canonical form and make the other forms its variants. The aggregation step also combines the frequencies of the different forms so that glossary items with many variant occurrences may be assigned higher confidence values.
* Inflectional variants: singular-plural forms and different tenses (human-performance criterion and human performance-criteria)
* Orthographic variants: glossary items with special characters such as hyphens or dashes (audio/visual input and audio-visual input)
* Compounding variants: compounding form and lexicalized form (passenger airbag and passenger air bag)
* Misspelling variants: correct spelling and misspelling or alternative spelling (accelarator and accelerator, nitroglycerine and nitroglycerin)
* Abbreviations: abbreviated form and full form (R1H and radial first harmonic)
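A minimal Python sketch of the aggregation idea follows. It covers only the three variant types that can be approximated by string normalization (orthographic, compounding, and simple singular-plural variants); misspellings and abbreviations would need other techniques and are left out. The normalization key, the crude plural stripping, and the choice of the most frequent form as the canonical form are all assumptions of this sketch, not a description of how GlossEx does it.

```python
import re
from collections import defaultdict


def singularize(word):
    """Very rough plural stripping; a real system would use a morphological analyzer."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


def variant_key(term):
    """Map a term to a key shared by its orthographic, compounding, and plural variants."""
    # Split on whitespace, hyphens, and slashes, then join without separators
    # so that "air bag", "air-bag", and "airbag" collapse to the same key.
    words = re.split(r"[\s\-/]+", term.lower().strip())
    return "".join(singularize(w) for w in words if w)


def aggregate(term_freqs):
    """Group variants, pick a canonical form, and combine their frequencies."""
    groups = defaultdict(dict)
    for term, freq in term_freqs.items():
        groups[variant_key(term)][term] = freq

    result = {}
    for forms in groups.values():
        # Assumption: the most frequent form becomes the canonical form
        # (ties broken by string order).
        canonical = max(forms, key=lambda t: (forms[t], t))
        result[canonical] = {
            "frequency": sum(forms.values()),
            "variants": sorted(t for t in forms if t != canonical),
        }
    return result


# Toy frequencies, invented for the example.
glossary = aggregate({
    "passenger airbag": 7,
    "passenger air bag": 3,
    "audio/visual input": 2,
    "audio-visual input": 5,
    "remote server": 4,
    "remote servers": 1,
})
```

Here the six surface forms collapse into three glossary items, each carrying the summed frequency of its variants, which is what allows items with many variant occurrences to receive higher confidence later.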
Note that GlossEx currently does not perform deep semantic processing, and thus it can neither identify synonyms nor handle polysemous glossary items. Instead, we provide a GUI (graphical user interface) tool with which users can manually add or aggregate synonymous glossary items for their applications.
Glossary-item ranking and selection. Having obtained candidate glossary items, we rank them before selecting the final set. We judge the goodness of each term by how strongly the item is related to the given domain (its domain specificity) and by the degree of association among all words in the item's canonical form (hereafter called term cohesion). The confidence of a term T, C(T), is defined by equation (2).
