Content provided in partnership with
Thomson / Gale

Glossary extraction and utilization in the information search and delivery system for IBM Technical Support

IBM Systems Journal,  Sept, 2004  by L. Kozakov,  Y. Park,  T. Fin,  Y. Drissi,  Y. Doganata,  T. Cofino

We use the following formula (equation 5) to calculate the modified domain-specificity score (TD*) and the modified confidence-level score (C*) for the given term T (see also equations 2 and 3):

(5) $TD^*(T) = TD(T) \times [K_D(T) + 1], \quad K_D(T) = \dfrac{r}{f(T)^q}; \quad C^*(T) = a \times TD(T) \times K_D(T) + C(T).$

The proposed modification of the domain-specificity score is not applied to all the terms extracted from the given narrow-domain collection. To automatically determine which terms can represent the given domain context, we proposed using external domain-specific vocabularies, such as the IBM Terminology dictionary (see Reference 16), for the given domain. The domain-specificity scores for the glossary terms that appear in the appropriate vocabularies are increased according to formula (5). The boosting coefficient (K) in formula (5) is calculated from the term's frequency (f) and two constant parameters (r, q), so that context-specific terms that have lower frequency in the given collection get higher domain-specificity scores. If an important domain-specific term appears only once in a certain document and does not appear in any other document, the term is likely to be salient for that document. In that case, the boosting coefficient K increases the term's domain-specificity score enough to ensure that the term will be selected among the top keywords for the given document. We call this mode of the domain-focused glossary a document view. We also considered another mode of the domain-focused glossary, in which the boosting coefficient K is calculated using the following formula:

$K_D(T) = r \times \log_2 f(T),$

with one constant parameter (r), so that context-specific terms that have higher frequency in the given collection get higher domain-specificity scores. This mode may be useful for extracting bags of terms from narrow domain-specific collections of documents. We call this mode of the domain-focused glossary a category view.
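The two boosting modes can be compared side by side in a short Python sketch. It is only an illustration of formula (5) and the category-view formula: the function names, the `view` argument, and the values chosen for r, q, and a are assumptions made for the example, since the article does not report the constants it used.

```python
import math
from typing import Set, Tuple


def boost_document_view(freq: int, r: float, q: float) -> float:
    """Document view: K_D(T) = r / f(T)^q, so rarer terms get a larger boost."""
    if freq < 1:
        raise ValueError("term frequency must be at least 1")
    return r / (freq ** q)


def boost_category_view(freq: int, r: float) -> float:
    """Category view: K_D(T) = r * log2(f(T)), so frequent terms get a larger boost."""
    if freq < 1:
        raise ValueError("term frequency must be at least 1")
    return r * math.log2(freq)


def apply_boost(td: float, c: float, k: float, a: float) -> Tuple[float, float]:
    """Formula (5): TD* = TD * (K + 1) and C* = a * TD * K + C."""
    return td * (k + 1.0), a * td * k + c


def score_term(term: str, td: float, c: float, freq: int, vocabulary: Set[str],
               r: float = 2.0, q: float = 1.0, a: float = 0.5,
               view: str = "document") -> Tuple[float, float]:
    """Boost only terms found in the external vocabulary; others keep TD and C.

    The defaults for r, q and a are placeholders, not values from the article.
    """
    if term not in vocabulary:
        return td, c
    if view == "document":
        k = boost_document_view(freq, r, q)
    else:
        k = boost_category_view(freq, r)
    return apply_boost(td, c, k, a)


if __name__ == "__main__":
    vocab = {"logical partition"}  # hypothetical vocabulary entry
    print(score_term("logical partition", 0.5, 0.2, 1, vocab, view="document"))
    print(score_term("logical partition", 0.5, 0.2, 8, vocab, view="category"))
    print(score_term("error", 0.5, 0.2, 8, vocab))  # not in vocabulary: unchanged
```

Note that in the category view a term with frequency 1 gets K = 0 and is therefore left unboosted, whereas in the document view the same term receives the largest possible boost, K = r.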

The effectiveness of the domain-focused glossary-extraction process depends on the quality of the domain-specific vocabulary. Effectiveness improves as more domain-specific vocabularies are made available to the glossary-extraction process.

Building multicontext glossaries. The process of building domain-focused glossaries introduces new architectural requirements. The final selection of glossary items must be based on the context of the document from which the items are extracted. The same glossary item may appear in different domain-specific glossaries with different confidence levels. Glossary administration tools are essential for manipulating multiple domain-specific glossaries and for establishing links to domain-specific vocabulary entries.
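One way to picture the requirement that the same item can carry different confidence levels in different glossaries is a mapping keyed first by domain and then by term. The Python sketch below is purely illustrative: the class name, its methods, and the sample domains and scores are hypothetical and do not describe the system's actual data model.

```python
from collections import defaultdict
from typing import Dict, Optional


class MultiContextGlossary:
    """Hypothetical container: domain -> term -> confidence level."""

    def __init__(self) -> None:
        self._by_domain: Dict[str, Dict[str, float]] = defaultdict(dict)

    def add(self, domain: str, term: str, confidence: float) -> None:
        """Record (or overwrite) the confidence of a term within one domain."""
        self._by_domain[domain][term] = confidence

    def confidence(self, domain: str, term: str) -> Optional[float]:
        """Return the term's confidence in the given domain, or None if absent."""
        return self._by_domain.get(domain, {}).get(term)

    def contexts(self, term: str) -> Dict[str, float]:
        """Return every domain in which the term appears, with its confidence."""
        return {
            domain: terms[term]
            for domain, terms in self._by_domain.items()
            if term in terms
        }


if __name__ == "__main__":
    glossary = MultiContextGlossary()
    # Made-up domains and scores, for illustration only.
    glossary.add("storage", "partition", 0.9)
    glossary.add("servers", "partition", 0.6)
    print(glossary.contexts("partition"))
```

Keeping the domain as the outer key makes it straightforward to select items against the context of a single document, while `contexts` supports the administrative view across glossaries.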