Glossary extraction and utilization in the information search and delivery system for IBM Technical Support
IBM Systems Journal, Sept, 2004 by L. Kozakov, Y. Park, T. Fin, Y. Drissi, Y. Doganata, T. Cofino
Example: consider a search system with 500,000 indexed documents, where
1. 1,000 documents contain term T2, "WAS" (WebSphere Application Server);
2. 10 of those documents also contain term T1, "NoResourceException."
Assume that there is one document (A) that contains two occurrences of term T1 and 10 occurrences of term T2, and there is one document (B) that contains one occurrence of term T1 and one occurrence of term T2.
A user submits the following query: NoResourceException in WAS. The system calculates scores of documents A and B for terms T1 and T2, based on the TF-IDF formula, as follows:
$\mathrm{Score}_A(\mathrm{T1}) = 2 \times \log_2(500000/10) = 31.2$;
$\mathrm{Score}_A(\mathrm{T2}) = 10 \times \log_2(500000/1000) = 89.7$;
$\mathrm{Score}_B(\mathrm{T1}) = 1 \times \log_2(500000/10) = 15.6$;
$\mathrm{Score}_B(\mathrm{T2}) = 1 \times \log_2(500000/1000) = 9.0$.
According to these scores, the search system puts document A at the top of the hitlist, and document B goes to the bottom. Now, let us assume that document A contains a customer problem report, and document B contains a link to the patch that should be applied to resolve the problem reported by the customer. In a traditional search system the user, most likely, will never open document B because it does not appear at the top of the hitlist, or even on the first page of the hitlist.
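The arithmetic of this example can be checked with a short Python sketch. It assumes the TF-IDF form used in the example (term frequency times the base-2 logarithm of total documents over documents containing the term). The names `tf_idf`, `DOC_FREQ`, and `TERM_COUNTS` are illustrative, and summing the per-term scores into one document score is an assumption made here for the comparison, not something the paper specifies.

```python
import math

N = 500_000  # total number of indexed documents in the example

# Number of documents containing each term (from the example).
DOC_FREQ = {"NoResourceException": 10, "WAS": 1000}

# Occurrences of each term in documents A and B (from the example).
TERM_COUNTS = {
    "A": {"NoResourceException": 2, "WAS": 10},
    "B": {"NoResourceException": 1, "WAS": 1},
}


def tf_idf(tf, total_docs, docs_with_term):
    """Term frequency times log2 of the inverse document frequency."""
    return tf * math.log2(total_docs / docs_with_term)


def document_score(doc):
    """Assumption: the document score is the sum of its per-term scores."""
    return sum(
        tf_idf(tf, N, DOC_FREQ[term]) for term, tf in TERM_COUNTS[doc].items()
    )


scores = {doc: document_score(doc) for doc in TERM_COUNTS}

for doc, counts in TERM_COUNTS.items():
    for term, tf in counts.items():
        print(doc, term, round(tf_idf(tf, N, DOC_FREQ[term]), 1))
    print(doc, "total", round(scores[doc], 1))
```

Because $\log_2(50000) \approx 15.6$ and $\log_2(500) \approx 9.0$, the frequent but unspecific term WAS dominates the score of document A, which is why A outranks B under any combination that adds the per-term scores.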
The proposed approach changes the way the search system calculates document relevancy scores by introducing context-dependent weights of search terms. The scores for given search terms are calculated based on weights assigned to these terms in accordance with their salience in the given context. If, in the example above, the document scores were calculated based on weights assigned to terms T1 and T2 in the context of WAS runtime exceptions, then documents A and B would have similar scores, because $W_C(\mathrm{T1}) \gg W_C(\mathrm{T2})$, where $W_C(T)$ is the weight of term T in the given context C.
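A minimal sketch of the effect of such weights follows. The paper does not give the exact scoring formula, so two things here are assumptions: the weight is applied as a multiplier on each term's TF-IDF score, and the weight values in `CONTEXT_WEIGHTS` are invented for illustration. All names are hypothetical.

```python
import math

N = 500_000
DOC_FREQ = {"NoResourceException": 10, "WAS": 1000}
TERM_COUNTS = {
    "A": {"NoResourceException": 2, "WAS": 10},
    "B": {"NoResourceException": 1, "WAS": 1},
}

# Hypothetical weights for the context "WAS runtime exceptions",
# chosen only so that W_C(T1) >> W_C(T2).
CONTEXT_WEIGHTS = {"NoResourceException": 1.0, "WAS": 0.01}


def score(doc, weights=None):
    """Sum of per-term TF-IDF scores, optionally scaled by context weights.

    Assumption: the context weight multiplies the term's TF-IDF score.
    """
    total = 0.0
    for term, tf in TERM_COUNTS[doc].items():
        weight = 1.0 if weights is None else weights.get(term, 0.0)
        total += weight * tf * math.log2(N / DOC_FREQ[term])
    return total


plain_ratio = score("A") / score("B")
weighted_ratio = score("A", CONTEXT_WEIGHTS) / score("B", CONTEXT_WEIGHTS)

print(f"unweighted A/B ratio: {plain_ratio:.2f}")
print(f"weighted A/B ratio:   {weighted_ratio:.2f}")
```

Under these assumed weights the ratio between the scores of A and B falls from about 4.9 to about 2.0, since the score is now driven by the salient term NoResourceException rather than by the repeated occurrences of WAS. How close the two scores end up depends on the actual weights and on how the system combines them.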
To assign appropriate weights to query terms, the proposed method uses domain-focused glossaries that are created from the corpus of documents, using available corporate taxonomies and specialized technical vocabularies. As shown in the section "Domain-focused glossaries," in a domain-focused glossary each term's score is calculated based on the term's salience in the given context. For example, the term NoResourceException has a high confidence level in the context of WAS runtime exceptions, but may be ignored in the context of WAS product sales, whereas the term WAS is more salient in the latter context.
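One way to picture the lookup of query-term weights is a glossary keyed by context, sketched below. The data structure, the context names, and every weight value are hypothetical; the paper describes the glossaries but not their representation. Terms missing from a context's glossary are given weight 0 here, which is one reading of "may be ignored."

```python
# Hypothetical domain-focused glossaries: context -> {term: salience weight}.
GLOSSARIES = {
    "WAS runtime exceptions": {"NoResourceException": 0.9, "WAS": 0.1},
    "WAS product sales": {"WAS": 0.8},
}


def query_term_weights(query, context):
    """Return the weight of each query term in the given context.

    Assumption: a term absent from the context's glossary is ignored,
    which is modeled as a weight of 0.0.
    """
    glossary = GLOSSARIES[context]
    return {term: glossary.get(term, 0.0) for term in query.split()}


query = "NoResourceException in WAS"
runtime_weights = query_term_weights(query, "WAS runtime exceptions")
sales_weights = query_term_weights(query, "WAS product sales")

print(runtime_weights)
print(sales_weights)
```

The same query thus yields different term weights depending on the context in which it is evaluated: NoResourceException dominates in the runtime-exceptions glossary and drops out in the sales glossary, where WAS carries the weight.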
Conclusion
In this paper, we discussed the need for enhancing traditional glossary extraction processes for IBM's technical support corpus, the methods we used, and the results we achieved. Doing so increased the usefulness of the search and delivery system for IBM's technical information as well as customer satisfaction with the IBM support site. This also improved the tools we use for managing multidomain glossaries.
Central to this work, we introduced the idea of domain-focused glossaries. We discussed the importance of tuning the process of generating glossaries to the specificity of the corpora. We introduced the method of biasing or focusing a glossary to the context of the domain from which the glossary is extracted. This process requires building dictionaries of domain-specific terms. The method of biasing comprises modifying the weights of the glossary terms consistent with the domain context information.