Glossary extraction and utilization in the information search and delivery system for IBM Technical Support
IBM Systems Journal, Sept, 2004 by L. Kozakov, Y. Park, T. Fin, Y. Drissi, Y. Doganata, T. Cofino
As information technology (IT) advances, the number of products released and their associated documents increase at a rapid pace. Technical authors, and in particular authors of documents for product support, often use terms and words that are not found in general-purpose dictionaries. A situation may arise in which the meaning given to technical terms varies within the technical community.
Terminology frequently changes with the introduction of new products to the market. The terminology database in a technical support organization is perhaps one of the most frequently updated databases of this kind. The IBM Technical Support knowledge base, for example, contains specifications, problem descriptions, proposed solutions, and updates on thousands of hardware and software products.
Glossaries can alleviate this problem. Glossaries help build a language common to people who search for information and people who author documents, thus increasing the effectiveness of search and retrieval systems. Because glossaries are large and change rapidly, generating them manually is costly. We describe in this paper the results of our investigation into automated glossary extraction. We also describe how we used an improved glossary extraction process to build and deploy a number of glossaries within the IBM Technical Support system used by customers. The business justification for building glossaries is to increase the satisfaction of customers who use the IBM Technical Support Web site.
Technical documents pertaining to IBM products and services are processed, indexed, and stored in a master repository, known as the electronic support knowledge base (eSKB) or the knowledge repository, which contains about a million documents in several languages. We have used eSKB as the corpus for our glossary extraction process, which integrates a number of tools and components into a complete solution.
The effectiveness of glossary extraction depends strongly on domain-specific resources, such as dictionaries, and also on the rules that generate labels or error codes, like APAR (authorized program analysis report) numbers or SQL (structured query language) errors. Glossary extraction processes that are trained on general corpora like TREC (Text Retrieval Conference) (1), and that do not take into account a specific domain such as technical support, produce less useful glossaries for technical support applications. We found that domain-focused glossary extraction, where the term weights depend on document context, improves the effectiveness of the glossary.
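As an illustration of how rule-generated labels might be recognized during extraction, the following minimal sketch uses regular expressions. The two patterns (two letters followed by five digits for an APAR number, and `SQL` followed by four or five digits and a letter for an SQL error code) are assumptions chosen for the example; they are not taken from the paper, and `find_labels` is a hypothetical helper.

```python
import re
from typing import List, Tuple

# Assumed label shapes, for illustration only (not specified in the paper):
#   APAR number: two uppercase letters followed by five digits, e.g. "IY12345"
#   SQL error:   "SQL", four or five digits, one uppercase letter, e.g. "SQL0803N"
APAR_PATTERN = re.compile(r"\b[A-Z]{2}\d{5}\b")
SQL_ERROR_PATTERN = re.compile(r"\bSQL\d{4,5}[A-Z]\b")


def find_labels(text: str) -> List[Tuple[str, str]]:
    """Return (kind, label) pairs for every rule-generated label found in text."""
    labels = [("APAR", m.group()) for m in APAR_PATTERN.finditer(text)]
    labels += [("SQL", m.group()) for m in SQL_ERROR_PATTERN.finditer(text)]
    return labels


if __name__ == "__main__":
    print(find_labels("Apply the fix for IY12345 if SQL0803N is returned."))
```

Treating such labels as a separate class lets an extractor keep them out of the ordinary term statistics, where their one-off spellings would otherwise look like noise.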
In this paper we show several ways to improve the usefulness of the glossary and to make it more effective and robust for technical-support applications. To demonstrate the value of our approach we implemented Keyword Analyzer (KWA), an application that identifies salient terms in a document by using weighted terms from the glossary.
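The idea of identifying salient terms from a weighted glossary can be sketched as follows. This is not the KWA implementation, which is described later in the paper; the scoring rule (glossary weight multiplied by occurrence count), the function name `salient_terms`, and the sample glossary are all assumptions made for the example.

```python
import re
from typing import Dict, List, Tuple


def salient_terms(text: str, glossary: Dict[str, float],
                  top_n: int = 5) -> List[Tuple[str, float]]:
    """Rank glossary terms that occur in text.

    Assumed scoring rule: score = glossary weight * number of occurrences.
    Matching is case-insensitive and respects word boundaries.
    """
    lowered = text.lower()
    scores: Dict[str, float] = {}
    for term, weight in glossary.items():
        pattern = r"\b" + re.escape(term.lower()) + r"\b"
        count = len(re.findall(pattern, lowered))
        if count:
            scores[term] = weight * count
    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


if __name__ == "__main__":
    # Hypothetical glossary weights, for illustration only.
    glossary = {"DB2": 2.0, "Lotus Notes": 1.5, "server": 0.2}
    text = "DB2 server fails when the DB2 client connects to Lotus Notes."
    print(salient_terms(text, glossary))
```

In this sketch a general word such as "server" still matches, but its low glossary weight keeps it below the product names in the ranking.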
The rest of the paper is structured as follows. In the next section we present an overview of the architecture of the information search and retrieval system used by IBM Technical Support. Then, we summarize the approach to glossary extraction from Reference 2, which we will use as our starting point. In the following section we describe our implementation of the automated glossary extraction process for the technical support corpus and discuss the results we obtained. We observe that implementing the glossary extraction process without considering the specifics of the domain may lead to some erroneous results, and consequently, we present suggestions for improvement. Next we introduce KWA, the application we use to evaluate the effectiveness of our approach. We then propose the concept of a domain-focused glossary, in which glossary items are selected and ranked based on context, and we show some quantitative results from our tests. In the same section we also discuss a possible application of the domain-focused glossary: the improvement of document-relevancy ranking in corporate search systems. We summarize our work in the concluding section.
Overview of the IBM Technical Support Enablement Architecture
The IBM Technical Support Enablement Architecture, whose implementation is nicknamed dBlue, is an advanced information search and delivery architecture for the Web-based system used by IBM Technical Support. (3) One of the goals of this system is to help customers find the desired information among the 2.5 million Web pages stored on the system. The dBlue system, which integrates effective technologies in storing, searching, and retrieving information, provides a set of user-oriented support services used by all IBM support sites.
The architecture connects three important types of elements from the information search world--information sources, search engines, and end users (see Figure 1). This is done through a set of components called the Knowledge Builder, which includes a content creation layer (blue blocks), a search management layer (green blocks) and a presentation management layer (red blocks). Information sources are any structured and unstructured data sources such as document repositories, DB2 and Lotus Notes databases, Web sites, and so forth.
The first challenge of the architecture was to institute a consistent structure for content creation because the huge amount of support content that already existed was not well suited for searching. Then, of course, both existing and new content had to be migrated to this structure. The second challenge was determining how to store this information in a way that was scalable and flexible. The third challenge was how to retrieve it dynamically and efficiently.
The main blocks of this architecture are shown in Figure 1. Content is extracted from the information sources using the Content Extractor and mapped to a unified XML (eXtensible Markup Language) schema. Then it is processed by the Content Processor and stored into the eSKB. The search management layer enables the connection between the Knowledge Builder and search engines. The Query Manager and Query Builder are responsible for processing search queries, collecting query-related parameters from the configuration management layer, and building the final search query.
The presentation management layer provides several levels of customization, based on country, organizational unit, and individual user profiles. The View Builder constructs a customized view of search hitlists and documents. When a user requests a view of a specific document, this request is processed by the View Builder, which accesses the eSKB to retrieve the document content, and builds a coherent document view.
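The content flow described above (extract, map to a unified XML schema, process, store, then build a view) can be sketched in a few lines. This is a toy model under stated assumptions, not the dBlue implementation: the element names (`document`, `title`, `body`), the whitespace-normalizing processing step, the in-memory `KnowledgeBase` class, and all function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Dict


def extract_to_xml(record: Dict[str, str]) -> ET.Element:
    """Content Extractor stand-in: map a source record to an assumed XML schema."""
    doc = ET.Element("document", attrib={"id": record["id"], "source": record["source"]})
    ET.SubElement(doc, "title").text = record["title"]
    ET.SubElement(doc, "body").text = record["body"]
    return doc


def process(doc: ET.Element) -> ET.Element:
    """Content Processor stand-in: here it only normalizes whitespace."""
    for child in doc:
        if child.text:
            child.text = " ".join(child.text.split())
    return doc


class KnowledgeBase:
    """In-memory stand-in for the eSKB: stores serialized XML keyed by document id."""

    def __init__(self) -> None:
        self._docs: Dict[str, str] = {}

    def store(self, doc: ET.Element) -> None:
        self._docs[doc.get("id")] = ET.tostring(doc, encoding="unicode")

    def retrieve(self, doc_id: str) -> ET.Element:
        return ET.fromstring(self._docs[doc_id])


def build_view(kb: KnowledgeBase, doc_id: str) -> str:
    """View Builder stand-in: fetch a document and render a plain-text view."""
    doc = kb.retrieve(doc_id)
    return f"{doc.findtext('title')}\n{doc.findtext('body')}"


if __name__ == "__main__":
    kb = KnowledgeBase()
    record = {"id": "d1", "source": "notes",
              "title": "  Disk   error ", "body": "Replace\nthe  drive."}
    kb.store(process(extract_to_xml(record)))
    print(build_view(kb, "d1"))
```

Mapping every source to one schema before storage is what lets the later stages (processing, storage, view building) stay independent of where a document originally came from.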