Glossary extraction and utilization in the information search and delivery system for IBM Technical Support

IBM Systems Journal, Sept. 2004, by L. Kozakov, Y. Park, T. Fin, Y. Drissi, Y. Doganata, T. Cofino

Another important observation is that the proportion of multiword terms among all the identified terms is far lower than expected. It is not comparable, for instance, to the corresponding proportion in the IBM terminology database, in which multiword terms account for 70 percent of all the software terms.

Analyzing these glossary statistics, we observe that the GlossEx utility has problems in recognizing multiword terms. Table 2 summarizes the observed failures of the glossary extraction process, the possible causes, and suggested solutions. The first three entries in Table 2 address different cases of multiword term recognition. The last entry in the table addresses the recognition of domain-specific abbreviations.

Evaluating the effectiveness of the glossary-extraction process

The glossary we build may not be effective in the search task unless it helps identify salient terms (keywords) in the context of interest. In this section we describe the process of extracting salient terms in a document and its implementation in KWA.

The goal of the KWA utility is to find instances of glossary items in the document. Because each item in the glossary is represented as a canonical form and associated variants, the goal is to find not just the instances of the glossary item's canonical form, but the associated variants as well. The approach used to match single-word and multiword canonical forms and their variants in a glossary file (the output of GlossEx) with words in a document is as follows.

* Inflectional variants match--KWA matches all inflected forms with their lemma forms, even when the inflected forms do not appear in the input glossary files.

* Case-sensitive/insensitive match--Users can control the case sensitivity of the match. When the "respect_case" option is set to "on" in the configuration, KWA performs an exact case match. Otherwise, it matches all case variations of the target word.

* Exact syntactic-category match--KWA matches only words that have the same POS (part-of-speech) category as the target glossary item. This option, which is always on in order to reduce false positives, is especially important for finding instances of abbreviations. For instance, without this option, the abbreviation WAS for WebSphere Application Server would be interpreted as the verb "was" (a past-tense form of "to be") and thus would be matched with all variations of the verb "to be."

* Biggest span match--This is an ad hoc procedure for selecting among overlapping spans. When KWA finds more than one candidate match for a glossary item, each covering a different span, it selects the candidate with the biggest span.

* Abbreviation handling--Technical documents use many abbreviations, so a glossary file contains many abbreviations too. Some abbreviations are ambiguous (IBM is an abbreviation for "International Business Machines," "Intercontinental Ballistic Missile," and "Inclusion Body Myositis").
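Two of the rules in the list above, the case-sensitivity option and the biggest span match, can be illustrated with a minimal Python sketch. It is not the KWA implementation: the function names, the token-index span representation, and the greedy longest-first selection are assumptions made for illustration, and inflectional and POS matching are left out. Only the "respect_case" option name comes from the text.

```python
from typing import List, Tuple

# (start token index, end token index (exclusive), glossary item)
Span = Tuple[int, int, str]


def find_candidates(tokens: List[str], glossary: List[str],
                    respect_case: bool = False) -> List[Span]:
    """Return every token span that equals a glossary item.

    respect_case mirrors the "respect_case" option described above;
    everything else here is a hypothetical simplification.
    """
    norm = (lambda s: s) if respect_case else str.lower
    # Map each normalized multiword key back to the glossary item as written.
    items = {tuple(norm(w) for w in item.split()): item
             for item in glossary if item.split()}
    toks = [norm(t) for t in tokens]
    found = []
    for start in range(len(toks)):
        for key, item in items.items():
            end = start + len(key)
            if tuple(toks[start:end]) == key:
                found.append((start, end, item))
    return found


def biggest_span(candidates: List[Span]) -> List[Span]:
    """Among overlapping candidates, keep the one covering the most tokens."""
    kept: List[Span] = []
    # Visit longer spans first so that shorter overlapping ones are dropped.
    for cand in sorted(candidates, key=lambda c: (-(c[1] - c[0]), c[0])):
        if all(cand[1] <= k[0] or cand[0] >= k[1] for k in kept):
            kept.append(cand)
    return sorted(kept)


tokens = "the WebSphere Application Server log".split()
glossary = ["Application Server", "WebSphere Application Server", "server log"]
matches = biggest_span(find_candidates(tokens, glossary))
```

In this example three glossary items match overlapping stretches of the sentence, and only the three-word span for "WebSphere Application Server" is kept. With respect_case=True the item "server log" would not be a candidate at all, because the document has "Server" with a capital letter.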

The matching rules for abbreviations are as follows. If an abbreviation is unambiguous (i.e., only one definition is found in the glossary file), then KWA always matches the abbreviation with the definition in the glossary file, even if the definition does not appear in the text. For instance, when an unambiguous abbreviation, WAS, appears in a document without its definition, KWA returns "WebSphere Application Server" as well as "WAS" as keywords. If, however, an abbreviation is ambiguous (i.e., more than one definition is included in the glossary file) and no definition is found in the document, no disambiguation is done; that is, KWA returns only the abbreviation as a keyword. If the abbreviation is ambiguous and one of the definitions is found in the document, KWA matches the definition with the abbreviation and returns both. If multiple definitions appear in the document, the definition closest to the abbreviation is linked to the abbreviation.
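The four abbreviation rules can be summarized in a short Python sketch. This is an illustration under stated assumptions, not KWA's code: the function name resolve_abbreviation and the dictionary-based glossary are hypothetical, and "closest" is measured here as the character distance between the first occurrence of the abbreviation and the first occurrence of each definition, which the article does not specify.

```python
from typing import Dict, List


def resolve_abbreviation(abbrev: str, glossary: Dict[str, List[str]],
                         text: str) -> List[str]:
    """Return the keywords produced for one abbreviation found in text."""
    definitions = glossary.get(abbrev, [])
    pos = text.find(abbrev)
    if not definitions or pos < 0:
        return []  # not a glossary abbreviation, or not in the document
    if len(definitions) == 1:
        # Unambiguous: return the definition even if the text omits it.
        return [abbrev, definitions[0]]
    # Ambiguous: consider only definitions that appear in the document.
    present = [(abs(text.find(d) - pos), d) for d in definitions if d in text]
    if not present:
        return [abbrev]  # no disambiguation is attempted
    # One or more definitions present: link the closest one (assumed metric).
    return [abbrev, min(present)[1]]


glossary = {
    "WAS": ["WebSphere Application Server"],
    "IBM": ["International Business Machines", "Inclusion Body Myositis"],
}
unambiguous = resolve_abbreviation("WAS", glossary, "Restart WAS after the update.")
undefined = resolve_abbreviation("IBM", glossary, "IBM released a fix.")
closest = resolve_abbreviation(
    "IBM", glossary,
    "International Business Machines funded a study of "
    "Inclusion Body Myositis (IBM).")
```

Here the unambiguous case yields both "WAS" and its definition, the ambiguous case with no definition in the text yields only "IBM", and the last case links "IBM" to "Inclusion Body Myositis" because that definition lies nearer to the abbreviation in the text.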