Intelligent Forms Processing - technical
IBM Systems Journal, Sept, 1990 by R.G. Casey, D.R. Ferguson
A form is a conventional means for recording data, but it is seldom the final repository for data. Inevitably, at least part of the information represented on forms is transferred elsewhere in order to be recalled for later reference. Before the advent of computers, a bookkeeper's job in any industry consisted largely of filling in the columns of a ledger volume with data from transaction slips. In the computer age, the secondary destination is often a database system.
The increased use of computers has steadily reduced the cost of maintaining and using large volumes of data. Where the information is not generated by computer, or is otherwise not accessible electronically, the labor-intensive task of feeding data into a processing system has come to account for a greater percentage of the overall cost. Eventually, perhaps, most data now entered on paper forms will be entered into databases directly at origination. Before the simple, conventional methods using paper forms can be replaced by electronic methods, however, the problems of user acceptance, data conversion between diverse systems, and legal requirements for record keeping must be overcome.
An image solution to capture the data can avoid some of these problems. Ideally this approach would permit minimal disruption in the way a business processes its transactions using paper records. As shown elsewhere in this issue, [1] many of the requirements for recording, maintaining, and distributing information on documents can be met using optically scanned representations. However, database searching and other machine processing of a document's contents require that the contents be recorded as alphanumeric codes rather than as an image. Thus we are led to consider methods for automatically interpreting the data on the form images in order to create corresponding records in a database.
The automatic encoding of character images is called optical character recognition (OCR). OCR originated to meet the need for high-speed input of data from billing statements and other documents particularly designed for data processing. To achieve the greatest accuracy and performance, special stylized print fonts have been developed and used in the printing of the form data. OCR has also been developed to a high capability in the reading of conventional machine-printed text such as typed pages or magazine articles. In Japan, where key entry is more difficult because of the thousands of different symbols used, considerable progress has been made in the OCR of hand-printed data.
If forms are specially designed for machine processing, and if the data are imprinted according to certain specifications, as in the case of credit card slips, then high accuracy can be achieved. The possibility of interference from background printing can be removed completely by printing the forms in a color such as red or green that can be made invisible to the scanner by the use of light-restricting filters. Redundant data, such as check sums, are sometimes printed along with the data in order to permit error detection and correction.
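As an illustration of how printed redundancy permits error detection, the sketch below uses the Luhn check digit, a scheme widely used for credit card account numbers. The article does not name a particular check sum, so the choice of Luhn and the function names `luhn_check_digit` and `is_valid` are assumptions made for this example. Luhn detects any single misread digit but does not correct it; correction would need a stronger code.

```python
def luhn_check_digit(payload: str) -> int:
    """Return the Luhn check digit for a string of decimal digits."""
    total = 0
    # Walk the payload right to left, doubling every other digit
    # starting with the rightmost one.
    for position, char in enumerate(reversed(payload)):
        digit = int(char)
        if position % 2 == 0:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as adding the two digits of the product
        total += digit
    return (10 - total % 10) % 10


def is_valid(number: str) -> bool:
    """True if the last digit of `number` is the check digit for the rest."""
    if not number.isdigit() or len(number) < 2:
        return False
    return luhn_check_digit(number[:-1]) == int(number[-1])
```

A recognizer could apply `is_valid` to each numeric field it reads and route failures to a human operator: `is_valid("79927398713")` holds, while the same number with one digit misread, `"79927398113"`, is rejected.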
In some cases the forms cannot be redesigned for computer simplicity. Examples of this include archives containing forms from past years. Currently, the authors are pursuing a project involving approximately 40 million birth, death, marriage, and dissolution certificates that have been collecting in the state of California's archives for over 80 years. A key part of the project calls for transfer to a computer database of the data contained in these certificates.
We are also seeking to incorporate similar capability into the regular flow of data within an enterprise. Our system, Intelligent Forms Processing (IFP), a component of the state of California's Vital Records Improvement Project (VRIP), is intended to process data on forms designed according to current practices in form layout and usage. Figure 1 depicts an overview of the IFP. The system will accommodate misplacement of data in the fields, use of conventional print fonts, and the mixing of forms in batch processing. Optical scanning of forms provides input for display applications such as distribution, printing, and reviewing. By converting the data content to symbol codes and organizing it in a database, IFP permits conventional processing applications such as indexing, search, and retrieval based on content, sorting, update, statistics gathering, etc.
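The benefit of converting field contents to symbol codes can be sketched with a small relational table. This is only an illustration using Python's built-in `sqlite3` module, not the database used by IFP; the table name, column names, and sample values are hypothetical stand-ins for recognizer output.

```python
import sqlite3

# Hypothetical field values standing in for OCR output from three certificates.
recognized = [
    ("birth", "SMITH", "1952-03-14"),
    ("death", "GARCIA", "1948-11-02"),
    ("birth", "GARCIA", "1961-07-30"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE certificate (form_type TEXT, surname TEXT, event_date TEXT)"
)
conn.executemany("INSERT INTO certificate VALUES (?, ?, ?)", recognized)
# Indexing by content is possible only because the fields are stored as text.
conn.execute("CREATE INDEX idx_surname ON certificate (surname)")


def find_by_surname(surname):
    """Content-based search and sorting over the encoded fields."""
    rows = conn.execute(
        "SELECT form_type, event_date FROM certificate "
        "WHERE surname = ? ORDER BY event_date",
        (surname,),
    )
    return rows.fetchall()


# Statistics gathering: number of certificates of each type.
counts = dict(
    conn.execute("SELECT form_type, COUNT(*) FROM certificate GROUP BY form_type")
)
```

None of these queries could be run against the scanned images alone, which is why the image archive and the encoded database complement each other.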
A variety of print styles is expected to be encountered. One major constraint in this area, in keeping with present capabilities in OCR, is the focus on reading machine-printed rather than hand-printed characters. If a breakthrough occurs in the OCR of hand printing (or even further in the future, in handwritten script), the general schema of this system can be directly extended to include these capabilities. At present the system does presume some previous knowledge of document typestyles or fonts. Most machine-printed forms are prepared on typewriters or electronic printers using basic print styles with the primary objective of portraying data clearly and legibly. The aesthetic considerations of document composition that are paramount in general publishing are absent, and so the myriad of type styles used in books, magazines, and newspapers is not encountered. In fact, a survey of several thousand documents stored in our initial application reveals that only a handful of fonts predominate. [2]
