After an optical character reader initially identifies individual characters in a character array by matching their images with dictionary patterns, its character recognition system assigns upper and lower labels to the characters according to their highest and lowest positions relative to the character array. If these assigned labels contradict the labels preassigned to the identified characters, corrections are made accordingly, and certain frequently occurring types of errors are checked in terms of these labels. Upper and lower case letters which are shaped similarly and certain similar-looking symbols can thus be correctly identified.

Patent
   4860376
Priority
Mar 04 1987
Filed
Mar 04 1988
Issued
Aug 22 1989
Expiry
Mar 04 2008
Assg.orig
Entity
Large
16
9
EXPIRED
1. In a character recognition system for an optical alphanumeric character reader for recognizing characters by reading a pictorial image, extracting therefrom a line image, extracting therefrom character images of individual characters, and matching said character images with dictionary patterns, the improvement wherein said system comprises
label setting means for determining relative positions of the highest and lowest positions of individual extracted characters in a character array, assigning an upper label and a lower label to each of said individual extracted characters according to said highest and lowest positions, correspondence between each of alphanumerical characters and symbols and its upper and lower labels being predefined, correcting, if the result of initial character identification of a character in said character array by said optical character reader contradicts with said upper or lower label assigned to said character, said assigned upper or lower label to the labels corresponding to said recognized character, and storing said corrected labels and result of recognition in a memory means, and
judging means for checking whether there is a contradiction between said corrected labels stored by said label setting means and the results of initial character identification by said optical character reader regarding a preselected set of characters and symbols, and correcting, if there is such a contradiction, said results of initial character identification according to said upper and lower labels in said contradiction.
2. The character recognition system of claim 1 wherein said preselected set of characters and symbols include alphabetic letters with similarly shaped upper and lower case figures, the period, the comma, the hyphen, the apostrophe and the quotation mark.

This invention relates to a character recognition system for an optical alphanumeric character reader and more particularly to such a system for correctly identifying letters in upper and lower cases and certain symbols.

The Roman alphabet includes letters whose upper and lower cases are shaped similarly to each other, such as "C" and "c" or "S" and "s". Some symbols look alike, such as "," and "'", while some symbols are easily misidentified as two different symbols such as """ and an arrow. Such characters and symbols cannot be identified easily by an optical character reader merely by matching their images with patterns in a dictionary memory. Prior art systems recognize characters and symbols by determining detection lines from a histogram taken in the horizontal direction from an extracted line as shown in FIG. 12, or by obtaining a threshold value on the basis of line height. If there are no clearly visible valleys in the horizontal histogram as shown in FIG. 13, or if an optimum threshold value cannot be determined from the line height as shown in FIGS. 14A-14C, however, it is very difficult to recognize characters and symbols correctly.

It is therefore an object of the present invention to provide a character recognition system for an optical alphanumeric character reader for correctly recognizing similarly shaped upper and lower case letters and various other symbols.

The above and other objects of the present invention are achieved by providing an optical character reader with a character recognition system including label setting means and judging means. After characters are initially identified from individual character images extracted from a line, upper and lower labels are assigned to each of the extracted characters in a character array according to their highest and lowest positions relative to that character array. Corrections are thereafter made if these assigned labels contradict the labels preassigned to the identified characters. Next, certain frequently occurring types of errors are checked in terms of these labels such that letters with similarly shaped upper and lower cases and sets of similar-looking symbols can be correctly identified.

The accompanying drawings, which are incorporated in and form a part of the specification, illustrate an embodiment of the present invention and, together with the description, serve to explain the principles of the invention. In the drawings:

FIG. 1 is a flowchart of character recognition according to the present invention,

FIG. 2 is a block diagram of an optical character reader incorporating a character recognition system embodying the present invention,

FIG. 3 is an example of binary image of a word,

FIGS. 4 and 5 show initially assigned label values for the binary image shown in FIG. 3,

FIG. 6 shows an example of reassignment of label numbers according to the present invention,

FIG. 7 shows an example of correction between upper and lower case letters according to the present invention,

FIGS. 8A and 8B show examples of identification of a period and a comma,

FIGS. 9A and 9B show examples of identification of an apostrophe,

FIGS. 10A and 10B show examples of identification of a comma and an apostrophe,

FIG. 11 shows an example of identification of a hyphen,

FIG. 12 shows a prior art method of obtaining detection line from a histogram in the horizontal direction,

FIG. 13 shows a situation where no clear valley appears in a histogram of a line in the horizontal direction, and

FIGS. 14A, 14B and 14C show examples where an optimum threshold value cannot be set from the line height.

In what follows, the present invention is explained by way of an example with reference both to FIG. 1, which is a flowchart of an operation embodying the present invention, and to FIG. 2, which is a block diagram of an optical character reader incorporating the present invention. After a document is placed on a glass table, it is read (scanned) by a line sensor of a scanner 1 and the scanned pictorial image of the document is converted into binary data by an analog-to-digital converter. The line sensor is driven in the secondary scanning direction and the entire document is scanned (Step S1). The binary data of the pictorial image scanned by the scanner 1 are temporarily stored in an image buffer 2, from which a line is extracted by a recognition control unit 3 containing a microprocessor, and the extracted line is stored in a line buffer 4 (Step S2). Next, the recognition control unit 3 extracts individual characters from the binary patterns stored in the line buffer 4 and stores them in a single character buffer 5. At the same time, the positions and the sizes of these characters are measured (Step S3) and the recognition control unit 3 extracts their characteristics and stores them in a single character characteristics buffer 6. A recognition unit 7 then identifies the characters by matching these characteristics with patterns stored in a dictionary memory 8 (Step S4). In this process of identifying the characters, the recognition control unit 3 serves to identify characters and symbols having similar shapes by using the relative positional relationships and sizes of extracted characters and assigning position labels to each character, as will be explained in more detail below.

To start, the recognition control unit 3 assigns an upper label and a lower label, held respectively in its upper and lower label buffers, to each of the characters in an extracted word by considering their relative positional relationships, their sizes and the results of the initial recognition in Step S4 (Step S11). For each extracted word, the highest and the lowest positions of its characters are determined as $U = \{u_1, u_2, \ldots, u_n\}$ and $D = \{d_1, d_2, \ldots, d_n\}$, where $n$ is the number of characters in the word, and the highest position $M_u$ and the lowest position $M_d$ are determined respectively from $U$ and $D$, as illustrated in FIG. 3 for a scanned character array "computer,". Next, a threshold value $T$ is defined, for example, by $T = (M_d - M_u)/7$, and "2" is set, as shown in FIG. 4, in the upper or lower label buffer corresponding to each character for which the condition $|u_i - M_u| < T$ or $|d_i - M_d| < T$, respectively, is satisfied (where $i = 1, 2, \ldots, n$). Average upper and lower positions $h_u$ and $h_d$ are then calculated from the remaining highest and lowest positions $u_i$ and $d_i$, and the same threshold value $T$ as defined above is used to determine the values of $u_i$ and $d_i$ which respectively satisfy the conditions $|u_i - h_u| < T$ or $|d_i - h_d| < T$ (where $i = 1, 2, \ldots, n$). Numerals "1" are set, as shown in FIG. 5, in the upper and lower label buffers corresponding to the highest and lowest positions $u_i$ and $d_i$ which satisfy either of the above inequalities. If there is a contradiction between the labels thus assigned and the results of the initial character recognition by the recognition unit 7 from Step S4, labels are reassigned according to the results of recognition (Step S13). FIG. 6 shows an example of this step. In this example, all four characters in the word "talk" are assigned a lower label number of "2" in Step S12, but these four characters as recognized in Step S4 should all be assigned "1" as lower labels. This contradiction is removed by changing all four lower labels from "2" to "1" as shown in FIG. 6.
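The label assignment and reassignment described above can be sketched in Python as follows. This is only an illustrative sketch, not the patented implementation: the function names, the convention that pixel rows are counted downward from the top of the image, the default label "0" for positions that satisfy neither inequality, and the small `EXPECTED` table of preassigned labels are all assumptions made for the example.

```python
from statistics import mean

def assign_labels(tops, bottoms, divisor=7):
    """Return (upper, lower) label lists for one word.

    tops[i] / bottoms[i] are the highest / lowest pixel rows of character i.
    Rows are assumed to be counted downward from the top of the image.
    """
    m_u = min(tops)             # highest position in the word (M_u)
    m_d = max(bottoms)          # lowest position in the word (M_d)
    t = (m_d - m_u) / divisor   # threshold T = (M_d - M_u) / 7

    def label(values, extreme):
        labels = [0] * len(values)      # 0 = neither inequality holds (assumed default)
        for i, v in enumerate(values):
            if abs(v - extreme) < t:    # close to M_u or M_d
                labels[i] = 2
        rest = [v for v, lab in zip(values, labels) if lab == 0]
        if rest:
            avg = mean(rest)            # h_u or h_d from the remaining positions
            for i, v in enumerate(values):
                if labels[i] == 0 and abs(v - avg) < t:
                    labels[i] = 1
        return labels

    return label(tops, m_u), label(bottoms, m_d)

# Hypothetical excerpt of the predefined character-to-label table.
EXPECTED = {"t": (2, 1), "a": (1, 1), "l": (2, 1), "k": (2, 1)}

def reconcile(chars, upper, lower, expected=EXPECTED):
    """Overwrite measured labels that contradict the recognized character."""
    upper, lower = list(upper), list(lower)
    for i, ch in enumerate(chars):
        if ch in expected and (upper[i], lower[i]) != expected[ch]:
            upper[i], lower[i] = expected[ch]
    return upper, lower
```

With made-up positions for "talk", `assign_labels([2, 10, 0, 0], [20, 20, 20, 20])` gives upper labels `[2, 1, 2, 2]` and lower labels `[2, 2, 2, 2]`, because every character bottom coincides with the lowest position of the word. `reconcile("talk", ...)` then changes the lower labels to `[1, 1, 1, 1]`, which mirrors the situation described for FIG. 6.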

After the labels are thus assigned, the results of recognition are examined and, if necessary, corrected. Firstly, it is known that the upper and lower cases of some letters are similarly shaped, such as "O" and "o" or "V" and "v". For such letters, the labels for the lower case letter are often (1,1) while those for the upper case letter are often (2,1) (the upper label appearing first inside parentheses hereinafter). This relationship is used to make corrections between upper and lower cases (Step S14). An example of this type of correction is illustrated in FIG. 7, wherein an upper case character "O" with labels (1,1) is corrected to its corresponding lower case character "o".
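A minimal sketch of the case correction of Step S14 is given below. The set of similarly shaped letters is limited to the pairs named in this description, and the function name is hypothetical; an actual system would use its own preselected set.

```python
# Letters named in the description whose two cases share a shape (illustrative only).
SIMILAR_CASE = set("cosv")

def correct_case(ch, labels):
    """Pick the case of a similarly shaped letter from its (upper, lower) labels."""
    if ch.lower() not in SIMILAR_CASE:
        return ch               # not in the preselected set: leave untouched
    if labels == (1, 1):
        return ch.lower()       # lower case letters often carry (1,1)
    if labels == (2, 1):
        return ch.upper()       # upper case letters often carry (2,1)
    return ch                   # any other label pair: keep the initial result
```

For instance, `correct_case("O", (1, 1))` returns `"o"`, the correction illustrated in FIG. 7, while a letter outside the set is returned unchanged.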

Secondly, it is known that "," and "." are difficult to identify. Of the characters with labels (0,1) or (0,2), those with the highest position below the center line of a character having labels (1,1) such as "e" and "a" are identified as "," or "." as shown in FIGS. 8A and 8B (Step S15).

Thirdly (Step S16), if there is a character with labels (2,0) and having its lowest position above the center line of a character having labels (1,1), it is identified as "'" and if there are two of them next to each other, they are recognized as """. If a character is originally identified as "," but its labels are (2,0), it is corrected to "'" as shown in FIG. 10A. Similarly, if a character is originally identified as "'" but its labels are (0,2), it is corrected to "," as shown in FIG. 10B.
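The position tests of Steps S15 and S16 and the comma/apostrophe corrections can be sketched as follows. The sketch assumes that pixel rows grow downward, that the bounds of a neighbouring (1,1) character are supplied by the caller, and that "next to each other" means consecutive entries in the character list; all names are hypothetical.

```python
LOW_LABELS = {(0, 1), (0, 2)}   # candidates for "," or "." (Step S15)
HIGH_LABELS = {(2, 0)}          # candidates for "'" (Step S16)

def mark_position(labels, top, bottom, ref_top, ref_bottom):
    """Classify a small mark as 'low', 'high' or None.

    ref_top / ref_bottom bound a character with labels (1,1), such as "e".
    Rows are assumed to grow downward, so "below" means a larger row number.
    """
    center = (ref_top + ref_bottom) / 2
    if labels in LOW_LABELS and top > center:
        return "low"            # highest position lies below the center line
    if labels in HIGH_LABELS and bottom < center:
        return "high"           # lowest position lies above the center line
    return None

def fix_comma_apostrophe(chars, labels):
    """Apply the corrections of FIGS. 10A and 10B, then merge paired apostrophes."""
    out = []
    for ch, lab in zip(chars, labels):
        if ch == "," and lab == (2, 0):
            ch = "'"            # a "comma" sitting high is an apostrophe
        elif ch == "'" and lab == (0, 2):
            ch = ","            # an "apostrophe" sitting low is a comma
        if ch == "'" and out and out[-1] == "'":
            out[-1] = '"'       # two adjacent apostrophes form a quotation mark
        else:
            out.append(ch)
    return out
```

As a usage example, `fix_comma_apostrophe(["i", "t", ",", "s"], [(2, 1), (2, 1), (2, 0), (1, 1)])` returns `["i", "t", "'", "s"]`.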

Fourthly (Step S17), in the case of a character with labels (0,0) having the highest and lowest positions as shown in FIG. 11, it is identified as "-" if the ratio of its length to its height is 3 or greater, and as "." if this ratio is less than 3. After corrections and identifications described above are completed, the identified results are converted into the JIS (Japanese Industrial Standard) code (Step S18).
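The length-to-height test of Step S17 reduces to a single comparison. The sketch below assumes the width and height are measured in pixels from the extracted character image; the function name and the `ratio_threshold` parameter are hypothetical.

```python
def hyphen_or_period(labels, width, height, ratio_threshold=3):
    """Return "-" or "." for a mark with labels (0,0), else None."""
    if labels != (0, 0):
        return None                     # the test applies only to (0,0) marks
    if height <= 0:
        raise ValueError("height must be positive")
    return "-" if width / height >= ratio_threshold else "."
```

A mark 12 pixels long and 3 pixels high has a ratio of 4 and is identified as "-", whereas a 3 by 3 mark has a ratio of 1 and is identified as ".".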

With a character recognition system of the present invention, it becomes possible to correctly identify alphanumeric characters and symbols which have been difficult to identify by prior art methods depending only on matching with dictionary patterns. The examples used above in the description of the present invention are intended to be illustrative and not limitative. Modifications and variations which may be apparent to a person skilled in the art are included within the scope of this invention.

Konya, Minehiro; Katsurada, Morihiro; Tanaka, Hideaki

Patent Priority Assignee Title
5048113, Feb 23 1989 RICOH COMPANY, LTD , NO 3-6, 1-CHOME, NAKAMAGOME, OTA-KU, TOKYO, 143 JAPAN, A CORP OF JAPAN Character recognition post-processing method
5212739, Oct 17 1990 Hewlett-Packard Company Noise tolerant optical character recognition system
5237627, Jun 27 1991 Hewlett-Packard Company Noise tolerant optical character recognition system
5257328, Apr 04 1991 Fuji Xerox Co., Ltd. Document recognition device
5265171, Nov 28 1990 Kabushiki Kaisha Toshiba Optical character reading apparatus for performing spelling check
5369715, Apr 27 1990 Sharp Kabushiki Kaisha Optical character recognition system
5381489, Aug 10 1988 Nuance Communications, Inc Optical character recognition method and apparatus
5455871, Nov 19 1991 Xerox Corporation Detecting function words without converting a scanned document to character codes
5548700, Dec 29 1989 Xerox Corporation Editing text in an image
5657403, Jun 01 1992 Cognex Corporation Vision coprocessing
5717794, Mar 17 1993 Hitachi, Ltd. Document recognition method and system
5729630, May 14 1990 Canon Kabushiki Kaisha Image processing method and apparatus having character recognition capabilities using size or position information
5734761, Jun 30 1994 Xerox Corporation Editing scanned document images using simple interpretations
5940583, Nov 15 1994 Canon Kabushiki Kaisha Image forming apparatus
6014459, Nov 15 1994 Canon Kabushiki Kaisha Image forming apparatus
6256408, Apr 28 1994 International Business Machines Corporation Speed and recognition enhancement for OCR using normalized height/width position
Patent Priority Assignee Title
2905927,
3295105,
3651459,
4024500, Dec 31 1975 International Business Machines Corporation Segmentation mechanism for cursive script character recognition systems
4075605, Sep 13 1974 RECOGNITION INTERNATIONAL INC Character recognition unit
4491965, Dec 16 1981 Tokyo Shibaura Denki Kabushiki Kaisha Character recognition apparatus
4558461, Jun 17 1983 Litton Systems, Inc. Text line bounding system
4674065, Apr 30 1982 International Business Machines Corporation System for detecting and correcting contextual errors in a text processing system
4727588, Sep 27 1984 International Business Machines Corporation System for automatic adjustment and editing of handwritten text image
Executed on | Assignor | Assignee | Conveyance | Reel/Frame
Mar 04 1988 | | Sharp Kabushiki Kaisha | (assignment on the face of the patent) |
Jun 27 1988 | TANAKA, HIDEAKI | SHARP KABUSHIKI KAISHA, OSAKA, JAPAN A CORP OF JAPAN | ASSIGNMENT OF ASSIGNORS INTEREST | 004914/0372
Jun 27 1988 | KATSURADA, MORIHIRO | SHARP KABUSHIKI KAISHA, OSAKA, JAPAN A CORP OF JAPAN | ASSIGNMENT OF ASSIGNORS INTEREST | 004914/0372
Jun 27 1988 | KONYA, MINEHIRO | SHARP KABUSHIKI KAISHA, OSAKA, JAPAN A CORP OF JAPAN | ASSIGNMENT OF ASSIGNORS INTEREST | 004914/0372
Date Maintenance Fee Events
Aug 11 1992, ASPN: Payor Number Assigned.
Feb 12 1993, M183: Payment of Maintenance Fee, 4th Year, Large Entity.
Feb 10 1997, M184: Payment of Maintenance Fee, 8th Year, Large Entity.
Mar 13 2001, REM: Maintenance Fee Reminder Mailed.
Aug 19 2001, EXP: Patent Expired for Failure to Pay Maintenance Fees.


Date Maintenance Schedule
Aug 22 1992, 4 years fee payment window open
Feb 22 1993, 6 months grace period start (w surcharge)
Aug 22 1993, patent expiry (for year 4)
Aug 22 1995, 2 years to revive unintentionally abandoned end. (for year 4)
Aug 22 1996, 8 years fee payment window open
Feb 22 1997, 6 months grace period start (w surcharge)
Aug 22 1997, patent expiry (for year 8)
Aug 22 1999, 2 years to revive unintentionally abandoned end. (for year 8)
Aug 22 2000, 12 years fee payment window open
Feb 22 2001, 6 months grace period start (w surcharge)
Aug 22 2001, patent expiry (for year 12)
Aug 22 2003, 2 years to revive unintentionally abandoned end. (for year 12)