A scanner scans a group of documents. For example, the documents can be a group of invoices. The documents are received and processed. Objects (e.g., a text object, such as a word) and their locations are identified in each of the documents. Occurrences of similar objects in the identified locations between the documents are determined. A document sorting algorithm is applied to generate a score for each of the documents. The score for each of the documents is generated based on a number of occurrences of similar objects between the documents. The generated score of each of the documents is used to identify a template document. The template document is then used to cluster the documents.
|
1. A method comprising:
scanning, by an electronic scanner, a plurality of documents to produce electronic representations of the plurality of scanned documents;
receiving, by a microprocessor, the electronic representations of the plurality of scanned documents;
for each of the electronic representations of the plurality of scanned documents, identifying, by the microprocessor, a plurality of objects and physical locations of each of the plurality of objects within the electronic representations of the plurality of scanned documents;
determining, by the microprocessor, occurrences of objects in same identified physical locations of each of the plurality of objects between the electronic representations of the plurality of scanned documents, wherein the objects comprise at least one of: a same text, a same letter, a same word, a same picture, a same logo, a same phrase, a same number, a same capitalization, a same case, and a same punctuation mark;
applying, by the microprocessor, a document sorting algorithm to generate a score for each of the electronic representations of the plurality of scanned documents, wherein the score for each of the electronic representations of the plurality of scanned documents is generated based on a number of occurrences of objects in the same identified physical locations between the plurality of scanned documents; and
comparing, by the microprocessor, the generated score of each of the electronic representations of the plurality of scanned documents to identify a template document, wherein the template document is one of the plurality of scanned documents.
11. A system comprising:
an electronic scanner configured to scan a plurality of documents to produce the electronic representations of the plurality of scanned documents;
a document processor comprising a microprocessor configured to receive the electronic representations of the plurality of scanned documents, for each of the electronic representations of the plurality of scanned documents, identify a plurality of objects and physical locations of each of the plurality of objects, determining occurrences of objects in same identified physical locations of each of the plurality of objects between the electronic representations of the plurality of scanned documents within the electronic representations of the plurality of scanned documents, wherein the objects comprise at least one of: a same text, a same letter, a same word, a same picture, a same logo, a same phrase, a same number, a same capitalization, a same case, and a same punctuation mark, and apply a document sorting algorithm to generate a score for each of the electronic representations of the plurality of scanned documents, wherein the score for each of the electronic representations of the plurality of scanned documents is generated based on a number of occurrences of the objects in the same identified physical locations between the electronic representations of the plurality of scanned documents; and
a document classifier configured to compare the generated score of each of the electronic representations of the plurality of scanned documents to identify a template document, wherein the template document is one of the plurality of scanned documents.
20. A system comprising:
an electronic scanner configured to scan a plurality of documents to produce electronic representations of the plurality of scanned documents;
a document processor comprising a microprocessor that is configured to receive the electronic representations of the plurality of scanned documents,
for each of the electronic representations of the plurality of scanned documents, identify a plurality of objects and locations of each of the plurality of objects, determine occurrences of objects in same identified physical locations of each of the plurality of objects between the electronic representations of the plurality of scanned documents, wherein the objects comprise at least one of: a same text, a same letter, a same word, a same picture, a same logo, a same phrase, a same number, a same capitalization, a same case, and a same punctuation mark, apply a document sorting algorithm to generate a score for each of the electronic representations of the plurality of scanned documents, wherein the score for each of the electronic representations of the plurality of scanned documents is generated based on a number of occurrences of the objects in the same identified physical locations between the electronic representations of the plurality of scanned documents, determine an amount of certainty for an occurrence of the objects in a common object document location between the electronic representations of the plurality of scanned documents, identify the common object document location based on a minimum certainty threshold value, determine that a template document contains a scanned error for an individual object in the common object document location in the template document, and generate an updated electronic template document by replacing the individual object in the common object document location in the template document with a second object in response to determining that the template document contains the scanned error for the individual object in the common object document location in the template document, wherein the second object is from the common object location in a second one of the electronic representations of the plurality of scanned documents that has been determined to be correct; and
a document classifier configured to compare the generated score of each of the electronic representations of the plurality of scanned documents to identify the template document, wherein the template document is one of the plurality of scanned documents and cluster the electronic representations of the plurality of scanned documents based on the template document.
2. The method of
determining an amount of certainty for an occurrence of objects in a common object document location between the electronic representations of the plurality of scanned documents;
identifying the common object document location based on a minimum certainty threshold value;
determining that the template document contains a scanned error for an individual object in the common object document location in the template document; and
in response to determining that the template document contains the scanned error for the individual object in the common object document location in the template document, generating an updated electronic template document, by replacing the individual object in the common object document location in the template document with a second object, wherein the second object is from the common object location in a second one of the electronic representations of the plurality of scanned documents that has been determined to be correct.
3. The method of
4. The method of
the two separate objects; or
a single one of the two separate objects.
5. The method of
6. The method of
7. The method of
8. The method of
summing the number of occurrences of objects between the electronic representations of the plurality of scanned documents; and
multiplying the number of occurrences of objects between the electronic representations of the plurality of scanned documents.
9. The method of
10. The method of
12. The system of
13. The system of
14. The system of
the two separate objects; or
a single one of the two separate objects.
15. The system of
16. The method of
17. The system of
summing the number of occurrences of objects between the electronic representations of the plurality of scanned documents; and
multiplying the number of occurrences of objects between the electronic representations of the plurality of scanned documents.
18. The system of
19. The system of
|
This application claims the benefit under 35 U.S.C. § 119(e) of U.S. Provisional Application No. 61/782,842, filed Mar. 14, 2013, entitled “PAGE ALIGNMENT AND CORRECTION,” U.S. Provisional Application No. 61/782,968 entitled “PAGE CLASSIFICATION,” filed Mar. 14, 2013, U.S. Provisional Application No. 61/783,012 entitled “PAGE CLUSTERING,” filed Mar. 14, 2013, U.S. Provisional Application No. 61/783,045 entitled “PAGE RECONSTRUCTION” filed Mar. 14, 2013, and U.S. Provisional Application No. 61/782,893 entitled “SMART ANCHOR” filed Mar. 14, 2013, the entire disclosures of all of which are incorporated herein by reference.
The systems and methods disclosed herein relate to document clustering systems and in particular to automated document clustering systems.
Currently, there are a variety of systems that allow a user to cluster documents for processing. Document clustering is useful for sorting common documents. For example, clustering can be used to sort invoices that are received from multiple vendors. Ideally, a user would like to have a clustering system be able to sort documents with 100% accuracy without any user intervention. However, current systems fall short of this goal. This is due to various factors, such as poor document quality, shortcomings in clustering algorithms, incorrect template identification, lack of complete automation, and the like.
A common document clustering algorithm is the k-means clustering algorithm. The k-means algorithm requires that a user define a template document that is used for clustering. In addition, the user typically has to choose a number of clusters. The system will then cluster a number of documents into the defined number of clusters. However, problems can arise when using k-means clustering. For instance, if the user selects a template document that has been scanned using a low scanning resolution and contains multiple scanning errors, this can result in the k-means algorithm incorrectly clustering documents. Moreover, because the user has to define a number of clusters, the k-means algorithm may incorrectly sort the documents because the number of clusters was defined incorrectly. These types of errors can cause multiple iterations of the process and increased manual intervention. What is needed is a method for clustering documents that overcomes the current problems in clustering systems.
Systems and methods are provided to solve these and other problems and disadvantages of the prior art. A group of similar documents is scanned for processing. For example, the documents can be a group of invoices from a specific vendor. The documents are processed in order to identify which of the documents will be used as the template document. The template document is used to identify similar documents. For example, a company may use the template document to identify an individual invoice from a specific vendor from a group of documents from multiple vendors. Objects (e.g., a text object, such as a word) and their locations are identified in each of the documents. Occurrences of similar objects in the identified locations between the documents are determined. A document sorting algorithm is applied to generate a score for each of the documents. The score for each of the documents is generated based on a number of occurrences of similar objects between the documents. The generated score of each of the documents is used to identify the template document. The template document is then used to cluster the documents. This allows documents to be clustered without any user intervention.
The document manager 110 comprises a document scanner 111, a document processor 112, a document classifier 113, and a document scoring algorithm 114. The document scanner 111 can be any hardware/software that can be used to scan and generate documents 101, such as a scanner, a network scanner, a fax machine, a multifunction peripheral, and/or the like. The document processor 112 can be any hardware/software that can process and manage documents 101. The document classifier 113 can be any hardware/software that can be used to classify/group/cluster documents 101. The document scoring algorithm 114 can be any algorithm for scoring documents 101.
The documents 101 and/or the template document 102 may be physical documents that are scanned in by the document scanner 111. The documents 101 and/or the template document 102 may be generated by a device, such as a camera. The documents 101 and/or the template document 102 can be generated directly by a program, such as a word processing program, a spreadsheet, a presentation program, a graphical program, a picture management program, and/or the like. The documents 101 and/or the template document 102 can be in various forms, such as a Tagged Image File Format (TIFF) file, a Portable Document Format (PDF), a Rich Text Format (RTF), an Extensible Markup Language (XML) document, a Hyper Text Markup Language (HTML) document/web page, a Graphics Interchange Format (GIF) file, and/or the like.
The documents 101 are received by the document processor 112. The documents 101 can be received from various sources, such as from the document scanner 111, a network scanner, a networked device, a database, a camera, and/or the like. The documents 101 can include a variety of objects. For example, objects in the documents 101 can include a text object, a picture object, an icon object, a graphic object, a logo object, a number, a symbol, a table, a graphical element, metadata in the documents 101, and/or the like. A text object may include a single letter, a word, a sentence, a paragraph, a heading, a page, a phrase, a footer, a header, a name, a marked change text, and/or the like. An object may comprise multiple objects. For instance, a picture may comprise multiple objects such as a car, a person, a building, and/or the like. A text object such as a sentence may comprise multiple text objects. Objects can be predefined. For example, text objects can include specific words or phrases.
The document processor 112 determines occurrences of similar objects in the identified locations of objects between the documents 101. The process of aligning and identifying locations of objects between documents is described in patent application Ser. No. 14/174,674 entitled “SYSTEM AND METHOD FOR DOCUMENT ALIGNMENT, CORRECTION, AND CLASSIFICATION,” which was filed on Feb. 6, 2014 and is incorporated herein in its entirety by reference.
In addition, distances may be determined by using a character, word, and/or line distance of a document. This can be useful for semi-formatted documents, such as Hyper Text Markup Language (HTML) documents, where the spacing between the characters and lines is consistent. In this embodiment, the distance is calculated based on a number of characters, words, and/or lines that are between the two objects. For example, if one of the objects is on line 1, two characters in, and the second object is on line 2, four characters in, the system could calculate the distance based on the two objects being one line and two characters apart.
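The character/line distance described above can be sketched as follows. This is a minimal illustration only; the `TextPosition` type and the `line_char_distance` function are hypothetical names, and the sketch assumes the distance is expressed as a pair of line and character offsets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextPosition:
    """Hypothetical position of an object in a semi-formatted document."""
    line: int    # 1-based line number
    column: int  # number of characters in from the start of the line


def line_char_distance(a: TextPosition, b: TextPosition) -> tuple[int, int]:
    """Return (lines apart, characters apart) between two object positions."""
    return abs(a.line - b.line), abs(a.column - b.column)


# The example from the text: line 1, two characters in, versus
# line 2, four characters in.
first = TextPosition(line=1, column=2)
second = TextPosition(line=2, column=4)
print(line_char_distance(first, second))  # (1, 2)
```

A word-based distance could be computed the same way by storing a word index instead of a character column.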
An object may be similar in various ways. For instance, a text object can be similar if it is the same word (even though the word is in one font and/or color in one document 101 and in a different font and/or color in another document 101). Likewise, a picture or logo object can be similar if it is in color in one document 101 and in black and white in a different document 101. Likewise, other features of the object may be used in determining that objects are similar, such as font size, upper case/lower case, individual letters being capitalized, and/or the like. Similar objects can also include objects in an unknown language, a phrase, a number, a punctuation mark, and/or the like.
The document processor 112 determines occurrences of similar objects in the identified locations of the plurality of objects between the documents 101. Determining the occurrences of similar objects between documents 101 can be accomplished in various ways. For example, referring to
The result of determining occurrences of similar objects between the documents 101A-101E is shown in
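One possible way to count occurrences of similar objects at common locations is sketched below. The data layout (a mapping from a location key to the object found there), the `normalize` rule, and the sample words are all assumptions made for illustration; the counting used in the figures may differ.

```python
from collections import Counter

# Hypothetical representation: each document maps a location key to the
# object (here, a word) found at that location.
Document = dict[str, str]


def normalize(obj: str) -> str:
    """Treat objects as similar regardless of letter case (an assumed rule)."""
    return obj.casefold()


def count_occurrences(documents: dict[str, Document]) -> dict[str, dict[str, int]]:
    """For every object in every document, count the documents (including the
    document itself) that hold a similar object at the same location."""
    per_location: dict[str, Counter] = {}
    for doc in documents.values():
        for location, obj in doc.items():
            per_location.setdefault(location, Counter())[normalize(obj)] += 1
    return {
        name: {loc: per_location[loc][normalize(obj)] for loc, obj in doc.items()}
        for name, doc in documents.items()
    }


# Made-up sample data with two simulated scanning errors.
docs = {
    "A": {"w1": "The", "w2": "quick", "w3": "fox"},
    "B": {"w1": "the", "w2": "quick", "w3": "f?x"},
    "C": {"w1": "The", "w2": "qu1ck", "w3": "fox"},
}
counts = count_occurrences(docs)
print(counts["B"])  # {'w1': 3, 'w2': 2, 'w3': 1}
```

An object that was scanned incorrectly matches fewer documents at its location, so it contributes a lower count to its document.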
Alternatively, this process could be used on other languages, such as computer programming languages, computer instructions, and/or the like. In this embodiment, the process can be used to identify if similar code has been used between programs. The objects could be computer instructions in binary or instructions in a higher level programming language, such as Hyper Text Markup Language (HTML), Java, JavaScript, C, C++, and/or the like.
In another embodiment, the objects can be physical objects that are in a container (similar to the document 101), such as parts or goods. The objects are identified based on their weights and/or sizes in order to show the occurrences of objects between the containers.
The document processor 112 applies a document sorting algorithm 114 to generate a score for each of the documents 101A-101E. The score for each of the documents 101A-101E is based on the number of occurrences of similar objects between the documents 101A-101E. The document sorting algorithm 114 for determining the score can be implemented in various ways.
For example, the document processor 112 can use a summing algorithm 114 that sums the number of occurrences of similar objects between the documents 101A-101E. The document processor 112 generates the following scores (based on the results in
Document 101B 1+8+3+2+1+3+1+8+1+3=31
Document 101C 1+1+3+1+1+1+3+1+1+3=16
Document 101D 1+8+3+2+2+3+2+8+2+1=32
Document 101E 1+8+1+1+1+1+3+8+2+1=27
The result is the following scores for each of the documents 101A-101E.
Document | Score |
101A | 29 |
101B | 31 |
101C | 16 |
101D | 32 |
101E | 27 |
The documents 101A-101E can be sorted based on the scores. In this example, document 101D has the highest score, which is based on a summing algorithm. The document 101D contains the fewest errors. However, in other embodiments, a document sorting algorithm 114 that multiplies the variables, divides the variables, sums the variables, or subtracts the variables, separately or in combination, can be used to generate a score for each of the documents 101A-101E.
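The summing algorithm and the sort can be sketched as follows, using the per-object occurrence counts from the rows listed above. The function names are hypothetical, and the multiplying variant is shown only as one of the alternative combinations mentioned in the text.

```python
import math

# Per-object occurrence counts copied from the rows listed above.
# The row for document 101A is not listed above, so it is left out here.
occurrences = {
    "101B": [1, 8, 3, 2, 1, 3, 1, 8, 1, 3],
    "101C": [1, 1, 3, 1, 1, 1, 3, 1, 1, 3],
    "101D": [1, 8, 3, 2, 2, 3, 2, 8, 2, 1],
    "101E": [1, 8, 1, 1, 1, 1, 3, 8, 2, 1],
}


def summing_score(counts: list[int]) -> int:
    """Score a document by summing its occurrence counts."""
    return sum(counts)


def multiplying_score(counts: list[int]) -> int:
    """Alternative score that multiplies the occurrence counts."""
    return math.prod(counts)


scores = {name: summing_score(counts) for name, counts in occurrences.items()}

# Sort from highest to lowest score.
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
print(ranked)  # [('101D', 32), ('101B', 31), ('101E', 27), ('101C', 16)]
```

Document 101A's score of 29 also falls below 32, so document 101D remains the highest-scoring document when all five documents are considered.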
In addition, other factors can be included in the document sorting algorithm 114, such as scanning resolutions, sources of the documents 101, a type of document 101, a provider of the documents 101, use of color versus black and white scanned documents 101, fonts, font sizes, use of metadata in the documents 101, use of document change tracking, and/or the like. For example, the document sorting algorithm 114 could add an additional number to the score for documents that were scanned in color or use a specific font or combination of fonts. The document sorting algorithm 114 could add or subtract from the score for a document 101 if the document 101 does not use a specific font or was received from a specific vendor. Alternatively, the score could be adjusted based on whether the document 101 was encoded using a specific format, such as HTML, PDF, or GIF. The score could be adjusted based on the document being generated from a specific word processing program or graphical processing program.
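A score adjustment of this kind might look like the following sketch. The `DocumentInfo` fields and every weight shown are hypothetical values chosen for illustration; the text does not specify particular amounts.

```python
from dataclasses import dataclass


@dataclass
class DocumentInfo:
    """Hypothetical per-document attributes used to adjust a score."""
    base_score: int          # score from the occurrence counts
    scanned_in_color: bool
    file_format: str         # e.g. "HTML", "PDF", "GIF"


# Assumed adjustment weights; not taken from the source text.
COLOR_BONUS = 2
FORMAT_ADJUSTMENT = {"PDF": 1, "GIF": -1}


def adjusted_score(doc: DocumentInfo) -> int:
    """Add or subtract from the base score based on other factors."""
    score = doc.base_score
    if doc.scanned_in_color:
        score += COLOR_BONUS
    # Formats without a configured adjustment leave the score unchanged.
    score += FORMAT_ADJUSTMENT.get(doc.file_format.upper(), 0)
    return score


print(adjusted_score(DocumentInfo(32, True, "pdf")))   # 35
print(adjusted_score(DocumentInfo(31, False, "GIF")))  # 30
```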
Based on the scores, the document classifier 113 identifies the template document 102. In this example, the document 101D is identified to be the template document 102 because document 101D has the highest score (32). The template document 102 (101D) can then be used to cluster documents 101A-101E along with additional documents 101.
Once the template document 102 has been identified, the first illustrative system 100 can use the documents 101 to make corrections to the template document 102. Error correction in the template document 102 is further illustrated in
The amount of certainty is the degree to which the object occurs in the common locations between the documents 101A-101E relative to the total number of documents (five in this example). The amount of certainty for documents 101A-101E in
The document processor 112 identifies common object locations based on a minimum certainty threshold value. The minimum certainty threshold value is a minimum number of documents 101A-101E that the object occurs in. The minimum certainty threshold value can be a user defined value that indicates that there is a high probability that the object is correct and has been scanned properly. Alternatively, the minimum certainty threshold value can be defined by the system. In one embodiment, the minimum certainty threshold value can be set based on the type of object. For example, a minimum certainty threshold value of 60% can be defined for graphical objects and a minimum certainty threshold value of 80% can be defined for text objects. The minimum certainty threshold value can be defined based on a file type, a font, a font size, a language, and/or the like.
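The certainty calculation and the per-type threshold check can be sketched as follows. The function names and the sample word list are hypothetical, and the sketch assumes the certainty is the fraction of documents that contain the most common object at a location.

```python
from collections import Counter

# Example thresholds from the text: 80% for text objects, 60% for
# graphical objects.
MIN_CERTAINTY = {"text": 0.80, "graphic": 0.60}


def certainty(objects_at_location: list[str]) -> tuple[str, float]:
    """Return the most common object at a location and the fraction of
    documents in which it occurs."""
    obj, count = Counter(objects_at_location).most_common(1)[0]
    return obj, count / len(objects_at_location)


def is_common_location(objects_at_location: list[str], object_type: str) -> bool:
    """Check whether a location meets the minimum certainty threshold."""
    _, value = certainty(objects_at_location)
    return value >= MIN_CERTAINTY[object_type]


# Made-up sample: the same location in five documents, one misread.
words = ["jumps", "jumps", "jumps", "?umps", "jumps"]
print(certainty(words))                   # ('jumps', 0.8)
print(is_common_location(words, "text"))  # True
```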
An 80% minimum certainty threshold value in this example is a good threshold because it indicates that the word object is in at least four of the five documents 101A-101E at the common location between the documents 101A-101E. In
In this example, the word “the” occurs twice in each document 101A-101E at two different locations. In this embodiment, each location of the word “the” is identified as a separate text object with a separate certainty value. However, in an alternative embodiment, the certainty of the word “the” can be based on how often the word “the” occurs in the documents 101A-101E regardless of the number of locations. For example, the word “the” occurs in each of the documents at least once.
The document processor 112 determines that the template document 102 (assume that document 101D is the template document 102 in this example) contains an error for an individual word object in at least one of the common document locations. In this example, the template document 102 (101D) contains an error in the word “jumps” (“?umps”). In response to determining the error in the template document 102 (101D), the document processor 112 replaces the object “?umps” in the template document 102 (101D) with the object found at the common location in the other documents 101. In this example, the document processor 112 would replace the object “?umps” in the template document 102 (101D) with the word “jumps” because the word “jumps” is determined to be correct: it occurs in the other four documents at the same common location. By removing errors from the template document 102, improved clustering of documents 101 can result because the template document 102 now contains fewer errors.
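The replacement step can be sketched as follows. This is one possible reading of the process, assuming each document is a mapping from a location key to a word and that the object judged correct is the most common one at a location meeting the threshold; `correct_template` and the location keys are hypothetical names.

```python
from collections import Counter


def correct_template(
    template: dict[str, str],
    others: list[dict[str, str]],
    min_certainty: float = 0.8,
) -> dict[str, str]:
    """Return a corrected copy of the template in which each object that
    disagrees with a sufficiently certain common object is replaced."""
    all_docs = [template, *others]
    corrected = dict(template)
    for location, obj in template.items():
        observed = [doc[location] for doc in all_docs if location in doc]
        best, count = Counter(observed).most_common(1)[0]
        # Replace only when the common object meets the certainty threshold.
        if count / len(all_docs) >= min_certainty and obj != best:
            corrected[location] = best
    return corrected


# Made-up sample: the template misreads one word; four other documents agree.
template = {"w5": "?umps", "w6": "over"}
others = [{"w5": "jumps", "w6": "over"}] * 4
print(correct_template(template, others))  # {'w5': 'jumps', 'w6': 'over'}
```

Returning a copy rather than modifying the input mirrors the embodiment in which the corrected template is stored separately from the original document.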
In one embodiment, the corrected template document 102 is stored separately by the document processor 112 to process the documents 101 for clustering. However, in other embodiments the actual document 101D is error corrected before clustering. Likewise, this process can be applied to the remaining documents 101A-101C and 101E before clustering.
The computer 130 can be any computing device, such as a personal computer, a Personal Digital Assistant (PDA), a telephone, a smart telephone, a laptop computer, a tablet computer, and/or the like. The server 131 can be any hardware/software that can manage documents 101/102, such as a file server, a database server, a web server, and/or the like. The server 131 further comprises a database 132. The database 132 can be any type of database, such as a relational database, an object oriented database, a directory service, a file system, and/or the like. The database 132 comprises the documents 101. In this illustrative embodiment, the document manager 110 only comprises the document processor 112 and the document classifier 113.
The document manager 110 is connected to a network 120. The network 120 can be or may include any network that can send and receive information, such as the Internet, a Wide Area Network (WAN), a Local Area Network (LAN), a Voice over IP Network (VoIP), the Public Switched Telephone Network (PSTN), a packet switched network, a circuit switched network, a cellular network, a combination of these, and the like. The network 120 can use a variety of protocols, such as Ethernet, Internet Protocol (IP), 802.11G, Simple Network Management Protocol (SNMP), and the like.
The document processor 112 can receive documents 101 from the devices 130, 131, and 132 on the network 120. For example, a user at computer 130 could create the documents 101 that are either sent directly to the document processor 112 or stored in the database 132. A user at computer 130 could fill out an invoice (document 101) and send the invoice to a company for processing. The invoice (document 101) could then be stored in the database 132 for processing by the document processor 112 as described in
Alternatively, the network scanner 133 could be used to scan the documents 101 for storage in the database 132. The scanned documents 101 could be sent directly to the document processor 112 from the network scanner 133.
In another embodiment, the document processor 112 can periodically retrieve the documents 101 from the file server 131 via the database 132 for processing. This way, invoices/contracts can be processed based on pay periods or other time periods.
The process starts in step 700. The documents 101 are received in step 702. Objects and their locations in the received documents are identified in step 704. Identifying locations of objects can be optionally based on the occurrences of similar objects in common locations between the documents in step 706. Occurrences of similar objects are determined in the identified locations of the objects between the documents in step 708. In addition to the number of occurrences of similar objects between the documents, other factors can be considered, such as fonts, font size, object color, object size, relative object size, weight, three dimensional object size, plural versus singular (e.g., “page” and “pages” can be considered the same word), misalignment due to use of different fonts/font sizes, and/or the like.
For example, if one vendor uses one font and a second vendor uses a different font for the same document, and the difference misaligns the word locations relative to each other in the documents, the process can take into consideration the differences in the fonts/font sizes to compare the documents even though the word locations are different (e.g., a font causes a word object or picture object to move to the next line because the font uses a bigger character size than another font). The process will calculate a different location based on the different font/font size. This process can be accomplished across multiple pages within a document.
Documents with misaligned objects can be initially identified based on word counts between the documents. Even though one of the documents is misaligned, the process of counting words still works in the same manner.
An algorithm is applied to generate a score for each of the documents in step 710. The score is generated based on at least a number of occurrences of similar objects between the documents in step 710. The process compares the scores of the documents to identify a template document in step 712. The process clusters the documents based on the template document in step 714. The process ends in step 716.
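Steps 708 through 714 can be pulled together in the sketch below. The text does not specify how documents are clustered against the template, so the `cluster` function and its `min_overlap` threshold are assumptions made purely for illustration, as are the sample documents and all function names.

```python
from collections import Counter

Document = dict[str, str]  # hypothetical: location key -> object


def score_documents(documents: dict[str, Document]) -> dict[str, int]:
    """Steps 708-710: count matching objects per location, then sum them."""
    per_location: dict[str, Counter] = {}
    for doc in documents.values():
        for loc, obj in doc.items():
            per_location.setdefault(loc, Counter())[obj] += 1
    return {
        name: sum(per_location[loc][obj] for loc, obj in doc.items())
        for name, doc in documents.items()
    }


def pick_template(scores: dict[str, int]) -> str:
    """Step 712: the highest-scoring document becomes the template."""
    return max(scores, key=scores.get)


def cluster(documents: dict[str, Document], template_name: str,
            min_overlap: float = 0.5) -> list[str]:
    """Step 714 (assumed rule): keep documents that share enough objects
    with the template at the same locations."""
    template = documents[template_name]
    members = []
    for name, doc in documents.items():
        shared = sum(1 for loc, obj in template.items() if doc.get(loc) == obj)
        if shared / len(template) >= min_overlap:
            members.append(name)
    return members


# Made-up sample documents: three similar invoices and one unrelated memo.
documents = {
    "inv1": {"a": "Invoice", "b": "Total", "c": "Acme"},
    "inv2": {"a": "Invoice", "b": "Total", "c": "Acme"},
    "inv3": {"a": "Invoice", "b": "T0tal", "c": "Acme"},
    "memo": {"a": "Memo", "b": "From", "c": "To"},
}
scores = score_documents(documents)
template_name = pick_template(scores)
print(scores)                             # {'inv1': 8, 'inv2': 8, 'inv3': 7, 'memo': 3}
print(cluster(documents, template_name))  # ['inv1', 'inv2', 'inv3']
```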
If there is not an error in the template document in step 806, the process goes to step 712. If there is an error in the template document in step 806, the process replaces the individual object(s) in the template document at the common document locations with a second object(s). The second object(s) is from the common object location(s) of a second one of the documents that has been determined to be correct. The process goes to step 712.
The above described process can be used to compare the documents 101F and 101G along with other documents to generate scores of the documents 101F-101G. The scores can then be used to identify a template document 102. In this example, the document 101F would be identified as the template document because it has fewer errors than the document 101G.
Of course, various changes and modifications to the illustrative embodiment described above will be apparent to those skilled in the art. These changes and modifications can be made without departing from the spirit and the scope of the system and method and without diminishing its attendant advantages. The following claims specify the scope of the invention. Those skilled in the art will appreciate that the features described above can be combined in various ways to form multiple variations of the invention. As a result, the invention is not limited to the specific embodiments described above, but only by the following claims and their equivalents.
Executed on | Assignor | Assignee | Conveyance | Frame | Reel | Doc |
Feb 25 2014 | GHESSASSI, KARIM | Digitech Systems Private Reserve, LLC | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 032366 | /0346 | |
Feb 28 2014 | Digitech Systems Private Reserve, LLC | (assignment on the face of the patent) | / |