Document clustering and reconstruction

US11126839

A scanner scans a group of documents. For example, the documents can be a group of invoices. The documents are received and processed. Objects (e.g., a text object, such as a word) and their locations are identified in each of the documents. Occurrences of similar objects in the identified locations between the documents are determined. A document sorting algorithm is applied to generate a score for each of the documents. The score for each of the documents is generated based on a number of occurrences of similar objects between the documents. The generated score of each of the documents is used to identify a template document. The template document is then used to cluster the documents.

Patent 11126839
Priority Mar 14 2013
Filed Feb 28 2014
Issued Sep 21 2021
Expiry Mar 14 2037 Extension 1110 days
Inventors Ghessassi,…
Assg.orig Digitech S…
Assg.curr Digitech S…
Entity Small
Referenced by 1
References 18
Maint.: currently ok

1. A method comprising:

scanning, by an electronic scanner, a plurality of documents to produce electronic representations of the plurality of scanned documents;

receiving, by a microprocessor, the electronic representations of the plurality of scanned documents;

for each of the electronic representations of the plurality of scanned documents, identifying, by the microprocessor, a plurality of objects and physical locations of each of the plurality of objects within the electronic representations of the plurality of scanned documents;

determining, by the microprocessor, occurrences of objects in same identified physical locations of each of the plurality of objects between the electronic representations of the plurality of scanned documents, wherein the objects comprise at least one of: a same text, a same letter, a same word, a same picture, a same logo, a same phrase, a same number, a same capitalization, a same case, and a same punctuation mark;

applying, by the microprocessor, a document sorting algorithm to generate a score for each of the electronic representations of the plurality of scanned documents, wherein the score for each of the electronic representations of the plurality of scanned documents is generated based on a number of occurrences of objects in the same identified physical locations between the plurality of scanned documents; and

comparing, by the microprocessor, the generated score of each of the electronic representations of the plurality of scanned documents to identify a template document, wherein the template document is one of the plurality of scanned documents.

11. A system comprising:

an electronic scanner configured to scan a plurality of documents to produce the electronic representations of the plurality of scanned documents;

a document processor comprising a microprocessor configured to receive the electronic representations of the plurality of scanned documents, for each of the electronic representations of the plurality of scanned documents, identify a plurality of objects and physical locations of each of the plurality of objects, determining occurrences of objects in same identified physical locations of each of the plurality of objects between the electronic representations of the plurality of scanned documents within the electronic representations of the plurality of scanned documents, wherein the objects comprise at least one of: a same text, a same letter, a same word, a same picture, a same logo, a same phrase, a same number, a same capitalization, a same case, and a same punctuation mark, and apply a document sorting algorithm to generate a score for each of the electronic representations of the plurality of scanned documents, wherein the score for each of the electronic representations of the plurality of scanned documents is generated based on a number of occurrences of the objects in the same identified physical locations between the electronic representations of the plurality of scanned documents; and

a document classifier configured to compare the generated score of each of the electronic representations of the plurality of scanned documents to identify a template document, wherein the template document is one of the plurality of scanned documents.

20. A system comprising:

an electronic scanner configured to scan a plurality of documents to produce electronic representations of the plurality of scanned documents;

a document processor comprising a microprocessor that is configured to receive the electronic representations of the plurality of scanned documents,

for each of the electronic representations of the plurality of scanned documents, identify a plurality of objects and locations of each of the plurality of objects, determine occurrences of objects in same identified physical locations of each of the plurality of objects between the electronic representations of the plurality of scanned documents, wherein the objects comprise at least one of: a same text, a same letter, a same word, a same picture, a same logo, a same phrase, a same number, a same capitalization, a same case, and a same punctuation mark, apply a document sorting algorithm to generate a score for each of the electronic representations of the plurality of scanned documents, wherein the score for each of the electronic representations of the plurality of scanned documents is generated based on a number of occurrences of the objects in the same identified physical locations between the electronic representations of the plurality of scanned documents, determine an amount of certainty for an occurrence of the objects in a common object document location between the electronic representations of the plurality of scanned documents, identify the common object document location based on a minimum certainty threshold value, determine that a template document contains a scanned error for an individual object in the common object document location in the template document, and generate an updated electronic template document by replacing the individual object in the common object document location in the template document with a second object in response to determining that the template document contains the scanned error for the individual object in the common object document location in the template document, wherein the second object is from the common object location in a second one of the electronic representations of the plurality of scanned documents that has been determined to be correct; and

a document classifier configured to compare the generated score of each of the electronic representations of the plurality of scanned documents to identify the template document, wherein the template document is one of the plurality of scanned documents and cluster the electronic representations of the plurality of scanned documents based on the template document.

2. The method of claim 1, further comprising:

determining an amount of certainty for an occurrence of objects in a common object document location between the electronic representations of the plurality of scanned documents;

identifying the common object document location based on a minimum certainty threshold value;

determining that the template document contains a scanned error for an individual object in the common object document location in the template document; and

in response to determining that the template document contains the scanned error for the individual object in the common object document location in the template document, generating an updated electronic template document, by replacing the individual object in the common object document location in the template document with a second object, wherein the second object is from the common object location in a second one of the electronic representations of the plurality of scanned documents that has been determined to be correct.

3. The method of claim 2, wherein the scanned error for the individual object in the template document is determined based on a number of occurrences of the objects in the common document location in the electronic representations of the plurality of scanned documents.

4. The method of claim 2, wherein each of the electronic representations of the plurality of scanned documents comprises at least two separate objects that are at different common locations between the electronic representations of the plurality of scanned documents and wherein the amount of certainty is determined based on one of:

the two separate objects; or

a single one of the two separate objects.

5. The method of claim 1, wherein the objects comprises at least one of: a text object in an unknown language, an object that is part of a computer programming language, a phrase, a number, a punctuation mark, a text object, a graphical object, a logo, and a picture.

6. The method of claim 1, wherein the common locations are determined based on at one or more of a distance, a relative distance, a relative angle, a character distance, a word distance, and line distance.

7. The method of claim 1, wherein the objects are text objects and wherein the electronic representations of the plurality of scanned documents are received from the electronic scanner.

8. The method of claim 1, wherein the document sorting algorithm generates the score for each of the electronic representations of the plurality of scanned documents by at least one of:

summing the number of occurrences of objects between the electronic representations of the plurality of scanned documents; and

multiplying the number of occurrences of objects between the electronic representations of the plurality of scanned documents.

9. The method of claim 1, wherein the template document is used to cluster the electronic representations of the plurality of scanned documents based on the template document.

10. The method of claim 1, wherein determining occurrences of the objects in the identified physical locations of the plurality of objects between the electronic representations of the plurality of scanned documents further comprises recalculating the identified physical locations based on a misalignment of the identified physical locations due to a use of at least one of a different font and a different font size used by a scanner when electronically scanning one or more of the electronic representations of the plurality of scanned documents.

12. The system of claim 11, wherein the document processor is further configured to determine an amount of certainty for an occurrence of objects in a common object document location between the electronic representations of the plurality of scanned documents, identify the common object document location based on a minimum certainty threshold value, determine that the template document contains a scanned error for an individual object in the common object document location in the template document, and generate an updated electronic template document by replacing the individual object in the common object document location in the template document with a second object in response to determining that the template document contains the scanned error for the individual object in the common object document location in the template document, wherein the second object is from the common object location in a second one of the electronic representations of the plurality of scanned documents that has been determined to be correct.

13. The system of claim 12, wherein the scanned error for the individual object in the template document is determined based on a number of occurrences of the objects in the common document location in the electronic representations of the plurality of scanned documents.

14. The system of claim 12, wherein each of the electronic representations of the plurality of scanned documents comprises at least two separate objects that are at different common locations between the electronic representations of the plurality of scanned documents and wherein the amount of certainty is determined based on one of:

the two separate objects; or

a single one of the two separate objects.

15. The system of claim 11, wherein the objects comprises at least one of: a text object in an unknown language, an object that is part of a computer programming language, a phrase, a number, a punctuation mark, a text object, a graphical object, a logo, and a picture.

16. The method of claim 11, further comprising a scanner that generates the electronic representations of the plurality of scanned documents and wherein the objects are text objects.

17. The system of claim 11, wherein the document sorting algorithm generates the score for each of the electronic representations of the plurality of scanned documents by at least one of:

summing the number of occurrences of objects between the electronic representations of the plurality of scanned documents; and

multiplying the number of occurrences of objects between the electronic representations of the plurality of scanned documents.

18. The system of claim 11, wherein the template document is used to cluster the electronic representations of the plurality of scanned documents based on the template document.

19. The system of claim 11, wherein document processor is further configured to recalculate the identified physical locations based on a misalignment of the identified physical locations due to a use of at least one of a different font and a different font size used by an electronic scanner when electronically scanning one or more of the electronic representations of the plurality of scanned documents.

CROSS REFERENCE TO RELATED APPLICATIONS

This application claims the benefit under 35 U.S.C. § 119(e) of U.S. Provisional Application No. 61/782,842, filed Mar. 14, 2013, entitled “PAGE ALIGNMENT AND CORRECTION,” U.S. Provisional Application No. 61/782,968 entitled “PAGE CLASSIFICATION,” filed Mar. 14, 2013, U.S. Provisional Application No. 61/783,012 entitled “PAGE CLUSTERING,” filed Mar. 14, 2013, U.S. Provisional Application No. 61/783,045 entitled “PAGE RECONSTRUCTION” filed Mar. 14, 2013, and U.S. Provisional Application No. 61/782,893 entitled “SMART ANCHOR” filed Mar. 14, 2013, the entire disclosures of all of which are incorporated herein by reference.

TECHNICAL FIELD

The systems and methods disclosed herein relate to document clustering systems and in particular to automated document clustering systems.

BACKGROUND

Currently, there are a variety of systems that allow a user to cluster documents for processing. Document clustering is useful for sorting common documents. For example, clustering can be used to sort invoices that are received from multiple vendors. Ideally, a user would like to have a clustering system be able to sort documents with 100% accuracy without any user intervention. However, current systems fall short of this goal. This is due to various factors, such as poor document quality, shortcomings in clustering algorithms, incorrect template identification, lack of complete automation, and the like.

A common document clustering algorithm is the k-means clustering algorithm. The k-means algorithm requires that a user define a template document that is used for clustering. In addition, the user typically has to choose a number of clusters. The system will then cluster a number of documents into the defined number of clusters. However, problems can arise when using k-means clustering. For instance, if the user selects a template document that has been scanned using a low scanning resolution and contains multiple scanning errors, this can result in the k-means algorithm incorrectly clustering documents. Moreover, because the user has to define a number of clusters, the k-means algorithm may incorrectly sort the documents because the number of clusters was defined incorrectly. These types of errors can cause multiple iterations of the process and increased manual intervention. What is needed is a method for clustering documents that overcomes the current problems in clustering systems.

SUMMARY

Systems and methods are provided to solve these and other problems and disadvantages of the prior art. A group of similar documents is scanned for processing. For example, the documents can be a group of invoices from a specific vendor. The documents are processed in order to identify which of the documents will be used as the template document. The template document is used to identify similar documents. For example, a company may use the template document to identify an individual invoice from a specific vendor from a group of documents from multiple vendors. Objects (e.g., a text object, such as a word) and their locations are identified in each of the documents. Occurrences of similar objects in the identified locations between the documents are determined. A document sorting algorithm is applied to generate a score for each of the documents. The score for each of the documents is generated based on a number of occurrences of similar objects between the documents. The generated score of each of the documents is used to identify the template document. The template document is then used to cluster the documents. This allows documents to be clustered without any user intervention.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of a first illustrative system for clustering documents.

FIG. 2 is a block diagram of a second illustrative system for clustering documents.

FIG. 3 is an exemplary diagram of documents that are used for clustering.

FIG. 4 is a table of occurrences of objects in documents that are clustered.

FIG. 5 is an exemplary diagram of documents that are used for clustering and template reconstruction.

FIG. 6 is a table of occurrences of objects in documents and the percentage of certainty of each object.

FIG. 7 is a flow diagram of a process for clustering documents.

FIG. 8 is a flow diagram of a process for reconstructing a template document.

FIG. 9 is a diagram of two exemplary documents that are used to determine a template document.

DETAILED DESCRIPTION

FIG. 1 is a block diagram of a first illustrative system for clustering documents. The first illustrative system 100 comprises a document manager 110. The document manager 110 can be any hardware/software that can be used to manage documents 101, such as a scanner, a multifunction peripheral, a fax machine, a network based software application, and/or the like.

The document manager 110 comprises a document scanner 111, a document processor 112, a document classifier 113, and a document scoring algorithm 114. The document scanner 111 can be any hardware/software that can be used to scan and generate documents 101, such as a scanner, a network scanner, a fax machine, a multifunction peripheral, and/or the like. The document processor 112 can be any hardware/software that can process and manage documents 101. The document classifier 113 can be any hardware/software that can be used to classify/group/cluster documents 101. The document scoring algorithm 114 can be any algorithm for scoring documents 101.

FIG. 1 also includes the documents 101. The documents 101 are documents that are to be clustered or grouped into different classifications of documents. The documents 101 can be any type of document 101 that needs to be sorted or clustered, such as an invoice, a medical form, an employment agreement, a contract, a bill, a legal document, a tracking form, a ticket, a check, and the like. The documents 101 can be any number of documents 101. The documents 101 can be a single page, multiple pages, web pages, graphical pages, pictures, spreadsheets, files, and/or the like. The documents are processed to identify one or more template documents 102. The template document 102 is a document that is used as a reference to cluster the documents 101 into different groups. The documents 101 used for identifying the template document 102 are typically similar documents 101. The template document 102 is identified from the documents 101 as the best document to perform clustering on the documents 101. For example, the documents 101 may be a group of similar invoices or contracts from a specific vendor. Once a template document 102 is identified, the documents 101 can then be clustered. For example, a company's billing department may receive hundreds of invoices from vendor A and vendor B. The company can use an identified template document 102 to cluster the invoices from vendor A into one cluster and the invoices from vendor B into another cluster of documents. The advantage is that the documents 101 do not have to be sorted manually.

The documents 101 and/or the template document 102 may be physical documents that are scanned in by the document scanner 111. The documents 101 and/or the template document 102 may be generated by a device, such as a camera. The documents 101 and/or the template document 102 can be generated directly by a program, such as a word processing program, a spreadsheet, a presentation program, a graphical program, a picture management program, and/or the like. The documents 101 and/or the template document 102 can be in various forms, such as a Tagged Image File Format (TIFF) file, a Portable Document Format (PDF) file, a Rich Text Format (RTF) file, an Extensible Markup Language (XML) document, a Hyper Text Markup Language (HTML) document/web page, a Graphics Interchange Format (GIF) file, and/or the like.

The documents 101 are received by the document processor 112. The documents 101 can be received from various sources, such as from the document scanner 111, a network scanner, a networked device, a database, a camera, and/or the like. The documents 101 can include a variety of objects. For example, objects in the documents 101 can include a text object, a picture object, an icon object, a graphic object, a logo object, a number, a symbol, a table, a graphical element, metadata in the documents 101, and/or the like. A text object may include a single letter, a word, a sentence, a paragraph, a heading, a page, a phrase, a footer, a header, a name, a marked change text, and/or the like. An object may comprise multiple objects. For instance, a picture may comprise multiple objects such as a car, a person, a building, and/or the like. A text object such as a sentence may comprise multiple text objects. Objects can be predefined. For example, text objects can include specific words or phrases.

The document processor 112 determines occurrences of similar objects in the identified locations of objects between the documents 101. The process of aligning and identifying locations of objects between documents is described in patent application Ser. No. 14/174,674 entitled “SYSTEM AND METHOD FOR DOCUMENT ALIGNMENT, CORRECTION, AND CLASSIFICATION,” which was filed on Feb. 6, 2014 and is incorporated herein in its entirety by reference.

In addition, distances may be determined by using a character, word, and/or line distance of a document. This can be useful for documents that are semi-formatted documents, such as Hyper Text Markup Language (HTML) documents, where the spacing between the characters and lines is consistent. In this embodiment, the distance is calculated based on a number of characters, words, and/or lines that are between the two objects. For example, if one of the objects was on line 1, two characters in, and the second object was on line 2, four characters in, the system could calculate the distance based on the two objects being one line apart and two characters in.
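The line and character distance in the example above can be sketched as follows. This is a minimal illustration only; the names Position and grid_distance are hypothetical and do not appear in the disclosure, and the sketch assumes each object's location is already known as a line number and a character offset.

```python
from typing import NamedTuple


class Position(NamedTuple):
    """Hypothetical location of an object in a semi-formatted document."""
    line: int    # line number on which the object starts
    column: int  # number of characters in from the start of the line


def grid_distance(a: Position, b: Position) -> tuple[int, int]:
    """Return (lines apart, characters apart) for two object positions."""
    return (abs(b.line - a.line), abs(b.column - a.column))


first = Position(line=1, column=2)   # line 1, two characters in
second = Position(line=2, column=4)  # line 2, four characters in
print(grid_distance(first, second))  # (1, 2): one line apart, two characters in
```

A word distance could presumably be computed the same way by counting words instead of characters.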

An object may be similar in various ways. For instance, a text object can be similar if it is the same word (even though the word is in one font and/or color in one document 101 and in a different font and/or color in another document 101). Likewise, a picture or logo object can be similar if it is in color in one document 101 and in black and white in a different document 101. Likewise, other features of the object may be used in determining that objects are similar, such as font size, upper case/lower case, individual letters being capitalized, and/or the like. Similar objects can also include objects in an unknown language, a phrase, a number, a punctuation mark, and/or the like.

The document processor 112 determines occurrences of similar objects in the identified locations of the plurality of objects between the documents 101. Determining the occurrences of similar objects between documents 101 can be accomplished in various ways. For example, referring to FIG. 3, the documents 101 may comprise documents 101A-101E. The documents 101A-101E are all documents 101 that are scanned by the document scanner 111. The documents 101A-101E each contain a unique word (aaa-eee) and the phrase “the quick brown fox jumps over the lazy dog.” The process of scanning the documents 101A-101E has introduced errors into each of the documents 101A-101E. In this illustrative example, the errors are indicated by a question mark. In document 101A, the words quick, brown, over, and lazy were all scanned with at least one error. These types of errors can be introduced into the documents 101 in various ways. For example, the errors can be based on poor scanning quality, a low scanning resolution (e.g., a low number of Dots Per Inch (DPI)), the document 101 having been scanned multiple times, human spelling errors, and/or the like. The phrase “the quick brown fox jumps over the lazy dog” and the unique words are, in this example, all at common locations (or approximately common locations) between the documents 101A-101E. However, having the objects at common locations is not necessary for this step. In one embodiment, the occurrences are based on the objects being in common locations and in another embodiment the occurrences are only based on the number of occurrences in all of the documents 101A-101E. In addition, the common locations can be determined based on one or more of a distance, a relative distance, and/or a relative angle.

The result of determining occurrences of similar objects between the documents 101A-101E is shown in FIG. 4. In this example, the object “the” has eight occurrences in the documents 101A-101E. The object “jumps” has three occurrences and so forth down to “d?g”, which has one occurrence. The more occurrences of an object, the more likely that the object is a valid object and does not include an error. The advantage to this approach is that it is language agnostic. The document processor 112 identifies objects based on the number of occurrences of the object in the documents 101A-101E. The document processor 112 does not have to know what language is being used. For example, this process could be applied to other languages, such as Spanish, German, French, and/or the like without any changes.
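A minimal sketch of this kind of language-agnostic counting is shown below. The three short documents are hypothetical and are not the contents of FIG. 3; the function name count_occurrences is likewise an assumption, and splitting on whitespace stands in for whatever object identification the system actually performs.

```python
from collections import Counter


def count_occurrences(documents: dict[str, str]) -> Counter:
    """Count how often each whitespace-delimited object appears across all documents.

    No dictionary or language model is used; an object is simply a token.
    """
    counts: Counter = Counter()
    for text in documents.values():
        counts.update(text.split())
    return counts


# Hypothetical scanned documents; "?" marks a character lost to a scanning error.
documents = {
    "doc1": "aaa the quick fox",
    "doc2": "bbb the qu?ck fox",
    "doc3": "ccc the quick f?x",
}

counts = count_occurrences(documents)
print(counts["the"], counts["quick"], counts["fox"], counts["f?x"])  # 3 2 2 1
```

Objects damaged by scanning errors, such as "f?x", end up with low counts, which is what makes the counts usable as an indicator of validity.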

Alternatively, this process could be used on other languages, such as computer programming languages, computer instructions, and/or the like. In this embodiment, the process can be used to identify if similar code has been used between programs. The objects could be computer instructions in binary or instructions in a higher level programming language, such as Hyper Text Markup Language (HTML), Java, JavaScript, C, C++, and/or the like.

In another embodiment, the objects can be physical objects that are in a container (similar to the document 101), such as parts or goods. The objects are identified based on their weights and/or sizes in order to show the occurrences of objects between the containers.

The document processor 112 applies a document sorting algorithm 114 to generate a score for each of the documents 101A-101E. The score for each of the documents 101A-101E is based on the number of occurrences of similar objects between the documents 101A-101E. The document sorting algorithm 114 for determining the score can be implemented in various ways.

For example, the document processor 112 can use a summing algorithm 114 that sums the number of occurrences of similar objects between the documents 101A-101E. Based on the results in FIG. 4, the document processor 112 generates the score for the text “aaa the ?uick ?rown fox jumps ?ver the ?azy dog” in document 101A as follows: 1 (for aaa)+8 (for the)+1 (for ?uick)+1 (for ?rown)+2 (for fox)+3 (for jumps)+1 (for ?ver)+8 (for the)+1 (for ?azy)+3 (for dog), for a total of 29. Similarly, the document processor 112 generates the following scores for the remaining documents 101B-101E.

Document 101B 1+8+3+2+1+3+1+8+1+3=31

Document 101C 1+1+3+1+1+1+3+1+1+3=16

Document 101D 1+8+3+2+2+3+2+8+2+1=32

Document 101E 1+8+1+1+1+1+3+8+2+1=27

The result is the following scores for each of the documents 101A-101E.


	Document	Score
	101A	29
	101B	31
	101C	16
	101D	32
	101E	27
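The summing step can be checked with a short sketch. The per-object counts below are copied from the sums given above for documents 101A-101E; the variable names are illustrative assumptions, not part of the disclosure.

```python
# Occurrence counts for each object in each document, in reading order,
# taken from the sums listed in the text.
occurrence_counts = {
    "101A": [1, 8, 1, 1, 2, 3, 1, 8, 1, 3],
    "101B": [1, 8, 3, 2, 1, 3, 1, 8, 1, 3],
    "101C": [1, 1, 3, 1, 1, 1, 3, 1, 1, 3],
    "101D": [1, 8, 3, 2, 2, 3, 2, 8, 2, 1],
    "101E": [1, 8, 1, 1, 1, 1, 3, 8, 2, 1],
}

# Summing algorithm: a document's score is the sum of its objects' counts.
scores = {doc: sum(counts) for doc, counts in occurrence_counts.items()}

# Rank the documents from highest to lowest score; the top one is the
# candidate template document.
ranked = sorted(scores, key=scores.get, reverse=True)
template = ranked[0]

print(scores)    # {'101A': 29, '101B': 31, '101C': 16, '101D': 32, '101E': 27}
print(ranked)    # ['101D', '101B', '101A', '101E', '101C']
print(template)  # 101D
```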

The documents 101A-101E can be sorted based on the scores. In this example, document 101D has the highest score, which is based on a summing algorithm. The document 101D contains the fewest errors. However, in other embodiments, a document sorting algorithm 114 that multiplies the variables, divides the variables, sums the variables, or subtracts the variables, separately or in combination, can be used to generate a score for each of the documents 101A-101E.
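As one illustration of the alternatives named above, a multiplying variant can be sketched by taking the product of the per-object counts instead of their sum. The counts are the ones listed in the sums for documents 101A-101E; the product-based scoring itself is only an assumed example of "multiplying the variables" and is not spelled out in the text.

```python
import math

# Per-object occurrence counts, taken from the sums listed for 101A-101E.
occurrence_counts = {
    "101A": [1, 8, 1, 1, 2, 3, 1, 8, 1, 3],
    "101B": [1, 8, 3, 2, 1, 3, 1, 8, 1, 3],
    "101C": [1, 1, 3, 1, 1, 1, 3, 1, 1, 3],
    "101D": [1, 8, 3, 2, 2, 3, 2, 8, 2, 1],
    "101E": [1, 8, 1, 1, 1, 1, 3, 8, 2, 1],
}

# Multiplying algorithm: a document's score is the product of its objects' counts.
product_scores = {doc: math.prod(counts) for doc, counts in occurrence_counts.items()}
best = max(product_scores, key=product_scores.get)

print(product_scores)
# {'101A': 1152, '101B': 3456, '101C': 27, '101D': 9216, '101E': 384}
print(best)  # 101D
```

With these counts the multiplying variant selects the same document as the summing algorithm, although it spreads the scores much further apart.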

In addition, other factors can be included in the document sorting algorithm 114, such as scanning resolutions, sources of the documents 101, a type of document 101, a provider of the documents 101, use of color versus black and white scanned documents 101, fonts, font sizes, use of metadata in the documents 101, use of document change tracking, and/or the like. For example, the document sorting algorithm 114 could add an additional number to the score for documents that were scanned in color or use a specific font or combination of fonts. The document sorting algorithm 114 could add or subtract from the score for a document 101 if the document 101 does not use a specific font or was received from a specific vendor. Alternatively, the score could be adjusted based on whether the document 101 was encoded using a specific format, such as HTML, PDF, or GIF. The score could be adjusted based on the document being generated from a specific word processing program or graphical processing program.
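A score adjustment of this kind might look like the sketch below. Everything specific in it is hypothetical: the DocumentInfo fields, the bonus for color scans, and the per-format adjustments are invented for illustration, since the text does not give any concrete values.

```python
from dataclasses import dataclass


@dataclass
class DocumentInfo:
    """Hypothetical per-document attributes used to adjust a base score."""
    base_score: int         # score from the occurrence counts
    scanned_in_color: bool
    file_format: str        # e.g. "HTML", "PDF", "GIF"


# Assumed adjustment values; the disclosure does not specify any numbers.
COLOR_BONUS = 2
FORMAT_ADJUSTMENT = {"PDF": 1, "GIF": -1}


def adjusted_score(doc: DocumentInfo) -> int:
    """Add or subtract from the base score based on document attributes."""
    score = doc.base_score
    if doc.scanned_in_color:
        score += COLOR_BONUS
    score += FORMAT_ADJUSTMENT.get(doc.file_format, 0)
    return score


print(adjusted_score(DocumentInfo(31, True, "PDF")))   # 34
print(adjusted_score(DocumentInfo(32, False, "GIF")))  # 31
```

The usage lines show that such adjustments can change which document ranks highest, so any weights would need to be chosen with care.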

Based on the scores, the document classifier 113 identifies the template document 102. In this example, the document 101D is identified to be the template document 102 because document 101D has the highest score (32). The template document 102 (101D) can then be used to cluster documents 101A-101E along with additional documents 101.

Once the template document 102 has been identified, the first illustrative system 100 can use the documents 101 to make corrections to the template document 102. Error correction in the template document 102 is further illustrated in FIG. 5. In FIG. 5, each of the documents 101A-101E contains one error. For example, in document 101A, the word quick has been scanned with an error (?qick). The document processor 112 determines an amount of certainty for occurrences of similar objects in a common document location between the documents 101A-101E. In FIG. 5, each of the word objects in the documents 101A-101E beginning with aaa-eee through dog are in common document locations between the documents 101A-101E.

The amount of certainty is the degree to which the object occurs in the common locations between the documents 101A-101E relative to the total number of documents (five in this example). The amount of certainty for documents 101A-101E in FIG. 5 is shown in FIG. 6. The word objects the, the, lazy, and dog occur in all the common locations in each document 101A-101E. The word objects over, jumps, fox, brown, and quick only occur four times because each of these word objects contains an error in one of the documents 101A-101E. The remaining word objects only occur once in the documents 101A-101E.

The document processor 112 identifies common object locations based on a minimum certainty threshold value. The minimum certainty threshold value is a minimum number of documents 101A-101E that the object occurs in. The minimum certainty threshold value can be a user defined value that indicates that there is a high probability that the object is correct and has been scanned properly. Alternatively, the minimum certainty threshold value can be defined by the system. In one embodiment, the minimum certainty threshold value can be set based on the type of object. For example, a minimum certainty threshold value of 60% can be defined for graphical objects and a minimum certainty threshold value of 80% can be defined for text objects. The minimum certainty threshold value can be defined based on a file type, a font, a font size, a language, and/or the like.

An 80% minimum certainty threshold value in this example is a good threshold because it indicates that the word object is in at least four of the five documents 101A-101E at the common location between the documents 101A-101E. In FIG. 6, a minimum certainty threshold value of 80% would select the words the, the, lazy, dog, over, jumps, fox, brown, and quick as common objects between documents 101A-101E. The remaining objects would not be considered common objects because their certainty value is only 20%.
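The selection of common objects by a minimum certainty threshold value can be sketched as follows. The 80% and 60% values come from the description; the function name, the use of list positions as locations, and the abbreviated occurrence table (standing in for FIG. 6) are assumptions made for illustration. Integer arithmetic is used so that a count of four out of five meets the 80% threshold exactly.

```python
# Minimum certainty thresholds, in percent, by object type (from the text).
MIN_CERTAINTY_PERCENT = {"text": 80, "graphic": 60}

# Abbreviated occurrence counts per (location, word), standing in for FIG. 6.
occurrences = {
    (0, "aaa"): 1,
    (1, "the"): 5,
    (2, "quick"): 4,
    (2, "?qick"): 1,
    (3, "brown"): 4,
    (4, "fox"): 4,
    (5, "jumps"): 4,
    (5, "?umps"): 1,
    (6, "over"): 4,
    (7, "the"): 5,
    (8, "lazy"): 5,
    (9, "dog"): 5,
}


def common_objects(occurrences, total_documents, object_type="text"):
    """Return the (location, object) keys whose certainty meets the
    minimum certainty threshold for the given object type."""
    threshold = MIN_CERTAINTY_PERCENT[object_type]
    return [
        key
        for key, count in occurrences.items()
        if count * 100 >= threshold * total_documents
    ]


selected = common_objects(occurrences, total_documents=5)
print(" ".join(word for _, word in selected))
```

With five documents, an object needs at least four occurrences to be selected as a text object, while three occurrences would suffice for a graphical object.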

In this example, the word “the” occurs twice in each document 101A-101E at two different locations. In this embodiment, each location of the word “the” is identified as a separate text object with a separate certainty value. However, in an alternative embodiment, the certainty of the word “the” can be based on how often the word “the” occurs in the documents 101A-101E regardless of the number of locations. For example, the word “the” occurs in each of the documents at least once.
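The alternative embodiment, in which certainty is based on how often a word occurs regardless of location, can be sketched as follows. The function name is hypothetical, and the five documents are simplified to the shared sentence with the single error described for document 101A.

```python
def certainty_any_location(word, docs):
    """Fraction of documents that contain `word` at least once,
    at any location."""
    containing = sum(1 for words in docs if word in words)
    return containing / len(docs)


sentence = "the quick brown fox jumps over the lazy dog".split()
docs = [list(sentence) for _ in range(5)]
docs[0][1] = "?qick"  # scanning error in document 101A, per the text

print(certainty_any_location("the", docs))
print(certainty_any_location("quick", docs))
```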

The document processor 112 determines that the template document 102 (assume that document 101D is the template document 102 in this example) contains an error for an individual word object in at least one of the common document locations. In this example, the template document 102 (101D) contains an error in the word jumps (?umps). In response to determining the error in the template document 102 (101D), the document processor 112 replaces the object “?umps” in the template document 102 (101D) with the individual object found at the common location in the other documents 101. In this example, the document processor 112 would replace the object “?umps” in the template document 102 (101D) with the word jumps because the word jumps is determined to be correct because it is common in the other four documents at the same common location. By removing errors from the template document 102, improved clustering of documents 101 can result because the template document 102 now contains fewer errors.
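The replacement of an erroneous object in the template document can be sketched as follows. This is a minimal sketch, assuming that locations are list positions and that an object is treated as correct when it appears at the same location in at least `min_count` of the other documents; the helper names and the errors shown for documents 101B, 101C, and 101E are hypothetical.

```python
from collections import Counter

SENTENCE = "the quick brown fox jumps over the lazy dog".split()


def make_document(first_word, location, scanned_as):
    """Build a document with one scanning error at the given location."""
    words = [first_word] + SENTENCE
    words[location] = scanned_as
    return words


def correct_template(template, others, min_count):
    """Replace an object in the template when a different object occurs
    at the same location in at least `min_count` other documents."""
    corrected = list(template)
    for location, obj in enumerate(template):
        votes = Counter(doc[location] for doc in others if location < len(doc))
        if not votes:
            continue
        best, count = votes.most_common(1)[0]
        if best != obj and count >= min_count:
            corrected[location] = best
    return corrected


template_101d = make_document("ddd", 5, "?umps")
others = [
    make_document("aaa", 2, "?qick"),   # error from the text
    make_document("bbb", 3, "?rown"),   # hypothetical error
    make_document("ccc", 4, "?ox"),     # hypothetical error
    make_document("eee", 6, "?ver"),    # hypothetical error
]

# Four agreeing documents out of five corresponds to the 80% threshold.
corrected = correct_template(template_101d, others, min_count=4)
print(" ".join(corrected))
```

The first word (ddd) is left unchanged because no single alternative at that location reaches the required count.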

In one embodiment, the corrected template document 102 is stored separately by the document processor 112 to process the documents 101 for clustering. However, in other embodiments the actual document 101D is error corrected before clustering. Likewise, this process can be applied to the remaining documents 101A-101C and 101E before clustering.

FIG. 2 is a block diagram of a second illustrative system 200 for clustering documents 101. The second illustrative system 200 is an illustration of the system of FIG. 1 in a networked environment. The second illustrative system 200 comprises a computer 130, a server 131, a network scanner 133, a network 120, and the document manager 110.

The computer 130 can be any computing device, such as a personal computer, a Personal Digital Assistant (PDA), a telephone, a smart telephone, a laptop computer, a tablet computer, and/or the like. The server 131 can be any hardware/software that can manage documents 101/102, such as a file server, a database server, a web server, and/or the like. The server 131 further comprises a database 132. The database 132 can be any type of database, such as a relational database, an object oriented database, a directory service, a file system, and/or the like. The database 132 comprises the documents 101. In this illustrative embodiment, the document manager 110 only comprises the document processor 112 and the document classifier 113.

The document manager 110 is connected to a network 120. The network 120 can be or may include any network that can send and receive information, such as the Internet, a Wide Area Network (WAN), a Local Area Network (LAN), a Voice over IP Network (VoIP), the Public Switched Telephone Network (PSTN), a packet switched network, a circuit switched network, a cellular network, a combination of these, and the like. The network 120 can use a variety of protocols, such as Ethernet, Internet Protocol (IP), 802.11G, Simple Network Management Protocol (SNMP), and the like.

The document processor 112 can receive documents 101 from the devices 130, 131, and 132 on the network 120. For example, a user at computer 130 could create the documents 101 that are either sent directly to the document processor 112 or stored in the database 132. A user at computer 130 could fill out an invoice (document 101) and send the invoice to a company for processing. The invoice (document 101) could then be stored in the database 132 for processing by the document processor 112 as described in FIG. 1.

Alternatively, the network scanner 133 could be used to scan the documents 101 for storage in the database 132. The scanned documents 101 could be sent directly to the document processor 112 from the network scanner 133.

In another embodiment, the document processor 112 can periodically retrieve the documents 101 from the file server 131 via the database 132 for processing. This way, invoices/contracts can be processed based on pay periods or other time periods.

FIG. 7 is a flow diagram of a process for clustering documents. Illustratively, the document manager 110, the document scanner 111, the document processor 112, the document classifier 113, the computer 130, the server 131, and the network scanner 133 are stored-program-controlled entities, such as a computer or processor, which perform the method of FIGS. 7-8 and the processes described herein by executing program instructions stored in a tangible computer readable storage medium, such as a memory or disk. Although the methods described in FIGS. 7-8 are shown in a specific order, one of skill in the art would recognize that the steps in FIGS. 7-8 may be implemented in different orders and/or be implemented in a multi-threaded environment. Moreover, various steps may be omitted or added based on implementation.

The process starts in step 700. The documents 101 are received in step 702. Objects and their locations in the received documents are identified in step 704. Identifying locations of objects can be optionally based on the occurrences of similar objects in common locations between the documents in step 706. Occurrences of similar objects are determined in the identified locations of the objects between the documents in step 708. In addition to the number of occurrences of similar objects between the documents, other factors can be considered, such as fonts, font size, object color, object size, relative object size, weight, three dimensional object size, plural versus singular (i.e., “page” versus “pages” can be considered the same word), misalignment due to use of different fonts/font sizes, and/or the like.

For example, if one vendor uses one font and a second vendor uses a different font for the same document that misaligns the word locations relative to each other in the documents, the process can take into consideration the differences in the fonts/font sizes to compare the documents even though the word locations are different (i.e., a font causes a word object or picture object to move to the next line because the font uses a bigger character size than another font). The process will calculate a different location based on the different font/font size. This process can be accomplished across multiple pages within a document.

Documents with misaligned objects can be initially identified based on word counts between the documents. Even though one of the documents is misaligned, the process of counting words still works in the same manner.
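A word-count comparison that ignores location can be sketched as follows. This is only an illustrative sketch: the function names and the sample lines are hypothetical, and the description does not specify how the word counts are compared. Here the overlap is the number of words the two documents share, counted with multiplicity.

```python
from collections import Counter


def words(lines):
    """Flatten a document, given as lines of text, into its words."""
    return [word for line in lines for word in line.split()]


def word_count_overlap(doc_a, doc_b):
    """Number of words shared between two documents, ignoring location."""
    shared = Counter(words(doc_a)) & Counter(words(doc_b))
    return sum(shared.values())


# The same text wrapped differently, as a larger font might cause.
small_font = ["Monthly invoice to Company XYZ", "for goods shipped"]
large_font = ["Monthly invoice to", "Company XYZ for goods", "shipped"]
unrelated = ["Purchase order for Company XYZ"]

print(word_count_overlap(small_font, large_font))
print(word_count_overlap(small_font, unrelated))
```

The two differently wrapped documents share all eight words even though no line matches, which is the property the word-count comparison relies on.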

An algorithm is applied to generate a score for each of the documents in step 710. The score is generated based on at least a number of occurrences of similar objects between the documents in step 710. The process compares the scores of the documents to identify a template document in step 712. The process clusters the documents based on the template document in step 714. The process ends in step 716.
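Steps 704 through 714 can be sketched end to end as follows. This is a minimal sketch under stated assumptions, not the claimed algorithm: the score of a document is taken to be the number of times its objects occur at the same location in the other documents, and a document is placed in the cluster when it matches the template at a minimum number of locations. The function names, the sample documents, and the `min_matches` parameter are hypothetical.

```python
from collections import Counter


def score_documents(docs):
    """Steps 704-710: score each document by how often its objects
    occur at the same location in the other documents."""
    counts = Counter()
    for doc_words in docs.values():
        for location, word in enumerate(doc_words):
            counts[(location, word)] += 1
    return {
        name: sum(counts[(loc, word)] - 1 for loc, word in enumerate(doc_words))
        for name, doc_words in docs.items()
    }


def identify_template(scores):
    """Step 712: the document with the highest score is the template."""
    return max(scores, key=scores.get)


def cluster(docs, template_name, min_matches):
    """Step 714: keep documents that match the template at enough locations."""
    template = docs[template_name]
    return [
        name
        for name, doc_words in docs.items()
        if sum(a == b for a, b in zip(template, doc_words)) >= min_matches
    ]


# Hypothetical scanned documents: three invoices and one letter.
docs = {
    "inv1": ["Invoice", "Company", "ABC", "Total"],
    "inv2": ["Invoice", "Company", "ABC", "Tota1"],
    "inv3": ["Invoice", "Company", "A8C", "Total"],
    "letter": ["Dear", "Sir", "or", "Madam"],
}

scores = score_documents(docs)
template_name = identify_template(scores)
members = cluster(docs, template_name, min_matches=3)
print(scores, template_name, members)
```

In this sketch the invoice without scanning errors receives the highest score and becomes the template, and the unrelated letter is excluded from the cluster.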

FIG. 8 is a flow diagram of a process for reconstructing a template document. The process described in FIG. 8 goes between step 710 and step 712 of FIG. 7. After applying the algorithm in step 710, the process determines an amount of certainty for one or more occurrences of similar objects in the identified locations of the objects between the documents in step 800. The process identifies, in step 802, common object document location(s) based on a minimum certainty threshold value. The process determines, in step 804, if the template document contains an error(s) at an individual object in the common object document location(s).

If there is not an error in the template document in step 806, the process goes to step 712. If there is an error in the template document in step 806, the process replaces the individual object(s) in the template document at the common document locations with a second object(s). The second object(s) is from the common object location(s) of a second one of the documents that has been determined to be correct. The process goes to step 712.

FIG. 9 is a diagram of two exemplary documents 101F-101G that are used to determine a template document 102. The documents 101F-101G are examples of two invoices from a vendor (Company ABC) that have been sent to Company XYZ for goods shipped to Company XYZ. Document 101F is an invoice document that does not contain any errors. Document 101G is an invoice document that contains several errors that have occurred as a result of being scanned. In the sentence “Monthly invoice to Company XYZ for goods shipped to Company XYZ for the period of:” the word “to” has been scanned as “ta.” In addition, a user has left a note 900 attached to the document 101G that covers part of Company ABC's address when document 101G was scanned. The note 900 also introduces additional text of a “Note to Billing.” These are some common examples of how errors can be introduced as part of the scanning process.

The above described process can be used to compare the documents 101F and 101G along with other documents to generate scores of the documents 101F-101G. The scores can then be used to identify a template document 102. In this example, the document 101F would be identified as the template document because it has fewer errors than the document 101G.

Of course, various changes and modifications to the illustrative embodiment described above will be apparent to those skilled in the art. These changes and modifications can be made without departing from the spirit and the scope of the system and method and without diminishing its attendant advantages. The following claims specify the scope of the invention. Those skilled in the art will appreciate that the features described above can be combined in various ways to form multiple variations of the invention. As a result, the invention is not limited to the specific embodiments described above, but only by the following claims and their equivalents.

INVENTORS:

Ghessassi, Karim

THIS PATENT IS REFERENCED BY THESE PATENTS:

Patent	Priority	Assignee	Title
ER8074,

THIS PATENT REFERENCES THESE PATENTS:

Patent	Priority	Assignee	Title
7430717	Sep 26 2000	International Business Machines Corporation	Method for adapting a K-means text clustering to emerging data
7742953	Apr 01 2004	Kyocera Corporation	Adding information or functionality to a rendered document via association with an electronic counterpart
8005720	Feb 15 2004	Kyocera Corporation	Applying scanned information to identify content
8249871	Nov 18 2005	Microsoft Technology Licensing, LLC	Word clustering for input data
8509525	Apr 06 2011	GOOGLE LLC	Clustering of forms from large-scale scanned-document collection
8843494	Mar 28 2012	Open Text Corporation	Method and system for using keywords to merge document clusters
8954440	Apr 09 2010	Walmart Apollo, LLC	Selectively delivering an article
9507758	Jul 03 2013	ICEBOX INC	Collaborative matter management and analysis
20020083079
20060116994
20070189615
20070217701
20090043824
20100312797
20110137900
20130054620
20130304742
20140029857

ASSIGNMENT RECORDS:

Executed on	Assignor	Assignee	Conveyance	Reel	Frame	Doc
Feb 25 2014	GHESSASSI, KARIM	Digitech Systems Private Reserve, LLC	ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS	032366	0346	pdf
Feb 28 2014		Digitech Systems Private Reserve, LLC	(assignment on the face of the patent)

MAINTENANCE FEES AND DATES:

Date	Maintenance Fee Events
Feb 20 2025	M2551: Payment of Maintenance Fee, 4th Yr, Small Entity.

Date	Maintenance Schedule
Sep 21 2024	4 years fee payment window open
Mar 21 2025	6 months grace period start (w surcharge)
Sep 21 2025	patent expiry (for year 4)
Sep 21 2027	2 years to revive unintentionally abandoned end. (for year 4)
Sep 21 2028	8 years fee payment window open
Mar 21 2029	6 months grace period start (w surcharge)
Sep 21 2029	patent expiry (for year 8)
Sep 21 2031	2 years to revive unintentionally abandoned end. (for year 8)
Sep 21 2032	12 years fee payment window open
Mar 21 2033	6 months grace period start (w surcharge)
Sep 21 2033	patent expiry (for year 12)
Sep 21 2035	2 years to revive unintentionally abandoned end. (for year 12)