A system, a computer readable storage medium including instructions, and a method for generating genre models used to identify genres of a document. For each document image in a set of document images that are associated with one or more genres, the document image is segmented into a plurality of tiles, wherein the tiles in the plurality of tiles are sized so that document page features are identifiable, and features of the document image and the plurality of tiles are computed. At least one genre classifier is trained to classify document images as being associated with one or more genres based on the features of the document images in the set of document images, the features of the plurality of tiles of the set of document images, and the one or more genres associated with each document image in the set of document images.
|
11. A computer readable storage medium storing one or more programs configured for execution by a computer, the one or more programs comprising instructions to:
for each document image in a set of document images that are associated with one or more genres,
segment the document image into a plurality of tiles, wherein the tiles in the plurality of tiles are sized so that document page features are identifiable; and
compute features of the document image and the plurality of tiles; and
train at least one genre classifier to classify document images as being associated with one or more genres based on the features of the document images in the set of document images, the features of the plurality of tiles of the set of document images, and the one or more genres associated with each document image in the set of document images, wherein the instructions to train the at least one genre classifier to classify document images as being associated with a respective genre in the one or more genres include instructions to:
train a first genre classifier corresponding to the respective genre based on the features of a first subset of the set of document images and the features of the plurality of tiles associated with the first subset of the set of document images;
tune parameters of the first genre classifier using a second subset of the set of document images, wherein the first subset and the second subset of the set of document images are mutually-exclusive sets of document images;
train a second genre classifier corresponding to the respective genre based on the features of a second subset of the set of document images and the features of the plurality of tiles associated with the second subset of the set of document images; and
tune parameters of the second genre classifier using the first subset of the set of document images.
1. A computer-implemented method for generating genre models used to identify genres of a document, comprising:
on a computer system having one or more processors executing one or more programs stored on memory of the computer system:
for each document image in a set of document images that are associated with one or more genres,
segmenting the document image into a plurality of tiles, wherein the tiles in the plurality of tiles are sized so that document page features are identifiable; and
computing features of the document image and the plurality of tiles; and
training at least one genre classifier to classify document images as being associated with one or more genres based on the features of the document images in the set of document images, the features of the plurality of tiles of the set of document images, and the one or more genres associated with each document image in the set of document images, wherein training the at least one genre classifier to classify document images as being associated with a respective genre in the one or more genres includes:
training a first genre classifier corresponding to the respective genre based on the features of a first subset of the set of document images and the features of the plurality of tiles associated with the first subset of the set of document images;
tuning parameters of the first genre classifier using a second subset of the set of document images, wherein the first subset and the second subset of the set of document images are mutually-exclusive sets of document images;
training a second genre classifier corresponding to the respective genre based on the features of a second subset of the set of document images and the features of the plurality of tiles associated with the second subset of the set of document images; and
tuning parameters of the second genre classifier using the first subset of the set of document images.
12. A computer-implemented method for identifying genres of a document, comprising:
on a computer system having one or more processors executing one or more programs stored on memory of the computer system:
receiving a document image of the document;
segmenting the document image into a plurality of tiles of the document image, wherein the tiles in the plurality of tiles are sized so that document page features are identifiable;
computing features of the document image and the plurality of tiles; and
identifying one or more genres associated with the document image based on the features of the document image and the features of the plurality of tiles, wherein identifying the one or more genres associated with the document image based on the features of the document image and the features of the plurality of tiles of the document image includes:
applying a first set of genre classifiers to the features of the document image and the features of the plurality of tiles associated with the document image to produce a first set of scores, wherein the first set of genre classifiers is trained based on a first subset of training document images, and wherein parameters of the first set of genre classifiers are tuned based on a second subset of the training document images;
applying a second set of genre classifiers to the features of the document image and the plurality of tiles associated with the document image to produce a second set of scores, wherein the second set of genre classifiers is trained based on the second subset of the training document images, and wherein parameters of the second set of genre classifiers are tuned based on the first subset of the training document images;
combining the first set of scores and the second set of scores to produce a combined set of scores; and
identifying the one or more genres associated with the document image based on the combined set of scores.
19. A computer readable storage medium storing one or more programs configured for execution by a computer, the one or more programs comprising instructions to:
receive a document image of the document;
segment the document image into a plurality of tiles of the document image, wherein the tiles in the plurality of tiles are sized so that document page features are identifiable;
compute features of the document image and the plurality of tiles of the document image; and
identify one or more genres associated with the document image based on the features of the document image and the features of the plurality of tiles of the document image, wherein the instructions to identify the one or more genres associated with the document image based on the features of the document image and the features of the plurality of tiles of the document image include instructions to:
apply a first set of genre classifiers to the features of the document image and the features of the plurality of tiles associated with the document image to produce a first set of scores, wherein the first set of genre classifiers is trained based on a first subset of training document images, and wherein parameters of the first set of genre classifiers are tuned based on a second subset of the training document images;
apply a second set of genre classifiers to the features of the document image and the plurality of tiles associated with the document image to produce a second set of scores, wherein the second set of genre classifiers is trained based on the second subset of the training document images, and wherein parameters of the second set of genre classifiers are tuned based on the first subset of the training document images;
combine the first set of scores and the second set of scores to produce a combined set of scores; and
identify the one or more genres associated with the document image based on the combined set of scores.
20. An imaging system, comprising:
one or more processors;
memory; and
one or more programs stored in the memory, the one or more programs comprising instructions to:
receive a document image of a document;
segment the document image into a plurality of tiles of the document image, wherein the tiles in the plurality of tiles are sized so that document page features are identifiable;
compute features of the document image and the plurality of tiles of the document image; and
identify one or more genres associated with the document image based on the features of the document image and the features of the plurality of tiles of the document image, wherein the instructions to identify the one or more genres associated with the document image based on the features of the document image and the features of the plurality of tiles of the document image include instructions to:
apply a first set of genre classifiers to the features of the document image and the features of the plurality of tiles associated with the document image to produce a first set of scores, wherein the first set of genre classifiers is trained based on a first subset of training document images, and wherein parameters of the first set of genre classifiers are tuned based on a second subset of the training document images;
apply a second set of genre classifiers to the features of the document image and the plurality of tiles associated with the document image to produce a second set of scores, wherein the second set of genre classifiers is trained based on the second subset of the training document images, and wherein parameters of the second set of genre classifiers are tuned based on the first subset of the training document images;
combine the first set of scores and the second set of scores to produce a combined set of scores; and
identify the one or more genres associated with the document image based on the combined set of scores.
2. A computer-implemented method for generating genre models used to identify genres of a document, comprising:
on a computer system having one or more processors executing one or more programs stored on memory of the computer system:
for each document image in a set of document images that are associated with one or more genres,
segmenting the document image into a plurality of tiles, wherein the tiles in the plurality of tiles are sized so that document page features are identifiable; and
computing features of the document image and the plurality of tiles; and
training at least one genre classifier to classify document images as being associated with one or more genres based on the features of the document images in the set of document images, the features of the plurality of tiles of the set of document images, and the one or more genres associated with each document image in the set of document images, wherein training the at least one genre classifier to classify document images as being associated with a respective genre in the one or more genres includes:
for each genre in at least a subset of genres associated with the document images in the set of document images,
selecting a subset of tiles from the set of document images, wherein each tile in the subset of tiles is associated with the genre; and
clustering tiles in the subset of tiles based on the features of the tiles; and
generating a probability model for the genre, wherein the probability model for the genre indicates a likelihood that a respective feature of a respective tile is a member of a cluster of the genre, wherein the probability model is included in a set of probability models, each of which corresponds to a genre in the subset of genres;
for at least a subset of document images in the set of document images, applying probability models to the subset of document images and the plurality of tiles associated with the subset of document images to produce a set of probabilities that respective document images in the subset of document images are members of one or more genres; and
training the respective genre classifier to classify a respective document image as being associated with the respective genre based on the set of probabilities and the one or more genres associated with each document image in the subset of document images.
3. The computer-implemented method of
4. The computer-implemented method of
document page features; and
tile features.
5. The computer-implemented method of
the number of columns of a respective page;
the number of horizontal lines of the respective page;
the number of vertical lines of the respective page;
a histogram of horizontal line lengths of the respective page;
a histogram of vertical line lengths of the respective page;
a page size of the respective page; and
the number of pages of a document.
6. The computer-implemented method of
a density of a respective tile;
a number of rows of text of the respective tile;
an average font size of text of the respective tile;
a median font size of text of the respective tile;
a histogram of row widths of the respective tile;
a subset of values from a color correlogram of the respective tile; and
a physical location of the respective tile in a document image.
7. The computer-implemented method of
8. The computer-implemented method of
9. The computer-implemented method of
10. The computer-implemented method of
13. The computer-implemented method of
14. The computer-implemented method of
15. The computer-implemented method of
16. The computer-implemented method of
17. The computer-implemented method of
a copier;
a scanner;
a facsimile machine;
a digital camera;
a camcorder; and
a mobile phone.
18. The computer-implemented method of
21. The imaging system of
22. The imaging system of
a copier;
a scanner;
a facsimile machine;
a digital camera;
a camcorder; and
a mobile phone.
23. The imaging system of
|
The disclosed embodiments relate generally to classifying documents. More specifically, the disclosed embodiments relate to systems and methods for identifying document genres.
As more business is being conducted electronically, documents are increasingly being converted into electronic form. For example, documents may be scanned by a document scanner to produce an electronic document including digital images of the documents. Electronic documents are beneficial because they require less physical space than paper documents. Furthermore, electronic documents can be easily backed up to prevent accidental loss.
However, as the volume of electronic documents increases, it becomes more difficult to organize the documents. Manually organizing the documents is burdensome and inefficient. One solution to the problem is to perform optical character recognition (OCR) on the electronic documents to extract text in the electronic documents. The extracted text may then be analyzed to determine and/or classify the content of the electronic documents. For example, the content may be classified by topics (e.g., an electronic document may include information about George Washington's birthplace and therefore may be classified under the topic of “George Washington”). Unfortunately, OCR techniques are computationally expensive.
Thus, it is highly desirable to classify documents without the aforementioned problems.
Some embodiments provide a system, a computer readable storage medium including instructions, and a computer-implemented method for generating genre models used to identify genres of a document. For each document image in a set of document images that are associated with one or more genres, the document image is segmented into a plurality of tiles, wherein the tiles in the plurality of tiles are sized so that document page features (e.g., lines of text in a tile, etc.) are identifiable, and features of the document image and the plurality of tiles are computed. At least one genre classifier is trained to classify document images as being associated with one or more genres based on the features of the document images in the set of document images, the features of the plurality of tiles of the set of document images, and the one or more genres associated with each document image in the set of document images.
In some embodiments, a first one of the at least one genre classifiers is trained to classify document images as being associated with a first genre as follows. A subset of document images is identified from the set of document images, each document image in the subset of document images being associated with the first genre. The first genre classifier corresponding to the first genre is trained based on the features of the document images, the features of the plurality of tiles associated with the document images, and information indicating which of the document images correspond to the identified subset of document images associated with the first genre. For at least a subset of the document images in the set of document images, the set of genre classifiers is applied to each of the document images and the plurality of tiles associated with the document images to produce a set of scores. For each genre, a second genre classifier corresponding to the first genre is trained to classify document images as being associated with the first genre based on the set of scores for each document image, the one or more genres associated with each document image, and a location of tiles in the plurality of tiles for each document image.
In some embodiments, a first one of the at least one genre classifiers is trained to classify a respective document image as being associated with one or more genres by performing the following operations for each genre in at least a subset of genres associated with the document images in the set of document images: (1) a subset of tiles from the set of document images is selected, wherein each tile in the subset of tiles is associated with the genre, (2) tiles in the subset of tiles are clustered based on the features of the tiles, and (3) a probability model is generated for the genre, wherein the probability model for the genre indicates a likelihood that a respective feature of a respective tile is a member of a cluster of the genre, wherein the probability model is included in a set of probability models, each of which corresponds to a genre in the subset of genres. For at least a subset of document images in the set of document images, probability models are applied to the subset of document images and the plurality of tiles associated with the subset of document images to produce a set of probabilities that respective document images in the subset of document images are members of one or more genres. The first genre classifier is trained to classify the respective document image as being associated with one or more genres based on the set of probabilities and the one or more genres associated with each document image in the subset of document images.
In some embodiments, a first one of the at least one genre classifiers is trained to classify document images as being associated with a first genre as follows. The first genre classifier corresponding to the first genre is trained based on (1) the features (e.g., document page features and tile features, as described below) of a first subset of the set of document images and (2) the features of the plurality of tiles associated with the first subset of the set of document images. Parameters of the first genre classifier are tuned using a second subset of the set of document images, wherein the first subset and the second subset of the set of document images are mutually-exclusive sets of document images. A second genre classifier corresponding to the first genre is trained based on the features of a second subset of the set of document images and the features of the plurality of tiles associated with the second subset of the set of document images. Parameters of the second genre classifier are tuned using the first subset of the set of document images.
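The train-and-tune arrangement described above can be sketched as follows. This is only an illustration under stated assumptions: the centroid-based scorer is a stand-in for whatever classifier is actually used (it is not a support vector machine), the tuned parameter is assumed to be a single decision threshold, and all function names are hypothetical. Each subset is assumed to contain both positive and negative examples of the genre.

```python
def mean_vector(vectors):
    """Component-wise mean of equal-length feature vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def train_scorer(features, labels):
    """Stand-in trainer (not an SVM): higher scores lean toward the genre."""
    positive = mean_vector([f for f, y in zip(features, labels) if y])
    negative = mean_vector([f for f, y in zip(features, labels) if not y])
    return lambda f: squared_distance(f, negative) - squared_distance(f, positive)


def tune_threshold(scorer, features, labels, candidates):
    """Pick the candidate threshold with the most correct held-out decisions."""
    def correct(threshold):
        return sum((scorer(f) > threshold) == bool(y)
                   for f, y in zip(features, labels))
    return max(candidates, key=correct)


def train_pair(first, second, candidates=(-1.0, 0.0, 1.0)):
    """Train on one subset and tune on the other, then swap the roles.

    `first` and `second` are (features, labels) tuples built from
    mutually exclusive sets of document images.
    """
    first_x, first_y = first
    second_x, second_y = second
    scorer_1 = train_scorer(first_x, first_y)
    model_1 = (scorer_1, tune_threshold(scorer_1, second_x, second_y, candidates))
    scorer_2 = train_scorer(second_x, second_y)
    model_2 = (scorer_2, tune_threshold(scorer_2, first_x, first_y, candidates))
    return model_1, model_2


def predict(model, feature_vector):
    scorer, threshold = model
    return scorer(feature_vector) > threshold
```

Because neither classifier is tuned on the images it was trained on, every image in the set contributes to training one classifier and to tuning the other.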
Some embodiments provide a system, a computer readable storage medium including instructions, and computer-implemented method for identifying genres of a document. A document image of the document is received. The document image is segmented into a plurality of tiles of the document image, wherein the tiles in the plurality of tiles are sized so that document features (e.g., the number of text lines, font height, etc.) are identifiable. Features of the document image and the plurality of tiles are computed. One or more genres associated with the document image are identified based on the features of the document image and the features of the plurality of tiles.
In some embodiments, the one or more genres associated with the document image are identified based on the features of the document image and the features of the plurality of tiles of the document image as follows. A first set of genre classifiers is applied to the features of the document image and the plurality of tiles associated with the document image to produce a set of scores. A second set of genre classifiers is applied to the set of scores of the document image to identify the one or more genres associated with the document image.
In some embodiments, the one or more genres associated with the document image are identified based on the features of the document image and the features of the plurality of tiles of the document image as follows. For each genre, a likelihood that the features of the document image and the features of the plurality of tiles of the document image are members of a cluster of the genre is computed based on a probability model of the genre. A genre classifier is applied to the computed likelihoods to identify the one or more genres associated with the document image.
In some embodiments, the one or more genres associated with the document image are identified based on the features of the document image and the features of the plurality of tiles of the document image as follows. A first set of genre classifiers is applied to the features of the document image and the plurality of tiles associated with the document image to produce a first set of scores. A second set of genre classifiers is applied to the features of the document image and the plurality of tiles associated with the document image to produce a second set of scores. The first set of scores and the second set of scores are combined to produce a combined set of scores. The one or more genres associated with the document image are identified based on the combined set of scores.
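A minimal sketch of the score-combination step follows. The text above does not say how the two sets of scores are combined or how genres are selected from the combined scores, so the averaging rule, the zero threshold, and the function names are all assumptions made for illustration.

```python
def combine_scores(first_scores, second_scores):
    """Average the per-genre scores from the two sets of classifiers.

    Averaging is an assumed combination rule, not one stated in the text.
    """
    return {genre: (first_scores[genre] + second_scores[genre]) / 2
            for genre in first_scores}


def identify_genres(combined_scores, threshold=0.0):
    """Return every genre whose combined score exceeds the (assumed) threshold."""
    return sorted(genre for genre, score in combined_scores.items()
                  if score > threshold)


first = {"form": 1.0, "map": -2.0, "receipt": 0.5}
second = {"form": 0.5, "map": -1.0, "receipt": -1.5}
genres = identify_genres(combine_scores(first, second))
```

Since a document image may be associated with more than one genre, the sketch returns a list rather than a single best-scoring genre.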
Like reference numerals refer to corresponding parts throughout the drawings.
As discussed above, documents are often classified according to topics. However, there are other techniques of classifying documents that provide useful information (e.g., metadata) that may be used when indexing, organizing, searching for, or displaying contents (e.g., ads) based on scanned documents. For example, genres may be used to classify documents. Thus, in some embodiments, documents are classified by genres. Genres may include: advertisements, brochures, casual papers (e.g., newsletters, magazine articles, etc.), flyers, forms, maps, formal papers (e.g., journal papers, etc.), photos, receipts, rules and regulations, reports, resumes, tables, etc. In some embodiments, the documents are indexed by both topic and genre. For example, if documents are indexed by both topic and genre, then a tourist may, for example, search for brochures about geysers in Yellowstone, while a science student may search for papers about geysers in Yellowstone. Similarly, during advertisement placement, if it is recognized that a brochure is being scanned, and the brochure has words such as “Hawaii,” “sand,” and “island,” then advertisements for tourist services in Hawaii or travel agents specializing in Hawaiian vacations may be presented to the user scanning the brochure.
In some embodiments, genres may be characterized by “style,” “form,” and “content.” “Style” corresponds to the structural content, such as the use of punctuation, sentences, and phrases. For example, an editorial has a different style than formal prose, which in turn has a different style than poetry. “Form” includes the structural layout of a document, such as location and number of columns, headings, graphs, and font size. For imaged/scanned documents, form is usually identified using structural layout analysis (e.g., see T. Breuel, “High Performance Document Layout Analysis,” Proc. Symposium on Document Image Understanding Technology, 2003, which is hereby incorporated by reference in its entirety). “Content” refers to the meaning or semantic values in a document, such as the presence of terms and objects in the document.
In some embodiments, genre identification based on features from different modalities (e.g., style, form, and content) is used. These embodiments may be used when computation time and/or complexity is not an issue. However, when computational time and/or complexity are constraints, it is desirable to reduce the modalities used. Thus, in some embodiments, genres associated with an imaged document are based on “form.” In these embodiments, image-based features that can be computed relatively efficiently and relatively robustly are used. Furthermore, layout analysis is not performed. The imaged documents may be captured from hardware such as a document scanner, camera, video camera, facsimile machine, copier, etc. In the case of a camera or video camera, if the image contains other objects in the background, the image may be preprocessed to identify the portion of the image that includes a document page image (e.g., see C. H. Lampert, T. Braun, A. Ulges, D. Keysers, and T. M. Breuel, “Oblivious document capture and realtime retrieval”, Proc. CBDAR2005, pp. 79-86, 2005, for a discussion on preprocessing images). In some embodiments, the classification system described herein may also include “style” and “content” type features. These embodiments may require the use of OCR.
In some embodiments, image features are used to identify genres associated with documents. In these embodiments, the underlying, or latent, types of page regions are identified. These latent types of page regions intuitively correspond to types such as text, photo, column, large font, rules, etc. In some embodiments, Gaussian mixture models are used to assign region label probabilities that correspond to the probability that a given region is of a given latent type (e.g., see N. Rasiwasia and N. Vasconcelos, “Scene classification with low-dimensional semantic spaces and weak supervision,” Proc. IEEE Conference on Computer Vision and Pattern Recognition, Anchorage, June 2008, for an overview on Gaussian mixture models, which is hereby incorporated by reference in its entirety). The region label probabilities are then used as inputs to a classifier that is trained to identify genres of a document.
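As a rough illustration of how a Gaussian mixture model can assign region label probabilities, the sketch below computes the posterior probability of each mixture component (latent region type) for one tile's feature vector. The diagonal covariance, the hand-written component parameters, and the function names are assumptions; a fitted model would supply the weights, means, and variances.

```python
import math


def gaussian_density(x, mean, variance):
    """Density of a diagonal-covariance Gaussian at feature vector x."""
    log_p = 0.0
    for xi, mi, vi in zip(x, mean, variance):
        log_p += -0.5 * (math.log(2 * math.pi * vi) + (xi - mi) ** 2 / vi)
    return math.exp(log_p)


def region_label_probabilities(tile_features, components):
    """Posterior probability of each latent type given a tile's features.

    `components` is a list of (weight, mean, variance) tuples, one per
    latent type. Assumes at least one component has nonzero density.
    """
    weighted = [w * gaussian_density(tile_features, m, v)
                for w, m, v in components]
    total = sum(weighted)
    return [p / total for p in weighted]


# Two hypothetical latent types over a single tile feature.
components = [(0.5, [0.0], [1.0]), (0.5, [4.0], [1.0])]
probabilities = region_label_probabilities([0.0], components)
```

The resulting probability vector, one entry per latent type, is the kind of input the paragraph above describes feeding to a genre classifier.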
Genre identification may be performed by a genre-identification system as described herein. In some embodiments, the genre-identification system addresses one or more of the following issues:
In some embodiments, to address the first issue, the genre-identification system described herein uses “one-against-many” classifiers. That is, for each genre to be identified, a separate classifier is trained to discriminate that genre from all other genres. In some embodiments, to address the second issue, the genre-identification system uses the concept of latent spaces, which correspond to types of document regions (e.g., body text, title text, etc.). In some embodiments, to address the third issue, a classifier that combines the identified genres for each page of a document (e.g., via multiple instance learning or voting) is used. The different page regions may be handled by the use of a latent space.
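The "one-against-many" arrangement can be illustrated by how training labels are built: each genre gets its own binary labeling of the same documents. This is a minimal sketch with hypothetical function names; the classifier that would be trained on each labeling is omitted.

```python
def one_against_many_labels(document_genres, genre):
    """1 for documents tagged with `genre`, 0 for all other documents."""
    return [1 if genre in genres else 0 for genres in document_genres]


def build_label_sets(document_genres):
    """Map each genre to its own binary labeling of the training documents."""
    all_genres = sorted(set().union(*document_genres))
    return {genre: one_against_many_labels(document_genres, genre)
            for genre in all_genres}


# Each training document may be tagged with one or more genres.
documents = [{"brochure"}, {"form", "table"}, {"brochure", "map"}]
labels = build_label_sets(documents)
```

A document tagged with several genres is a positive example for each of those genres' classifiers, which is why separate binary classifiers fit multi-genre documents more naturally than a single multi-class decision.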
In some embodiments, a classification system identifies the one or more genres 108 associated with the images 106 corresponding to the documents 102. The classification system may be included in the imaging system 104 or may be located on a separate device (e.g., a server, etc.). If the classification system is located on a separate device, the images 106 may be transmitted to the separate device through a network (e.g., network 120). Alternatively, the images 106 may be delivered to the separate device using physical media (e.g., CD ROMs, DVDs, flash drives, floppy disks, hard disks, etc.). The classification system is described in more detail with respect to
In some embodiments, the one or more genres 108 are used to display genre specific content 110 on a display device of the imaging system 104. For example, if the one or more genres 108 associated with the images 106 include resumes, the genre specific content 110 may include advertisements for job websites or contact information for recruiters.
In some embodiments, the imaging system 104 queries a server 130 via a network 120 using the one or more genres 108 to obtain the genre specific content 110 from the server 130. The network 120 can generally include any type of wired or wireless communication channel capable of coupling together computing nodes. This includes, but is not limited to, a local area network, a wide area network, or a combination of networks. In some embodiments, the network 120 includes the Internet.
In some embodiments, the imaging system 104 includes an imaging device such as a copier, a scanner, a facsimile machine, a digital camera, a camcorder, or a mobile phone. In these embodiments, the imaging device produces a digital image of the document.
In some embodiments, the one or more genres 108 are used to tag the documents 102 (e.g., using metadata). These tags may then be used to filter and/or sort the documents 102 (e.g., via a query against the tags). Furthermore, tags may be used to organize and/or file the documents 102 (e.g., placing the documents 102 in specified folders, etc.).
In some embodiments, prior to using the classification system, genre classifiers of the classification system are trained on a training system during a training phase using a set of training documents. The set of training documents may already be tagged with one or more genres. Alternatively, the set of training documents may be untagged. In this case, the set of training documents is manually tagged (e.g., by a user, etc.) during the training phase. The training system is described in more detail with respect to
Three training and classification techniques are described below. The first technique is described with respect to
Attention is now directed to the first training and classification technique.
Note that the terms genre classifier and support vector machine (SVM) are used interchangeably in this specification to refer to a classifier that can identify genres of document images as described herein.
In some embodiments, the classification system and the training system are included in the same system. For example, the classification system and the training system can be included in an imaging system (e.g., the imaging system 104 in
The operations of the training stage 401 are performed prior to the operations of the classification stage 402.
The training stage 401 begins when the training system receives (404) training documents and associated genres. As described above, each training document may be associated with one or more genres. The training system scans (406) the training documents to produce a set of document images 407. Alternatively, if the training documents have already been scanned, step 406 is omitted.
In some image-based techniques for identifying genres of documents, layout analysis is used to label and identify the boundaries of different types of document regions (e.g., text, image, ruled, graphics). Features are then extracted based on layout analysis. However, layout analysis is computationally expensive and error-prone. Furthermore, these layout analysis techniques use “small” tiles (e.g., 8 pixel by 8 pixel tiles). Similarly, some image-based techniques for identifying genres of documents identify salient points and perform classification based on the distribution of the features.
In contrast to these techniques, some embodiments segment each page of a document into tiles and extract features for each tile. In some embodiments, the tiles cover all parts of the page. Furthermore, the tiles may overlap each other (e.g., each tile may overlap adjacent tiles by half of a tile). A “page” tile that includes an entire page may also be produced. Moreover, these embodiments use “large” tiles (e.g., 25 tiles for each page).
Thus, for each document image in the set of document images 407, the training system segments (408) the document image into a plurality of tiles 409. In some embodiments, the training system segments the document image into the plurality of tiles 409 so that document page features (e.g., the number of lines of text, font height, etc.) are identifiable.
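By way of illustration only, the segmentation into large, overlapping tiles may be sketched as follows. The function name tile_boxes, the 5-by-5 arrangement, and the pixel dimensions are assumptions chosen to match the examples of 25 tiles per page and half-tile overlap; other arrangements may be used.

```python
def tile_boxes(width, height, tiles_per_side=5, overlap=0.5, include_page=False):
    """Return (left, top, right, bottom) boxes that cover a page.

    Hypothetical helper: tiles_per_side tiles span each axis, and each tile
    overlaps its neighbour by the given fraction of a tile.
    """
    def spans(extent):
        # tile + (n - 1) * stride must equal the page extent.
        tile = extent / (1 + (tiles_per_side - 1) * (1 - overlap))
        stride = tile * (1 - overlap)
        return [(round(i * stride), round(i * stride + tile))
                for i in range(tiles_per_side)]

    boxes = [(x0, y0, x1, y1)
             for (y0, y1) in spans(height)
             for (x0, x1) in spans(width)]
    if include_page:
        # Optional "page" tile that includes the entire page.
        boxes.append((0, 0, width, height))
    return boxes


boxes = tile_boxes(1200, 1500)
```

With a 1200 by 1500 pixel page, each tile in this sketch is 400 by 500 pixels, which is large enough for page features such as text rows to remain identifiable within a tile.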
Attention is now directed to
Returning to
Image density may be computed by converting a page image to a binary image and summing the number of black pixels in each tile.
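As a minimal sketch of the image density feature, assuming a grayscale page stored as rows of 0-255 intensities and an assumed binarization threshold of 128 (neither of which is specified above), the black pixels in a tile may be counted as follows.

```python
def tile_density(gray, box, threshold=128):
    """Count black pixels inside a tile after binarizing a grayscale page.

    gray is a list of rows of 0-255 intensities; box is
    (left, top, right, bottom). The threshold is an assumed cutoff.
    """
    left, top, right, bottom = box
    return sum(1
               for row in gray[top:bottom]
               for px in row[left:right]
               if px < threshold)


page = [[0, 255, 0, 255],
        [255, 255, 0, 0]]
left_density = tile_density(page, (0, 0, 2, 2))
right_density = tile_density(page, (2, 0, 4, 2))
```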
Horizontal lines may be computed by computing run lengths of black pixels in a black and white image horizontally, allowing for short pixel jogs horizontally or vertically (e.g., see K. Y. Wong, R. G. Casey, F. M. Wahl, “Document Analysis System,” IBM Journal of Research and Development, 1982, which is hereby incorporated by reference in its entirety). The number of lines in each tile is noted and the line lengths may be quantized into a histogram. In some embodiments, “logarithmic” quantization bins are used. For example, the quantization bins for the line lengths may be separated into bins as follows: a first bin that includes line lengths between a half of the width of the page and a full width of the page, a second bin that includes line lengths between a quarter of the width of the page and a half of the width of the page, . . . , to a fifth bin that includes line lengths less than one thirty-second of the width of the page, for a total of five bins. Vertical line histograms may be computed similarly. Other quantization bins for line lengths can be used in other embodiments.
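For illustration only, logarithmic quantization of line lengths may be sketched as below. The helper names are hypothetical, and the exact bin edges are an assumption: each bin covers half the range of the previous one, and the final bin collects all shorter lines, so the lower edge of the last bin depends on the number of bins chosen.

```python
def log_bin(length, page_width, num_bins=5):
    """Return the index of the logarithmic bin for a line length.

    Bin 0 holds lengths above half the page width, bin 1 holds lengths
    above a quarter (up to a half), and so on; the last bin is a
    catch-all for shorter lines (assumed edge, see lead-in).
    """
    for b in range(num_bins - 1):
        if length * 2 ** (b + 1) > page_width:
            return b
    return num_bins - 1


def line_length_histogram(lengths, page_width, num_bins=5):
    """Quantize a list of line lengths into a logarithmic histogram."""
    hist = [0] * num_bins
    for length in lengths:
        hist[log_bin(length, page_width, num_bins)] += 1
    return hist


hist = line_length_histogram([1000, 300, 100, 10, 600], page_width=1000)
```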
While extracting tile features such as the number of rows of text and the average and/or median font size, the pixels may be projected horizontally and the text rows may be identified and statistically characterized. This technique is referred to as “projection.”
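The projection technique may be sketched as follows, under the assumptions that the tile is a binary image with 1 denoting a black pixel, that any row containing ink belongs to a text row, and that text-row height serves as a proxy for font size. The function names are illustrative only.

```python
import statistics


def text_rows(binary):
    """Project pixels horizontally and return (top, bottom) runs with ink.

    binary is a list of rows of 0/1 values, where 1 denotes a black pixel;
    bottom is exclusive.
    """
    profile = [sum(row) for row in binary]
    rows, start = [], None
    for y, ink in enumerate(profile):
        if ink > 0 and start is None:
            start = y
        elif ink == 0 and start is not None:
            rows.append((start, y))
            start = None
    if start is not None:
        rows.append((start, len(profile)))
    return rows


def row_stats(binary):
    """Return (number of text rows, mean row height, median row height)."""
    heights = [bottom - top for top, bottom in text_rows(binary)]
    if not heights:
        return 0, 0.0, 0.0
    return len(heights), sum(heights) / len(heights), statistics.median(heights)


tile = [[0, 0, 0],
        [1, 1, 0],
        [1, 0, 1],
        [0, 0, 0],
        [0, 1, 0],
        [0, 0, 0]]
stats = row_stats(tile)
```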
In some embodiments, the images are proportionally scaled to a maximum of 1550 pixels in the horizontal and vertical direction, and then the color correlogram is computed. Feature selection may be performed to reduce the number of dimensions using minimum Redundancy Maximum Relevance (mRMR) Feature Selection (e.g., see H. Peng, F. Long, and C. Ding, “Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 8, pp. 1226-1238, 2005, which is hereby incorporated by reference in its entirety). Since feature values depend in part on tile location (e.g., titles usually appear at the top of a page), the feature selection technique maintains information about the location of tiles. Thus, feature selection may be performed on vectors formed by concatenating tile features. To use these features when clustering tiles, the location of each selected feature within its tile may be used, so that the features are the union of the selected locations across the tiles.
Returning to
Attention is now directed to
In some embodiments, the clustering operation (e.g., step 504) is performed on the computed features for any of the embodiments described with reference to
Note that the tiles are clustered so that the groups roughly correspond to different types of tiles, such as image, text, graphics, large font, or white space. Thus, rather than performing layout analysis, each of the tiles is implicitly “labeled” with image types, where the labeling may be weighted.
For at least a subset of document images in the set of document images 407, the training system applies (508) the probability models to the subset of document images and the plurality of tiles associated with the subset of document images to produce a set of probabilities that respective document images in the subset of document images are members of one or more genres.
The training system then trains (510, 720) the at least one genre classifier (e.g., trained SVMs/genre classifiers 414 in
An alternative technique is to compute the probability of each casual label as illustrated in
where:
$$P(x \mid t) = \sum_{j} \beta_{tj}\, G(x, \mu_{tj}, \Phi_{tj}).$$
This choice reduces the dimensionality of the representation which in turn accelerates SVM training and testing. In each case, the x and y location of the tiles can be added as features to encourage use of tile location information. The classifiers are then trained on labeled feature data. To perform genre identification of a new document page, a set of features is computed for the page and then the probability of each genre is computed. These probabilities are used to derive features for training an SVM classifier.
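As a non-limiting sketch, the mixture likelihood P(x|t) given above may be evaluated as a weighted sum of Gaussian densities. The sketch assumes diagonal covariance matrices for simplicity, which the description does not require, and the function names are hypothetical.

```python
import math


def gaussian_diag(x, mean, var):
    """Density G(x, mu, Phi) of a Gaussian with a diagonal covariance (assumed)."""
    log_p = 0.0
    for xi, mi, vi in zip(x, mean, var):
        log_p += -0.5 * (math.log(2 * math.pi * vi) + (xi - mi) ** 2 / vi)
    return math.exp(log_p)


def mixture_likelihood(x, weights, means, variances):
    """P(x | t) as the sum over components j of beta_tj * G(x, mu_tj, Phi_tj)."""
    return sum(w * gaussian_diag(x, m, v)
               for w, m, v in zip(weights, means, variances))


# Two identical unit-variance components, evaluated at their shared mean.
p = mixture_likelihood([0.0], [0.5, 0.5], [[0.0], [0.0]], [[1.0], [1.0]])
```

Evaluating one such likelihood per label yields a short vector per tile, which is consistent with the reduced dimensionality noted above.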
Each classifier is trained to identify one genre, with exemplars labeled with the genre used as positive examples, and all other exemplars used as negative examples. The different types of genre realizations are implicitly handled by using a max-margin classifier, such as an SVM, optionally with a kernel function that allows for non-contiguous regions, such as a radial basis function (RBF).
Returning to
The classification system segments (420) the document image 419 into a plurality of tiles 421 and computes (422) features 423 of the document image 419 and the plurality of tiles 421. In some embodiments, the classification system segments the document image into the plurality of tiles 421 so that document page features (e.g., the number of lines of text, font height, etc.) are identifiable. The classification system then identifies (424) one or more genres 425 associated with the document image 419 based on the features 423 of the document image 419 and the features 423 of the plurality of tiles 421.
Attention is now directed to
Returning to
In some embodiments, after the classification system obtains the genre specific content, the classification system transmits an electronic message including the genre specific content to one or more specified users. For example, the classification system may transmit the electronic message to the specified users via an electronic mail message, short messaging service (SMS) message, a multimedia messaging service (MMS) message, etc.
In some embodiments, a subset of the document images are associated with a document that includes multiple pages. In these embodiments, the training stage 401 and the classification stage 402 may be performed on each page of the document.
Attention is now directed to the second training and classification technique. Note that the description above relating to features and to segmenting the document also applies to the second technique described below.
The operations of the training stage 431 are performed prior to the operations of the classification stage 432.
The training stage 431 begins when the training system receives (434) training documents and associated genres. As described above, each training document may be associated with one or more genres. The training system scans (436) the training documents to produce a set of document images 437. Alternatively, if the training documents have already been scanned, step 436 is omitted.
For each document image in the set of document images 437, the training system segments (438) the document image into a plurality of tiles 439 and computes (440) the features of the document image and the plurality of tiles 439. In some embodiments, the training system segments the document image into the plurality of tiles 439 so that document page features (e.g., the number of lines of text, font height, etc.) are identifiable.
The training system then trains (442) at least one genre classifier to classify the document images as being associated with a genre based on the features of the document images in the set of document images 437, the features of the plurality of tiles of the set of document images 437, and the one or more genres associated with each document image in the set of document images 437.
Attention is now directed to
For at least a subset of the document images in the set of document images 437, the training system applies (536) the first set of genre classifiers (e.g., the trained first set of SVMs/genre classifiers 443 in
For each genre, the training system trains (538) a second genre classifier corresponding to the genre to classify document images as being associated with the genre based on the set of scores for each document image in the subset of document images, the one or more genres associated with each document image, and a location of tiles in the plurality of tiles for each document image. Thus, a second set of genre classifiers including the second genre classifier for each genre is produced (e.g., the trained second set of SVMs/genre classifiers 444 in
Returning to
The classification system segments (450) the document image 449 into a plurality of tiles 451 and computes (452) features 453 of the document image 449 and the plurality of tiles 451. In some embodiments, the classification system segments the document image into the plurality of tiles 451 so that document page features (e.g., the number of lines of text, font height, etc.) are identifiable. The classification system then identifies (454) one or more genres 455 associated with the document image 449 based on the features 453 of the document image 449 and the features 453 of the plurality of tiles 451.
Attention is now directed to
Alternatively, the scores produced by the first set of genre classifiers for each tile may be used in a voting paradigm to identify page genres.
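One possible voting paradigm, shown here only as a sketch, is uniform voting in which each tile casts a single vote for its highest-scoring genre. The data layout (one dictionary of genre scores per tile) and the function name are assumptions.

```python
from collections import Counter


def vote_page_genre(tile_scores):
    """Pick a page genre by uniform voting over tiles.

    tile_scores is a list with one dict per tile, mapping genre -> score
    from the first set of genre classifiers (assumed layout).
    """
    votes = Counter(max(scores, key=scores.get) for scores in tile_scores)
    return votes.most_common(1)[0][0]


tiles = [{"map": 0.9, "photo": 0.1},
         {"map": 0.2, "photo": 0.7},
         {"map": 0.6, "photo": 0.3}]
genre = vote_page_genre(tiles)
```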
Returning to
In some embodiments, after the classification system obtains the genre specific content, the classification system transmits an electronic message including the genre specific content to one or more specified users. For example, the classification system may transmit the electronic message to the specified users via an electronic mail message, short messaging service (SMS) message, a multimedia messaging service (MMS) message, etc.
In some embodiments, a subset of the document images are associated with a document that includes multiple pages. In these embodiments, the training stage 431 and the classification stage 432 may be performed on each page of the document.
Attention is now directed to the third training and classification technique. Note that the description above relating to features and to segmenting the document also applies to the third technique described below.
The operations of the training stage 461 are performed prior to the operations of the classification stage 462.
The training stage 461 begins when the training system receives (464) training documents and associated genres. As described above, each training document may be associated with one or more genres. The training system scans (466) the training documents to produce a set of document images 467. Alternatively, if the training documents have already been scanned, step 466 is omitted.
For each document image in the set of document images 467, the training system segments (468) the document image into a plurality of tiles 469 and computes (470) the features of the document image and the plurality of tiles 469. In some embodiments, the training system segments the document image into the plurality of tiles 469 so that document page features (e.g., the number of lines of text, font height, etc.) are identifiable.
The training system then trains (472) at least one genre classifier to classify the document images as being associated with a genre based on the features of the document images in the set of document images 467, the features of the plurality of tiles of the set of document images 467, and the one or more genres associated with each document image in the set of document images 467.
Attention is now directed to
Returning to
The classification system segments (480) the document image 479 into a plurality of tiles 481 and computes (482) features 483 of the document image 479 and the plurality of tiles 481. In some embodiments, the classification system segments the document image into the plurality of tiles 481 so that document page features (e.g., the number of lines of text, font height, etc.) are identifiable. In some embodiments, the features 483 include the probabilities/likelihoods described above with respect to FIGS. 4A and 5A-5B. The classification system then identifies (484) one or more genres 485 associated with the document image 479 based on the features 483 of the document image 479 and the features 483 of the plurality of tiles 481.
Attention is now directed to
Returning to
In some embodiments, after the classification system obtains the genre specific content, the classification system transmits an electronic message including the genre specific content to one or more specified users. For example, the classification system may transmit the electronic message to the specified users via an electronic mail message, short messaging service (SMS) message, a multimedia messaging service (MMS) message, etc.
In some embodiments, a subset of the document images are associated with a document that includes multiple pages. In these embodiments, the training stage 461 and the classification stage 462 may be performed on each page of the document.
The methods described in
Each of the above identified elements may be stored in one or more of the previously mentioned memory devices, and corresponds to a set of instructions for performing a function described above. The set of instructions can be executed by one or more processors (e.g., the CPUs 902). The above identified modules or programs (i.e., sets of instructions) need not be implemented as separate software programs, procedures or modules, and thus various subsets of these modules may be combined or otherwise re-arranged in various embodiments. In some embodiments, memory 910 may store a subset of the modules and data structures identified above. Furthermore, memory 910 may store additional modules and data structures not described above.
Although
Each of the above identified elements may be stored in one or more of the previously mentioned memory devices, and corresponds to a set of instructions for performing a function described above. The set of instructions can be executed by one or more processors (e.g., the CPUs 1002). The above identified modules or programs (i.e., sets of instructions) need not be implemented as separate software programs, procedures or modules, and thus various subsets of these modules may be combined or otherwise re-arranged in various embodiments. In some embodiments, memory 1010 may store a subset of the modules and data structures identified above. Furthermore, memory 1010 may store additional modules and data structures not described above.
Although
In some embodiments, the training system 900 and the classification system 1000 are located on the same system (e.g., a copy machine, etc.). In some embodiments, the training system 900 and the classification system 1000 are located on separate systems. For example, the training system 900 may be located on a system of a manufacturer, whereas the classification system 1000 may be located on an end user system.
Handling Weakly Labeled Data
In some embodiments, each document in the training set is manually classified into one of several genres. For example, these genres include: ads, brochures, casual papers, flyers, forms, maps, papers, photos, receipts, rules and regulations, reports, resumes, tables, etc. However, a document may be associated with more than one genre. For example, a one page invitation to a party in the form of a flyer may belong to both the “invitation” genre as well as the “party” genre. Thus, in some embodiments, the classification system described herein identifies one or more genres of a document.
By training the SVMs using a one-against-many model, a page may be classified into more than one genre, which can be desirable depending on the application. Classification into a single class may be performed by any of the standard methods for multi-class SVMs, including classification into the class with the highest decision function value.
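The two uses of the one-against-many decision values may be sketched as follows. The sketch assumes that each genre classifier has already produced a signed decision function value for the page and that a value above an assumed threshold of zero indicates membership; the function names are hypothetical.

```python
def genres_for_page(decision_values, threshold=0.0):
    """Multi-genre output: every genre whose classifier accepts the page."""
    return sorted(g for g, v in decision_values.items() if v > threshold)


def single_genre(decision_values):
    """Single-class output: the genre with the highest decision value."""
    return max(decision_values, key=decision_values.get)


values = {"flyer": 0.8, "form": -0.4, "brochure": 0.3}
multi = genres_for_page(values)
best = single_genre(values)
```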
Evaluation
The classification system described herein was evaluated using data from 599 documents with a total of 3469 pages. The first 20 pages of a document were included if the document was longer than 20 pages. Each document was manually labeled with an appropriate genre.
The data was divided into three parts (train, development, and test) with approximately the same number of documents from a genre assigned to each part. Thus far, the train and development partitions have been used in these experiments.
A first experiment was performed where a Gaussian Mixture Model with eight components, characterized by mean and covariance, was computed for each genre. Each page of the development data was then classified into the class with the largest score after uniform voting by the tiles in the page. The results are shown in
A second experiment using latent classes and an SVM was performed. The latent classes were computed on the training partition and the class models were used to compute the class probabilities for each test page. The jackknife method was used on the development data, wherein the model is trained on all pages except those from one document. The trained model was then evaluated on the pages that were left out. The results for all documents were then combined. These results are summarized using accuracy (e.g., the degree to which the genre determined by the classification system matches the actual genre), precision (e.g., the number of pages correctly identified by the system as belonging to a particular genre divided by the total number of pages identified by the system as belonging to a particular genre), and recall measures (e.g., the number of pages correctly identified by the system as belonging to a particular genre divided by the total number of pages in a corpus that actually belong to a particular genre), as shown in
A third experiment was performed by comparing the techniques described herein with the techniques presented by Kim and Ross (Y. Kim and S. Ross, “Detecting family resemblance: Automated genre classification,” Data Science Journal, 6(2007), pp. S172-S183, 2007, which is hereby incorporated by reference in its entirety). The genres analyzed by Kim and Ross included scientific articles, which are similar to the category “papers” described herein. For their image-based genre classifier, Kim and Ross reported a precision and recall of 0.21 and 0.80, respectively. Kim and Ross also analyzed business reports and reported a precision of 0.56 and recall of 0.636. Kim and Ross (Y. Kim and S. Ross, “Examining variations of prominent features in genre classification,” Proc. of the 41st Annual Hawaii International Conference on System Sciences, p. 132, 2008, which is hereby incorporated by reference in its entirety) computed precision and recall in two different datasets based on image features and reported the best result among three different classifiers, including an SVM for their second dataset. For the genre of business reports, Kim and Ross report precision and recall for their first dataset of 0.273 and 0.2, respectively. The precision and recall for business reports in Kim and Ross' second dataset were 0.385 and 0.05, respectively.
Based on the description in Kim and Ross (2007) and Kim and Ross (2008), an image classifier using a 62×62 grid was implemented, and each region with at least one pixel with a value less than 245 was assigned a value of “0” and other regions were assigned a value of “1.” Two versions of the Weka Naïve Bayes classifier referenced in Kim and Ross (2007) were run on the dataset. The two versions were: (1) plain and (2) with kernel density estimation. For comparative evaluation, we computed F1, the harmonic mean of precision and recall often used in information retrieval (IR):
$$F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$
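As a minimal sketch of these measures, assuming one predicted and one actual genre label per page, precision, recall, and F1 (the harmonic mean 2PR/(P+R)) for a given genre may be computed as follows. The function name and the sample labels are illustrative and are not taken from the experiments.

```python
def precision_recall_f1(predicted, actual, genre):
    """Per-genre precision, recall, and F1 from parallel label lists."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p == genre and a == genre)
    fp = sum(1 for p, a in pairs if p == genre and a != genre)
    fn = sum(1 for p, a in pairs if p != genre and a == genre)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


predicted = ["paper", "paper", "map", "paper"]
actual = ["paper", "map", "map", "paper"]
precision, recall, f1 = precision_recall_f1(predicted, actual, "paper")
```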
As can be observed in
A second corpus of pages labeled with zero or more of five genres was created. The five genres included a brochure genre, a map genre, a paper genre, a photo genre, and a table genre. Over 3000 pages and approximately 2000 labels were used. The corpus was split into three partitions with an approximately equal number of documents in each partition.
In some embodiments, for higher precision, a plurality of random partitions is created, pairs of classifiers are trained and tuned on each partition, and the classification or decision function scores from the different partitions are combined to identify one or more genres.
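The manner of combining scores across partitions is not specified above. One simple possibility, shown purely as an assumption, is to average the decision function scores per genre and then apply a threshold.

```python
def combine_partition_scores(per_partition_scores):
    """Average per-genre decision scores across partitions (assumed rule).

    per_partition_scores is a list with one dict per partition, each
    mapping genre -> decision function score for the same page.
    """
    n = len(per_partition_scores)
    genres = per_partition_scores[0].keys()
    return {g: sum(scores[g] for scores in per_partition_scores) / n
            for g in genres}


combined = combine_partition_scores([{"map": 1.0, "photo": -1.0},
                                     {"map": 0.0, "photo": -0.5}])
identified = sorted(g for g, s in combined.items() if s > 0.0)
```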
The foregoing description, for purpose of explanation, has been described with reference to specific embodiments. However, the illustrative discussions above are not intended to be exhaustive or to limit the invention to the precise forms disclosed. Many modifications and variations are possible in view of the above teachings. The embodiments were chosen and described in order to best explain the principles of the invention and its practical applications, to thereby enable others skilled in the art to best utilize the invention and various embodiments with various modifications as are suited to the particular use contemplated.
Chen, Francine R., Cooper, Matthew, Lu, Yijuan