Representative implementations of devices and techniques provide a system for categorizing electronically stored information without the need for user input, direction, or guidance. In an implementation, the system determines meanings of input textual data items and groups of textual data items, identifies equivalent meanings between textual data items and between groups of textual data items, and outputs user-selected information that is categorized, indexed, and searchable.
|
12. A method, comprising:
importing multiple data items into a computational system arranged to automatically organize portions of the data items into a searchable form;
parsing the multiple data items into components, and the components into textual tokens;
identifying textual tokens having an equivalent meaning based on contextual relationships of the textual tokens within the components of the multiple data items, the relationships including a quantity of textual tokens or textual groups, a position within a set of textual tokens or textual groups, an occurrence of a textual token or textual group in a plurality of components and/or data items, and a sequence of textual tokens or textual groups;
generating a thesaurus tool, based on the equivalent meaning and the relationships;
categorizing the components of the multiple data items based on the identifying and the thesaurus tool, the categorizing including indexing and storing the components;
analyzing the textual tokens to resolve content ambiguity, to refine relationships, and/or to refine categorization;
reviewing the categorization using the indexing, and marking selected textual tokens and/or components;
concurrently propagating the marking throughout the multiple data items to textual tokens and/or components having an equivalent meaning; and
exporting a portion of the multiple data items and/or the components as determined by a user-selected category, index, and/or relationship.
3. A system, comprising:
one or more processors;
an import module arranged to import multiple electronic data items;
a memory hardware device communicatively coupled to the one or more processors;
a content categorization module stored in the memory hardware device and operative on the one or more processors to:
parse the multiple data items into components, and the components into textual tokens;
identify textual tokens having an equivalent meaning based on contextual relationships of the textual tokens within the components of the multiple data items, the relationships including a quantity of textual tokens or textual groups, a position within a set of textual tokens or textual groups, an occurrence of a textual token or textual group in a plurality of components and/or data items, and a sequence of textual tokens or textual groups;
generate a thesaurus tool, based on the equivalent meaning and the relationships;
categorize the components of the multiple data items based on the identifying and the thesaurus tool, the categorizing including indexing and storing the components;
analyze the textual tokens to resolve content ambiguity, to refine relationships, and/or to refine categorization;
review the categorization using the indexing, and mark selected textual tokens and/or components; and
concurrently propagate the marking throughout the multiple data items to textual tokens and/or components having an equivalent meaning; and
an output device arranged to export a portion of the multiple data items and/or the components as determined by a user-selected category, index, and/or relationship.
1. Computer-readable storage media, having computer-executable instructions stored thereon, that when executed, cause one or more computer processors to initiate a process, comprising:
importing multiple data items into a computational system arranged to automatically organize portions of the data items into a searchable form, the data items comprising an email, an electronic file, an electronic document, text message, data from a database, and/or content of a web page;
parsing the multiple data items into components comprising text, images, and/or metadata, and the components into textual tokens comprising single textual characters, symbols, digits, and/or punctuation;
grouping textual tokens into textual groups comprising sets of textual tokens, words, phrases, misspelled words, and/or foreign language words;
identifying textual tokens and textual groups having an equivalent meaning based on contextual relationships of the textual tokens and the textual groups within the components of the multiple data items, the relationships including a quantity of textual tokens or textual groups, a position within a set of textual tokens or textual groups, an occurrence of a textual token or textual group in a plurality of components and/or data items, and a sequence of textual tokens or textual groups;
generating a thesaurus tool of textual tokens and/or textual groups, based on the equivalent meaning and equivalent relationships among and between textual tokens and textual groups;
categorizing the components of the multiple data items based on the identifying and the thesaurus tool, the categorizing including indexing and storing the data items, the components, the textual tokens, and/or the textual groups;
analyzing the textual tokens and textual groups to resolve content ambiguity, to refine relationships, and/or to refine categorization;
reviewing the categorization using the indexing, and marking selected textual tokens, textual groups, and/or components;
concurrently propagating the marking throughout the multiple data items to textual tokens, textual groups, and/or components having an equivalent meaning; and
exporting a portion of the multiple data items and/or the components as determined by user-selected categories, indices, and/or relationships, the exporting including creating one or more reports and/or outputting textual tokens, textual groups, and/or components having an equivalent meaning to a user interface of the computational system.
2. The computer-readable storage media of
4. The system of
5. The system of
6. The system of
7. The system of
8. The system of
9. The system of
10. The system of
11. The system of
13. The method of
14. The method of
15. The method of
16. The method of
17. The method of
18. The method of
19. The method of
20. The method of
|
This application claims the benefit under 35 U.S.C. §119(e)(1) of U.S. Provisional Application No. 61/893,372, filed Oct. 21, 2013, which is hereby incorporated by reference in its entirety.
Various methods exist for analyzing data of all sorts, including electronic documents, for instance, via a technology-based system. However, many of these methods require information, direction, specification, or example documents to be reviewed by a human user and submitted to the system prior to any automated document analysis. Relying on the user input, these systems can use existing methods to search through un-reviewed documents to find documents that in some way “match” the information, direction, specification, or example documents provided to the system by the user.
In such systems, the accuracy and efficiency of the system generally rely on the quality (and often the quantity) of the information, direction, specification, or example documents provided. For example, in some cases, the information provided by a human user may not result in the optimal search results. Additionally, gathering the best example documents for submission to the system can be a time-consuming and otherwise inefficient process in itself.
The detailed description is set forth with reference to the accompanying figures. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. The use of the same reference numbers in different figures indicates similar or identical items.
For this discussion, the devices and systems illustrated in the figures are shown as having a multiplicity of components. Various implementations of devices and/or systems, as described herein, may include fewer components and remain within the scope of the disclosure. Alternately, other implementations of devices and/or systems may include additional components, or various combinations of the described components, and remain within the scope of the disclosure.
Introduction
Representative implementations of devices and techniques provide a system for categorizing electronically stored information without the need for user input, direction, or guidance. In various embodiments, a combination of system components uses natural language analysis, information theory techniques, and/or the like, to concurrently analyze multiple textual data items (e.g., textual tokens) and multiple textual data item groups (e.g., words, phrases, textual token groups and/or sequences, etc.) with a hardware-software co-design. In an implementation, the system determines possible meanings of the textual data items and groups of textual data items, and identifies equivalent meanings between textual data items and between groups of textual data items.
In an embodiment, equivalent meanings are utilized to place electronically stored information into content categories identified by meaning. In another implementation, relationships between content categories are also identified concurrently with content categorization. In the implementation, once data items have been analyzed, categorized, and relationships identified, the system performs an analysis (e.g., utilizing parallel processing in some embodiments to save time) to resolve any content ambiguity and to refine relationships.
In an implementation, the content categories and the identified relationships are reviewed (e.g., by a user) upon completion of categorization. For example, such review can allow the user to refine the assigned content categorization and/or the identification of relationships between content categories of the electronically stored information. User review may include marking data items, text, or categories of electronically stored information.
In an implementation, the user can locate, review, and mark electronically stored information of interest based on content categories and relationships between data items, text, and categories without the need to manually review each and every individual electronically stored information item. At any time, for instance at the conclusion of review, the user can export electronically stored information as data items in one or more formats, categories of electronically stored information in multiple formats, relationships, data, and measurements between and among categories, electronically stored information, and data items. In an embodiment, the user can export the information quickly, confident that unselected electronically stored information does not contain content of interest and that textual data categories selected for export will provide the information needed by the user.
In various implementations, the disclosed techniques and systems are arranged to identify sequences of textual tokens (text characters, digits, symbols, punctuation, or other single character textual input), groups of textual tokens, words, phrases, sentences, etc. across multiple data sources, which have equivalent meaning, based entirely on relationships captured from the data set. There are no predefined token, group, word, phrase, or sentence relationships, no preset user inquiries, and no seed sets needed to train the system. The system identifies sequences of words with equivalent meaning without dependence on exact word matches. With such equivalencies a user gains the following functional benefits:
In the various implementations, the token groups (e.g., words, etc.) in a sequence may be, but need not be, directly adjacent, as other token groups can occur between the token groups in the sequence. The system determines which sequences have equivalent meaning based on:
In various implementations, the disclosed techniques and systems are also arranged to:
The above procedures, techniques, and results are examples, and are not intended to be limiting, but are illustrations for the purposes of discussion. In alternate implementations, variations of the above procedures and techniques can be used to obtain desired, like, or similar results. Further, in some embodiments, the procedures and techniques may not include all of the provisions, or may include more or alternate provisions, and obtain the same or similar results.
Implementations are explained in more detail below using a plurality of examples. Although various implementations and examples are discussed here and below, further implementations and examples may be possible by combining the features and elements of individual implementations and examples.
Example Categorization Execution System
Referring to
In an implementation, the system 100 includes an Electronically Stored Information (ESI) Import Module 110 which imports one or more items of Electronic Data 150, 152 . . . 158 into the system 100. In various embodiments, the data items 150, 152 . . . 158 may be local to the execution system 100 or available through a wired or wireless network. Accordingly, in the various embodiments, the import module 110 is arranged to import the data items 150, 152 . . . 158 via the wired or wireless network (for example, via the Internet, an intranet, a LAN, WAN, etc.). The import module 110 may import millions, billions, or even trillions of items of Electronic Data 150, 152 . . . 158 as represented in
In various embodiments, Electronic Data 150, 152 . . . 158 comprises data items such as electronic documents, electronic mail, text messages, data from a database, or any electronically stored information (e.g., content of a web page, etc.). Such data items 150, 152 . . . 158 can be stored locally, remotely, or acquired through a stream of data provided through a communication channel, or in a way that allows data items 150, 152 . . . 158 to be made available to the Computational System 120.
In an implementation, as shown in
In an implementation, the Content Categorization Application 130 is allocated to multiple (or single) processor(s). In an embodiment, the Content Categorization Application 130 performs analysis routines via the Computational System 120. For example, in an implementation, the Content Categorization Application 130 comprises processor-executable instructions that, when executed on the one or more processors of the Computational System 120, perform one or more analysis routines (as further described below) to analyze and categorize Electronic data items 150, 152 . . . 158. In an implementation, the Content Categorization Application 130 transforms Electronic data items 150, 152 . . . 158 from a set of text and other data into a categorized organization of data based on the meanings of text and other data.
In an implementation, the Content Categorization Application 130 may utilize multiple processors (if available) of the Multiple Processor Computational System 120 to produce the Categorized and Indexed Electronic Data 160. In an embodiment, the Categorized and Indexed Electronic Data 160 includes items and relationships between data items 150, 152 . . . 158, electronically stored information, and categories, as well as other data, which can be stored, for example, prior to exporting. Such other data may include indices capturing relationships among and between data items 150, 152 . . . 158, electronically stored information, and categories, as well as other data (e.g., metadata, etc.). Such indices may be used to find data items 150, 152 . . . 158, electronically stored information, and categories, as well as other data quickly.
The Categorized and Indexed Electronic Data 160 can be exported in a variety of forms, formats, and general representations. Exported Categorized and Indexed Electronic Data 160 can include reports specifying the size, content, analytics, information theory calculations, review history, and other information useful to the user in understanding and describing the exported Categorized and Indexed Electronic Data 160. In an implementation, the export provides the user with a data set based on the meanings of the text and other data within the set of imported data items 150, 152 . . . 158. The transformation from imported data to categorized data allows the user to export data based on the meaning of text and data within the data item set.
In an embodiment, the Content Categorization Application 130 stores category and relationship data as well as analytics and information theory measurements (such as entropy and probabilities, for example) in the Category and Relationship Database 140. In various implementations, the database 140 may be a commercial database, custom database, file repository, or other organization of data and data relationships. In an embodiment, the Categorization Execution System 100 transforms structured and unstructured data into an organized set of content categories containing electronically stored information, data items 150, 152 . . . 158, and other data with equivalent meaning. This transformation results in electronically stored information, data items 150, 152 . . . 158, and other data organized by the meaning of the textual data items 150, 152 . . . 158 and not the specific textual data items 150, 152 . . . 158 present.
In alternate implementations, a Categorization Execution System 100 may include fewer components, additional components, or alternate components to perform the functions discussed herein, or for other desired functionality.
Example Computational System
Referring to
In an implementation, as shown in
In an implementation, as shown in
In an implementation, one or more Visualization Devices 250 may be included with, attached to, or available through a bus, communication network, or other communication mechanism, to the computational system 120. In some examples, devices 240 and 250 may provide data items 150, 152 . . . 158 to the ESI import module 110.
Multiple types of permanent storage may be included in the computational system 120. Permanent Fixed Storage 260 and Permanent Mobile Storage 270 are examples. Permanent Fixed Storage 260 may include, but is not limited to, hard drives, servers, and other types of storage that are primarily meant to be fixed in place (though they can be moved if needed) and are intended for storage between executions of the embodiment. Permanent Mobile Storage 270 includes hard drives, flash memory, disks, tapes, and other types of storage that are primarily meant to be mobile (to be moved from place to place yet permanently store data), though they can be fixed in place if needed, and are intended for storage between executions of the system 100. For example, the Category and Relationship Database 140 and/or the Content Categorization Application 130 may be stored on the storage devices 260 and/or 270.
In an implementation, the computational system 120 may include Peripheral Devices 280, such as fingerprint readers, bar code readers, mobile computing devices such as phones, and all other such hardware devices that can serve as input, output, or storage devices.
In various implementations, the computational system 120 may comprise one of many types and designs of mobile devices capable of receiving and sending messages (such as text messages, multimedia messaging service (MMS) messages, enhanced messaging service (EMS) messages, short message service (SMS) messages, and the like), displaying text and/or graphics, producing audible tones, displaying video, and the like. In some implementations, the mobile device 102 may comprise such devices including, but not limited to: a mobile phone, a smart phone, a tablet device, a set top box, a personal digital assistant (PDA), or the like.
In an implementation, the mobile device may include a User Interface and/or display 250, one or more Processors (CPU, GPU, etc.) 210, 212 . . . 218, an Output Device 230, and a Memory 260, 270. Each of these components may be coupled to a bus structure, such that each component is capable of communicating with or transferring data to and/or from the other components. In various implementations, the Memory 260, 270 may be fully integrated into the mobile device (one or more integrated memory storage devices), or a portion of the Memory 260, 270 may be portable, removable, remote, or the like (such as a memory storage expansion “SD card,” or similar).
In one implementation, the Memory 260, 270 stores a mobile operating system (OS) and one or more mobile applications (“apps”) such as the Content Categorization Application 130. Additionally, the Memory 260, 270 may also store data for the OS or the apps in a Database 140, or other storage organization type.
In alternate implementations, a mobile device may include fewer components, additional components, or alternate components to perform the functions discussed, or for other desired functionality.
Example Implementations
In various implementations, electronically stored information (ESI) comprises information within data items 150, 152 . . . 158. For example, ESI includes text, images, metadata, and other similar components of data items 150, 152 . . . 158. Data items 150, 152 . . . 158 are the source files that are imported by the electronically stored information import module 110, and include such items as document files, emails, records, web pages, and the like.
In the implementations, ESI comprises textual tokens, which include text characters, digits, symbols, punctuation, or other single character textual input. In an embodiment, the system 100 groups textual tokens into textual groups for analysis. For example, the groups of textual tokens (which may include words, phrases and/or sentences, as well as sets, sequences, and arrangements, etc. of textual tokens) may be analyzed to find equivalent meaning across multiple textual groups, as discussed further below. In an example, indices can be used to maintain these equivalencies as well as the connection to the data items 150, 152 . . . 158 where the textual groups occur.
Referring now to
In an embodiment, content analysis may be performed by Textual Group Content Identification Module 320, a module that identifies the meaning of textual tokens and groups of textual tokens using one or more thesaurus files created during analysis of the data set. For example, in an embodiment, the thesaurus files are not preloaded from an existing source, but are generated from analyzing the imported data 150, 152 . . . 158. In an implementation, content of a textual token group is identified by the sequence of textual tokens and their positional relationship to each other.
In order to identify multiple textual token groups within ESI with equivalent meaning, some embodiments use an Equivalent Textual Group Matching Module 330. Textual token groups have the same meaning if the textual tokens in each position within each textual token group have equivalent meaning or are variations of the same textual token (e.g., run and ran). Two textual token groups are identified as equivalent if the two textual token groups occur in the same position within two sequences of textual token groups, where the surrounding textual token groups have been identified as equivalent.
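The positional rule above can be illustrated with a minimal sketch. It is not the disclosed implementation: the function names, the requirement that the two sequences have equal length, and the representation of known equivalences as a set of unordered pairs are all assumptions made for illustration.

```python
def same_meaning(a, b, known):
    """True if a and b are identical or already recorded as equivalent.

    `known` is an assumed structure: a set of frozenset pairs."""
    return a == b or frozenset((a, b)) in known


def infer_equivalence(seq_a, seq_b, known):
    """Return the pair of groups found at the single differing position,
    provided every other aligned position already has the same meaning.
    Return None when the rule does not apply."""
    if len(seq_a) != len(seq_b):
        return None
    differing = [
        i for i, (a, b) in enumerate(zip(seq_a, seq_b))
        if not same_meaning(a, b, known)
    ]
    if len(differing) != 1:
        return None
    i = differing[0]
    return (seq_a[i], seq_b[i])


# "Fred" and "FM" are assumed to have been matched in an earlier pass.
known = {frozenset(("Fred", "FM"))}
pair = infer_equivalence(
    ["Fred", "involved", "NORCO", "contract"],
    ["FM", "involved", "NORCO", "agreement"],
    known,
)
print(pair)
```

In this sketch, each newly inferred pair could be added to the known set so that later comparisons build on earlier ones.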
Once equivalent textual token groups have been identified, some embodiments utilize Common ESI Identification Module 340 integrating natural language processing and information theory to identify common categories of ESI across multiple data items 150, 152 . . . 158, where common ESI may occur in multiple data items and data items may have multiple common ESI. Some embodiments utilize a database and indices to capture, store, and organize common ESI category—data item relationships for fast and accurate retrieval of these relationships.
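As a rough illustration of the many-to-many relationship between common ESI categories and data items, the following sketch keeps two in-memory indices, one for each lookup direction. The class name, method names, and identifiers such as "doc-1" are hypothetical; an actual embodiment may use a database as described above.

```python
from collections import defaultdict


class CategoryIndex:
    """Two-way index: category -> data items and data item -> categories."""

    def __init__(self):
        self.items_by_category = defaultdict(set)
        self.categories_by_item = defaultdict(set)

    def add(self, category, item_id):
        """Record that the data item contains ESI in the category."""
        self.items_by_category[category].add(item_id)
        self.categories_by_item[item_id].add(category)

    def items_in(self, category):
        """Data items containing ESI in the category, in sorted order."""
        return sorted(self.items_by_category.get(category, ()))

    def categories_of(self, item_id):
        """Categories found in the data item, in sorted order."""
        return sorted(self.categories_by_item.get(item_id, ()))


index = CategoryIndex()
index.add("NORCO contract", "doc-1")
index.add("NORCO contract", "doc-2")
index.add("travel plans", "doc-2")
print(index.items_in("NORCO contract"))
print(index.categories_of("doc-2"))
```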
An example embodiment utilizes an ESI Relationship Identification Module 350 to identify multiple, distinct ESI as being related based on common meaning across multiple data items 150, 152 . . . 158. The relationships identified allow the embodiment to quickly use indices to identify ESI relationships across the data items 150, 152 . . . 158 which provide further meaning of the content of multiple data items 150, 152 . . . 158 as a whole.
Continuing with
Entropy and information theory calculations may be used in an embodiment by an ESI Category Ranking Module 370 to rank the categories based on statistical confidence, information diversity, and/or other calculations. Entropy may include measurement of the diversity of textual tokens and textual token groups within ESI and other data items 150, 152 . . . 158, and of the frequency of textual tokens, textual token groups, and textual token group sequences. Less diverse textual token groups and/or textual token group usage may indicate more equivalent meaning across ESI and other data items 150, 152 . . . 158 and thus increase the confidence in equivalent meaning across textual token groups. In some embodiments the confidence in equivalent meaning drives the rankings presented to the user, with other useful information about categories of ESI and other data items 150, 152 . . . 158.
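One common diversity measure is Shannon entropy over token group frequencies. The sketch below assumes that measure and ranks categories from least to most diverse, on the assumption that lower diversity corresponds to higher confidence. The disclosure does not specify the exact calculation, and the function names are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(token_groups):
    """Shannon entropy, in bits, of the frequency distribution of the
    given token groups. An empty input yields 0."""
    counts = Counter(token_groups)
    total = sum(counts.values())
    return -sum(
        (c / total) * math.log2(c / total) for c in counts.values()
    )


def rank_categories(categories):
    """Order category names from least to most diverse (assumed here to
    mean highest to lowest confidence in equivalent meaning)."""
    return sorted(categories, key=lambda name: shannon_entropy(categories[name]))


categories = {
    "uniform": ["contract", "contract", "contract", "contract"],
    "mixed": ["contract", "agreement", "deal", "memo"],
}
print(rank_categories(categories))
```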
In some embodiments a User Category Review Module 380 allows one or more users to mark (via an input device 240, a user interface 250, or the like) categories of equivalent textual token groups, individual textual token groups, and individual data items 150, 152 . . . 158 with designations useful to the user. Such designations then can be propagated through the data items 150, 152 . . . 158 utilizing stored indices to designate equivalent textual token groups, categories of equivalent textual token groups, and data items 150, 152 . . . 158 with the same or a related marking.
Moving on to
In some embodiments, as the Import Electronic Data 410 module accesses ESI, one or more Categorize a Portion of Electronic Data 420, 422 . . . 428 steps begin the categorization of electronically stored data. The Categorize a Portion of Electronic Data 420, 422 . . . 428 steps may perform functions concurrently in order to categorize electronically stored data quickly. Once a user has designated textual token groups, ESI, or data items 150, 152 . . . 158 with a marking, in some embodiments that marking may be propagated to newly imported data items 150, 152 . . . 158 during import by applying the marking to equivalent textual token groups, ESI, and data items 150, 152 . . . 158.
Similarly, in this embodiment the Store Category Data 430, 432 . . . 438 steps perform storage functions concurrently, although this could be done sequentially when implemented on a single processor. While each Categorize a Portion of Electronic Data 420, 422 . . . 428 action operates on a distinct portion of ESI, the Store Category Data 430, 432 . . . 438 steps may perform storage for one or more Categorize a Portion of Electronic Data 420, 422 . . . 428 steps. Parallel storage may be performed without delaying categorization of newly imported ESI by Categorize a Portion of Electronic Data 420, 422 . . . 428 steps. Sequential storage operations may be used in embodiments with a single processor.
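The division of work between concurrent categorization and storage might look like the following sketch, which uses a thread pool from the Python standard library. The categorization function is a deliberately trivial stand-in (it keys each text by its sorted lowercase words), and all names are hypothetical. Storage here is done by a single consumer as results arrive, which corresponds to one store step serving several categorize steps.

```python
from concurrent.futures import ThreadPoolExecutor


def categorize_portion(portion):
    """Hypothetical stand-in for a categorize step: pair each text with
    a category key built from its sorted lowercase words."""
    return [(" ".join(sorted(text.lower().split())), text) for text in portion]


def categorize_and_store(portions, max_workers=4):
    """Categorize distinct portions concurrently, then store results.

    Setting max_workers=1 approximates the single-processor case."""
    store = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input portions.
        for results in pool.map(categorize_portion, portions):
            for key, text in results:
                store.setdefault(key, []).append(text)
    return store


portions = [["NORCO contract"], ["contract NORCO", "lunch menu"]]
print(categorize_and_store(portions))
```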
Continuing with
Additionally, as shown in
In an embodiment, after review and marking, the user may export designated categories, textual token groups, and/or data items 150, 152 . . . 158 using the Export Selected Electronic Data 460 step. The export may include textual tokens, textual token groups, categories, and data items 150, 152 . . . 158 in a variety of electronic formats supplemented by data collected or calculated during analysis or user actions. An audit report may also be exported which provides information as to user actions within the system 100 during use.
An embodiment may include an Import ESI to Categorize 510 step, which accesses files, directories, repositories, devices, and, in general, any storage where ESI may be located and in a variety of electronic formats. In an implementation, the ESI is read or otherwise imported (via the ESI import module 110) from one or more data items 150, 152 . . . 158. In this embodiment, such import reads text and other data from the data item(s) which may be grouped, analyzed, categorized, related, reviewed by one or more users, and exported in a variety of formats with associated analysis, review, and audit data. For example, a series of textual tokens imported from a data item might be: “Fred was involved in the NORCO contract” which results in a series of textual tokens such as “Fred” “was” “involved” “in” “the” “NORCO” and “contract”.
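The word-level split in the example above can be reproduced with a short sketch. The regular expression and the function name are assumptions; the sketch keeps runs of word characters together and emits punctuation marks as separate tokens.

```python
import re


def split_into_tokens(text):
    """Split text into runs of word characters and single punctuation
    marks, discarding whitespace (an assumed tokenization rule)."""
    return re.findall(r"\w+|[^\w\s]", text)


tokens = split_into_tokens("Fred was involved in the NORCO contract")
print(tokens)
```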
As ESI is imported, an embodiment of the process 500 may include an ESI Content Identification 520 step, which identifies one or more types of content from ESI imported from each data item by placing textual tokens into groups, identifying equivalent textual tokens, and identifying equivalent groups of textual tokens, within each data item. In an embodiment, content identification allows for grouping of equivalent content based on textual token group meaning. Continuing the example, the series of textual tokens previously imported might be placed in a group such as {“Fred” “involved” “NORCO” “contract”}. If import resulted in two additional groups of textual tokens, such as {“FM” “handled” “NORCO” “contract”} and {“Fred” “worked” “NORCO” “agreement”}, then all three groups of textual tokens would be considered to have equivalent content.
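To make the equivalence of the three example groups concrete, the sketch below maps each token to a canonical representative and compares the results. The thesaurus here is written by hand purely for illustration; in the described system it would be generated from the imported data rather than supplied in advance. The names are hypothetical.

```python
# Hand-written stand-in for a thesaurus that the system would derive
# from the data set itself (assumption for illustration only).
THESAURUS = {
    "FM": "Fred",
    "handled": "involved",
    "worked": "involved",
    "agreement": "contract",
}


def canonical_group(group):
    """Replace each token with its canonical representative, if any."""
    return tuple(THESAURUS.get(token, token) for token in group)


groups = [
    ("Fred", "involved", "NORCO", "contract"),
    ("FM", "handled", "NORCO", "contract"),
    ("Fred", "worked", "NORCO", "agreement"),
]
canonical_forms = {canonical_group(g) for g in groups}
print(canonical_forms)
```

When all groups reduce to a single canonical form, as they do here, they can be treated as having equivalent content.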
Once content has been identified within ESI, an embodiment of the ESI Categorization and Storage 530 can be used to place textual tokens, textual token groups, ESI, and data items 150, 152 . . . 158 into categories where textual tokens, textual token groups, ESI, and data items 150, 152 . . . 158 have equivalent meaning. Categorization in some embodiments will store categorization data for each textual token, group of textual tokens, ESI, and data item within a database or other repository for later use. ESI and other data items 150, 152 . . . 158 may be placed in more than one category if textual data within the ESI or data item contains content with multiple meanings. Again continuing the example, the data items 150, 152 . . . 158 containing the textual token groupings {“Fred” “involved” “NORCO” “contract”}, {“FM” “handled” “NORCO” “contract”}, and {“Fred” “worked” “NORCO” “agreement”} would be placed in the same category of equivalent content.
Continuing with
In this embodiment the user has the ability to review textual token groups, categories, data items 150, 152 . . . 158, and the relationships between and among textual token groups, categories, and data items 150, 152 . . . 158 in the User Category Review 550 step. In this example step the user reviews by marking, confirming, re-categorizing, or eliminating categories from the analyzed data items 150, 152 . . . 158. Continuing the example, if the user marked the textual token group {“Fred” “involved” “NORCO” “contract”} as relevant, all textual token groups within the category that included {“Fred” “involved” “NORCO” “contract”} would be marked relevant, thus marking {“FM” “handled” “NORCO” “contract”} and {“Fred” “worked” “NORCO” “agreement”} relevant, as well as all the data items 150, 152 . . . 158 containing these textual word groups or their equivalents.
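The propagation described in this example can be sketched with two lookup tables: one from each token group to its category, and one from each token group to the data items that contain it. Both tables, the identifiers such as "cat-1" and "doc-a", and the function name are hypothetical.

```python
def propagate_mark(marked_group, mark, category_of, items_containing):
    """Apply `mark` to every group in the marked group's category and to
    every data item that contains any of those groups."""
    category = category_of[marked_group]
    groups = {g for g, c in category_of.items() if c == category}
    items = set()
    for g in groups:
        items |= items_containing.get(g, set())
    return {g: mark for g in groups}, {i: mark for i in items}


category_of = {
    ("Fred", "involved", "NORCO", "contract"): "cat-1",
    ("FM", "handled", "NORCO", "contract"): "cat-1",
    ("Fred", "worked", "NORCO", "agreement"): "cat-1",
    ("lunch", "menu"): "cat-2",
}
items_containing = {
    ("Fred", "involved", "NORCO", "contract"): {"doc-a"},
    ("FM", "handled", "NORCO", "contract"): {"doc-b"},
    ("Fred", "worked", "NORCO", "agreement"): {"doc-b", "doc-c"},
    ("lunch", "menu"): {"doc-d"},
}
group_marks, item_marks = propagate_mark(
    ("Fred", "involved", "NORCO", "contract"),
    "relevant",
    category_of,
    items_containing,
)
print(sorted(item_marks))
```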
In this embodiment the user can initiate the ESI Export 560 step to export the categories and ESI needed for production, import into another software tool, or storage for further use. In this embodiment the export can be in a variety of formats including, but not limited to, native format matching the import data item format, PDF format, or a variety of other data formats as needed. This example embodiment utilizes textual token group, category, and data item markings performed by the user, accomplished by propagation of user markings, or both. If the user selects all relevant data items 150, 152 . . . 158 for export, data items 150, 152 . . . 158 in categories marked relevant will be selected for export, whether the textual word groups or data items 150, 152 . . . 158 in those categories were directly marked by a user action or marked by propagating a user marking to equivalent textual word groups, categories, or data items 150, 152 . . . 158. Following the ESI Export 560 step within the Document Categorization Workflow 500, the workflow completes with Done 570.
As example embodiment in
Continuing with
In this embodiment the ESI Textual Grouping Process 620 . . . 628 steps analyze textual data by position within a sequence of textual tokens and in relation to other textual data in one or more data items 150, 152 . . . 158 to identify a textual token group containing two or more textual tokens. In this embodiment the text within a data item is imported as a sequence of textual tokens which are grouped into a textual token group with meaning when grouped together. Some textual tokens such as “a,” “an,” and “the,” among others, can be ignored when grouping textual tokens from a data item. While this embodiment benefits from parallel processing, some embodiments using a single processor can implement this step.
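As a rough illustration of grouping by position, the sketch below drops the ignorable tokens named above and then forms groups from adjacent tokens. The fixed-size sliding window is an assumption chosen for brevity; the text does not prescribe how group boundaries are chosen, and `token_groups` is a hypothetical name.

```python
IGNORED = {"a", "an", "the"}  # tokens named in the text; a real list may be longer

def token_groups(tokens, size=2):
    """Return each run of `size` adjacent tokens after dropping ignored ones."""
    kept = [t for t in tokens if t.lower() not in IGNORED]
    return [tuple(kept[i:i + size]) for i in range(len(kept) - size + 1)]

tokens = "Fred handled the NORCO contract".split()
groups = token_groups(tokens, size=2)
```

Because each data item can be grouped independently, a step of this kind could be distributed across processors or run sequentially on one.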
Again continuing with
Equivalencies, when found in this embodiment, will be stored for use in the Equivalent Content Processes 640 . . . 648 steps. In this embodiment the Equivalent Content Processes 640 . . . 648 steps determine equivalent textual token group content across textual token groups composed of different textual tokens with equivalent meaning or different textual token sequences using equivalent textual tokens, or symmetrical textual token positioning, textual token type or frequency, and other data gathered in previous steps.
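One simple way to picture equivalence between groups that use different tokens or a different token order is to compare the groups after mapping each token to a canonical form and ignoring sequence. This is an illustrative assumption, not the disclosed test; `TOKEN_EQUIVALENTS` and `groups_equivalent` are hypothetical names.

```python
from collections import Counter

# Hypothetical stored equivalencies from earlier steps.
TOKEN_EQUIVALENTS = {"FM": "Fred", "agreement": "contract"}

def groups_equivalent(a, b):
    """True when both groups hold the same canonical tokens, in any order."""
    def canon(group):
        return Counter(TOKEN_EQUIVALENTS.get(t, t) for t in group)
    return canon(a) == canon(b)

same = groups_equivalent(
    ("NORCO", "contract", "Fred"), ("FM", "NORCO", "agreement")
)
different = groups_equivalent(("Fred", "NORCO"), ("Fred", "contract"))
```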
In an embodiment, Equivalent Content Process 640 . . . 648 steps identify equivalent content for use in ESI Common Content Process 650 . . . 658 steps. In some embodiments the ESI Common Content Process 650 . . . 658 steps identify common content across data items 150, 152 . . . 158 from equivalent textual tokens and textual token groups. Once ESI Common Content Process 650 . . . 658 steps are complete, the Content Categorization Sub-workflow: ESI Content Identification 600 completes. Completion of ESI Content Identification 600 in this embodiment results in identifying multiple categories of content within ESI and data items 150, 152 . . . 158. In addition, equivalent content may be found in multiple, different ESI and data items 150, 152 . . . 158. The transformation of textual token sequences into equivalent textual token groups, equivalent contents, and content relationships across ESI and data items 150, 152 . . . 158 leaves the ESI and data items 150, 152 . . . 158 ready for content categorization.
In an implementation, an embodiment of ESI Categorization and Storage 700 begins with ESI Content Identification 520 step as depicted in
In this embodiment of ESI Category Identification Processes 720 . . . 728 steps, identified common content is grouped into a set of categories based on the content of the textual tokens and textual token groups within the imported ESI. In some embodiments, these categories represent high-level concepts common to more than one data item. These categories of concepts are represented by multiple groups of textual tokens using equivalent textual tokens, equivalent groups of textual tokens, and positional information of textual tokens within groups.
Continuing with
In this embodiment, upon completion of one or more ESI Category Storage Processes 730 . . . 738, the Identify ESI Category Relationships 740 step begins identification of category relationships by considering temporal, positional, and equivalence data derived from textual tokens and textual token groups as well as metadata from data items 150, 152 . . . 158. In this embodiment, relationships are assessed using analytic calculations, including probabilities and frequencies, and entropy calculations. These calculations may be used to rank textual token groups and categories as to likelihood of equivalent meaning, and to provide a confidence measurement of the consistency of user markings during review.
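The text does not specify which entropy measure is used. As one plausible reading, the sketch below applies Shannon entropy to the user markings observed within a category: an entropy of zero means every marking agrees, and higher values indicate less consistent marking. The function name and the sample labels are assumptions.

```python
import math
from collections import Counter

def shannon_entropy(observations):
    """Entropy in bits of the empirical distribution of `observations`."""
    counts = Counter(observations)
    total = sum(counts.values())
    # p * log2(1/p), summed over the distinct observed values
    return sum((n / total) * math.log2(total / n) for n in counts.values())

consistent = shannon_entropy(["relevant"] * 4)
split = shannon_entropy(["relevant", "relevant", "not relevant", "not relevant"])

# Rank hypothetical categories from most to least consistently marked.
category_markings = {
    "A": ["relevant", "relevant", "not relevant", "not relevant"],
    "B": ["relevant"] * 4,
}
ranked = sorted(category_markings, key=lambda c: shannon_entropy(category_markings[c]))
```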
Concluding
An embodiment of Analytics and Entropy Calculation 800 begins with conclusion of ESI Categorization and Storage 530 depicted in
In the embodiment shown in
Continuing with
In an embodiment, the Calculate Category Analytics and Entropy 840 step makes use of updated textual token, textual token group, and ESI calculations to revise category analytics and entropy. In some embodiments such revision may be minor when the calculations performed previously in Calculate ESI Analytics and Entropy 830 can be combined from or already exist from previous analytics and entropy calculations for one or more identified categories. In some embodiments the Calculate Category Analytics and Entropy 840 step normalizes calculations by utilizing unique textual token, unique textual token group, and ESI data, which eliminates the influence of duplicative textual tokens, duplicative textual token groups, and duplicative ESI on frequency and probability calculations. In some embodiments this step serves as an accuracy and precision check for previously calculated and stored analytics and entropy.
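The effect of normalizing on unique data can be shown with a small frequency count. In this sketch, which assumes that "duplicative ESI" means data items with identical token-group content, duplicates are collapsed before counting so that copies of one item do not inflate frequencies. The name `group_frequencies` is hypothetical.

```python
from collections import Counter

def group_frequencies(items, deduplicate=True):
    """Count in how many data items each token group occurs.

    With `deduplicate` set, items with identical content are counted once.
    Returns the counts and the number of items counted.
    """
    contents = [frozenset(groups) for groups in items]
    if deduplicate:
        contents = set(contents)
    counts = Counter(g for content in contents for g in content)
    return counts, len(contents)

items = [
    [("NORCO", "contract")],
    [("NORCO", "contract")],  # exact duplicate of the first item
    [("NORCO", "contract")],  # another duplicate
    [("lunch", "menu")],
]
raw, raw_total = group_frequencies(items, deduplicate=False)
unique, unique_total = group_frequencies(items)
```

Here the raw probability of the group ("NORCO", "contract") is 3/4, while the de-duplicated probability is 1/2.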
Continuing to describe the example embodiment shown in
Referring to
In an implementation, a user can provide input (via input device 240, for example) and revise textual token, textual token group, categories, ESI, and data item relationships during review by marking textual tokens, textual token groups, categories, ESI, and data items 150, 152 . . . 158. In this example embodiment a user may move textual token groups into or out of equivalent categories to more accurately reflect equivalent meaning. A user may also revise relationships between and among textual tokens, textual token groups, categories, ESI, and data items 150, 152 . . . 158. In this embodiment, such revisions will propagate through the data set, changing equivalencies between textual tokens, textual token groups, categories, ESI, and data items 150, 152 . . . 158 to accurately reflect user actions. The transformation of data items 150, 152 . . . 158 into categories of textual content allows user actions to be reflected across the textual token groups, categories, ESI, and data items 150, 152 . . . 158.
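A user revision of this kind can be pictured as moving a group between category sets, after which later marking propagation follows the new membership. The sketch below is illustrative only; the category names and `move_group` are hypothetical.

```python
def move_group(categories, group, target):
    """Remove `group` from every category, then add it to `target`."""
    for members in categories.values():
        members.discard(group)
    categories.setdefault(target, set()).add(group)
    return categories

categories = {
    "norco-contract": {
        ("Fred", "involved", "NORCO", "contract"),
        ("Fred", "worked", "NORCO", "agreement"),
    },
    "norco-other": set(),
}
move_group(categories, ("Fred", "worked", "NORCO", "agreement"), "norco-other")
```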
In
As shown in
Referring to
In an implementation, Get User Export Specifications 1020 prompts the user for export specifications, performs a consistency check against user project requirements (such as required confidence levels, ESI review thoroughness, and other user defined project requirements) and then creates export specifications. These specifications in this embodiment provide the parameters needed to retrieve categories, ESI, and data items 150, 152 . . . 158 as well as analytics, entropy, and audit data.
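The consistency check against project requirements might look like the following sketch. Every field name, threshold, and message here is an assumption introduced for illustration; the text names only the kinds of requirement (confidence levels, review thoroughness, and other user-defined requirements).

```python
from dataclasses import dataclass

@dataclass
class ExportSpec:
    categories: list
    output_format: str
    confidence: float         # confidence level reached during review, 0..1
    reviewed_fraction: float  # share of ESI the user reviewed, 0..1

def check_spec(spec, requirements):
    """Return a list of problems; an empty list means the spec is consistent."""
    problems = []
    if not spec.categories:
        problems.append("no categories selected")
    if spec.confidence < requirements["min_confidence"]:
        problems.append("confidence below project requirement")
    if spec.reviewed_fraction < requirements["min_reviewed_fraction"]:
        problems.append("review less thorough than required")
    return problems

requirements = {"min_confidence": 0.95, "min_reviewed_fraction": 0.5}
ok = check_spec(ExportSpec(["norco-contract"], "pdf", 0.97, 0.8), requirements)
bad = check_spec(ExportSpec(["norco-contract"], "pdf", 0.90, 0.8), requirements)
```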
Using the user-input export specification, the example embodiment of Retrieve User Specified ESI 1030 retrieves categories, ESI, and data items 150, 152 . . . 158 as well as analytics, entropy, and audit data from a repository which may be a database, internal memory, external permanent storage, or other data storage medium. In this embodiment the export specification will be placed in the repository for auditing or other archival purposes.
The example embodiment of the Calculate Export ESI Analytics & Entropy 1040 step analyzes the export specification, exported data, and exported ESI and data items 150, 152 . . . 158, as well as other needed information, in order to calculate analytics and entropy measurements specific to the exported ESI and data items 150, 152 . . . 158. In this embodiment the analytics and entropy measurements may include the percentage of textual tokens, textual token groups, categories, ESI, and data items 150, 152 . . . 158 exported; the percentage of textual tokens, textual token groups, categories, ESI items, and data items 150, 152 . . . 158 exported that were user reviewed, marked, remarked, and revised; analytics such as confidence level in user marking of each textual token group, category, ESI item, and data item 150, 152 . . . 158; raw data as to the markings of textual token groups, categories, ESI items, and data items 150, 152 . . . 158; and other user requested measurements.
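Two of the measurements listed above, the percentage of data items exported and the percentage of exported items that were user reviewed, reduce to simple set arithmetic. The sketch below assumes items are tracked by id; the function and key names are hypothetical.

```python
def export_analytics(all_ids, exported_ids, reviewed_ids):
    """Compute export percentages over data item ids."""
    universe = set(all_ids)
    exported = set(exported_ids) & universe
    reviewed = exported & set(reviewed_ids)
    pct_exported = 100.0 * len(exported) / len(universe) if universe else 0.0
    pct_reviewed = 100.0 * len(reviewed) / len(exported) if exported else 0.0
    return {"pct_exported": pct_exported, "pct_exported_reviewed": pct_reviewed}

stats = export_analytics(
    all_ids=[150, 152, 154, 158],
    exported_ids=[150, 152],
    reviewed_ids=[150],
)
```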
Once Calculate Export ESI Analytics & Entropy 1040 completes, the example embodiment of the Export ESI 1050 step accomplishes the task of exporting the user selected ESI and data items 150, 152 . . . 158. In this embodiment such export will produce ESI and data items 150, 152 . . . 158 in user specified formats, including plain text, word processing formats such as Microsoft Word, PDF, or other user defined file formats.
The example embodiment of the ESI Export 1000 method ends with the Create Exported ESI Report 1050 step. An ESI report provides a user-friendly and informative report describing user review actions; textual token group, category, ESI, and data item markings; the textual token groups, categories, ESI items, and data items 150, 152 . . . 158 exported; and analytics and entropy measures from data items 150, 152 . . . 158 and the export specifications.
Various embodiments allow for alteration in implementation of steps and the sequence of steps described herein. An embodiment may be implemented that alters one or more of the methods described herein; methods and steps may be removed or replaced with other steps and still be within the scope of the disclosure. Any of the steps or methods described herein may be combined, moved, or modified for other methods described herein and still be within the scope of the disclosure. In an embodiment the elements, steps, or methods of the
Some of the previously described methods and steps may be composed of instructions stored on storage media, whether permanent fixed, permanent mobile, or any other such media. The instructions may be retrieved and executed in a computing environment such as the multiprocessor or single processor computing environment.
It is noted here that the computing environment embodying the disclosure may be reconfigured or repurposed to implement various embodiments providing an environment for transforming textual data items 150, 152 . . . 158 into categorized content, allowing users to search, sort, review, or otherwise interact with the data items 150, 152 . . . 158 based on the meaning of textual tokens within the data items 150, 152 . . . 158. While some embodiments take advantage of multiprocessor computing environments for faster execution, an embodiment can be executed effectively on a single processor computing environment, and therefore will still be within the scope of this disclosure.
The previous method and step descriptions are illustrative and not restrictive in any sense, interpretation, or meaning. The scope of this disclosure should not be limited in any way by the embodiments described herein. Instead, the embodiments described herein should be understood to include the ability to categorize content generally, without the need for user input or direction as to desired outcomes, and functional ability to perform these and other generally described methods and steps quickly, accurately, and completely in a single or multiprocessor computing environment. The present descriptions of this disclosure should be further understood to cover modifications, alternatives, and equivalent methods, steps, and functions within the spirit and purpose of the disclosure.
Portions of the subject matter of this disclosure can be implemented as a system, method, apparatus, or article of manufacture using standard programming and/or engineering techniques to produce software, firmware, hardware or any combination thereof to control a computer or processor (such as included in computational system 120, for example) to implement the disclosure. For example, portions of an example system 100 may be implemented using any form of computer-readable media (shown as fixed and mobile storage 260, 270 in
Computer-readable storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules or other data. Memory (permanent fixed storage) 260 is an example of computer-readable storage media. Permanent mobile storage 270, which may comprise local, network, or cloud storage, for example, is another example of computer-readable storage media. Additional types of computer-readable storage media that may be present include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical storage, magnetic disks or other magnetic storage devices, or any other medium which may be used to store the desired information and which may be accessed by the processors 210, 212 . . . 218.
In contrast, communication media typically embodies computer readable instructions, data structures, program modules, or other data in a modulated data signal, such as a carrier wave, or other transport mechanism.
While the subject matter has been described above in the general context of computer-executable instructions of a computer program that runs on a computer and/or computers, those skilled in the art will recognize that the subject matter also may be implemented in combination with other program modules. Generally, program modules include routines, programs, components, data structures, and the like, which perform particular tasks and/or implement particular abstract data types.
Moreover, those skilled in the art will appreciate that the innovative techniques can be practiced with other computer system configurations, including single-processor or multiprocessor computer systems, mini-computing devices, mainframe computers, as well as personal computers, hand-held or mobile computing devices, microprocessor-based or programmable consumer or industrial electronics, and the like. The illustrated aspects may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. However, some, if not all, aspects of the disclosure can be practiced on stand-alone computers. In a distributed computing environment, program modules may be located in both local and remote memory storage devices.
As discussed above, the techniques, components, and devices described herein with respect to the implementations are not limited to the illustrations of
Conclusion
While various discrete embodiments have been described throughout, the individual features of the various embodiments may be combined to form other embodiments not specifically described. The embodiments formed by combining the features of described embodiments are also within the scope of the disclosure.