Methods, systems and computer program products for updating a word embedding model are provided. Aspects include receiving a first data set comprising a relational database having a plurality of words. Aspects also include generating a word embedding model comprising a plurality of word vectors by training a neural network using unsupervised machine learning based on the first data set. Each word vector of the plurality of word vectors corresponds to a unique word of the plurality of words. Aspects also include storing the plurality of word vectors and a representation of a hidden layer of the neural network. Aspects also include receiving a second data set comprising data that has been added to the relational database. Aspects also include updating the word embedding model based on the second data set and the stored representation of the hidden layer of the neural network.
|
1. A computer-implemented method comprising:
receiving a first data set comprising a relational database having a plurality of words;
generating, by training a neural network using unsupervised machine learning based on the first data set, a word embedding model comprising a plurality of word vectors, each word vector of the plurality of word vectors corresponding to a unique word of the plurality of words;
storing the plurality of word vectors;
storing a representation of a hidden layer of the neural network;
receiving a second data set, wherein the second data set comprises data that has been added to the relational database; and
updating, based on the second data set and the stored representation of the hidden layer of the neural network, the word embedding model.
9. A system comprising:
a processor communicatively coupled to a memory, the processor configured to:
receive a first data set comprising a relational database having a plurality of words;
generate, by training a neural network using unsupervised machine learning based on the first data set, a word embedding model comprising a plurality of word vectors, each word vector of the plurality of word vectors corresponding to a unique word of the plurality of words;
store the plurality of word vectors;
store a representation of a hidden layer of the neural network;
receive a second data set, wherein the second data set comprises data that has been added to the relational database; and
update, based on the second data set and the stored representation of the hidden layer of the neural network, the word embedding model.
13. A computer program product comprising a computer readable storage medium having program instructions embodied therewith, the program instructions executable by a computer processor to cause the computer processor to perform a method comprising:
receiving a first data set comprising a relational database having a plurality of words;
generating, by training a neural network using unsupervised machine learning based on the first data set, a word embedding model comprising a plurality of word vectors, each word vector of the plurality of word vectors corresponding to a unique word of the plurality of words;
storing the plurality of word vectors;
storing a representation of a hidden layer of the neural network;
receiving a second data set, wherein the second data set comprises data that has been added to the relational database; and
updating, based on the second data set and the stored representation of the hidden layer of the neural network, the word embedding model.
18. A computer-implemented method comprising:
receiving streaming data;
continuously storing the streaming data as it is received;
responsive to storing a first set of streaming data and determining that the first set of streaming data comprises an amount of data that exceeds a first predetermined threshold, generating, by training a neural network using unsupervised machine learning based on the first set of streaming data, a word embedding model comprising a plurality of word vectors, each word vector of the plurality of word vectors corresponding to a unique word of a plurality of words of the first set of streaming data;
storing the plurality of word vectors;
storing a representation of a hidden layer of the neural network;
responsive to storing a second set of streaming data, determining that an amount of the second set of streaming data exceeds a second predetermined threshold, wherein the second set of streaming data is received chronologically after the first set of streaming data; and
updating, based on the second set of streaming data and the stored representation of the hidden layer of the neural network, the word embedding model to create a first updated word embedding model.
23. A system comprising:
a processor communicatively coupled to a memory, the processor configured to:
receive streaming data;
continuously store the streaming data as it is received;
responsive to storing a first set of streaming data and determining that the first set of streaming data comprises an amount of data that exceeds a first predetermined threshold, generate, by training a neural network using unsupervised machine learning based on the first set of streaming data, a word embedding model comprising a plurality of word vectors, each word vector of the plurality of word vectors corresponding to a unique word of a plurality of words of the first set of streaming data;
store the plurality of word vectors;
store a representation of a hidden layer of the neural network;
responsive to storing a second set of streaming data, determine that an amount of the second set of streaming data exceeds a second predetermined threshold, wherein the second set of streaming data is received chronologically after the first set of streaming data; and
update, based on the second set of streaming data and the stored representation of the hidden layer of the neural network, the word embedding model to create a first updated word embedding model.
2. The computer-implemented method of claim 1, wherein the relational database comprises a table comprising rows and columns and the second data set comprises a new row that has been added to the table.
3. The computer-implemented method of claim 1, wherein:
training the neural network using unsupervised machine learning based on the first data set comprises determining one or more weights and biases associated with one or more neurons of the hidden layer of the neural network; and
storing the representation of the hidden layer of the neural network comprises storing the determined one or more weights and biases associated with the one or more neurons of the hidden layer.
4. The computer-implemented method of claim 1, wherein the second data set comprises a set of words, each word of the set of words is included in the plurality of words of the relational database, and updating the word embedding model based on the second data set and the stored representation of the hidden layer of the neural network comprises updating a set of word vectors that corresponds to the set of words, wherein the set of word vectors is a subset of the plurality of word vectors.
5. The computer-implemented method of claim 1, wherein the second data set comprises a set of words and one or more new words, the one or more new words are not included in the plurality of words of the relational database, and updating the word embedding model based on the second data set and the stored representation of the hidden layer of the neural network comprises updating a set of word vectors that corresponds to the set of words and generating one or more new word vectors that correspond to the one or more new words, wherein the set of word vectors is a subset of the plurality of word vectors.
6. The computer-implemented method of claim 1, wherein updating the word embedding model comprises updating a portion of the neural network based on the second data set.
7. The computer-implemented method of claim 6, wherein updating the portion of the neural network based on the second data set comprises updating the hidden layer to adjust weights and biases associated with neurons of the hidden layer based on the second data set, the method further comprising storing a representation of the updated hidden layer.
8. The computer-implemented method of claim 1, wherein:
generating the word embedding model based on the first data set comprises applying selected parameters to the first data set and a training of the neural network; and
updating the word embedding model based on the second data set comprises applying the selected parameters to the second data set and an incremental training of the neural network.
10. The system of
11. The system of
12. The system of
14. The computer program product of
15. The computer program product of
16. The computer program product of
17. The computer program product of
19. The computer-implemented method of claim 18, wherein:
training the neural network using unsupervised machine learning based on the first set of streaming data comprises determining one or more weights and biases associated with one or more neurons of the hidden layer of the neural network; and
storing a representation of the hidden layer of the neural network comprises storing the determined one or more weights and biases associated with the one or more neurons of the hidden layer.
20. The computer-implemented method of claim 18, wherein updating the word embedding model to create the first updated word embedding model comprises incrementally training the neural network, based on the second set of streaming data and the stored representation of the hidden layer of the neural network, to adjust one or more weights and biases associated with one or more neurons of the hidden layer of the neural network.
21. The computer-implemented method of claim 20, wherein updating the word embedding model to create the first updated word embedding model comprises updating a portion of the neural network to adjust weights and biases associated with neurons of the hidden layer based on the second set of streaming data, the method further comprising:
storing a representation of an updated hidden layer;
responsive to storing a third set of streaming data, determining that an amount of the third set of streaming data exceeds the second predetermined threshold, wherein the third set of streaming data is received chronologically after the second set of streaming data; and
updating, based on the third set of streaming data and the stored representation of the updated hidden layer of the neural network, the first updated word embedding model to create a second updated word embedding model.
22. The computer-implemented method of claim 21, further comprising:
responsive to receiving a query of the word embedding model during streaming of the streaming data and before updating the word embedding model to create the first updated word embedding model, generating results of the query based on the word embedding model;
responsive to receiving a query of the word embedding model during streaming of the streaming data, after updating the word embedding model to create a first updated word embedding model and before updating the word embedding model to create the second updated word embedding model, generating results of the query based on the first updated word embedding model; and
responsive to receiving a query of the word embedding model during streaming of the streaming data and after updating the word embedding model to create the second updated word embedding model, generating results of the query based on the second updated word embedding model.
24. The system of claim 23, wherein updating the word embedding model to create the first updated word embedding model comprises updating a portion of the neural network to adjust weights and biases associated with neurons of the hidden layer based on the second set of streaming data, and wherein the processor is further configured to:
store a representation of an updated hidden layer;
responsive to storing a third set of streaming data, determine that an amount of the third set of streaming data exceeds the second predetermined threshold, wherein the third set of streaming data is received chronologically after the second set of streaming data; and
update, based on the third set of streaming data and the stored representation of the updated hidden layer of the neural network, the first updated word embedding model to create a second updated word embedding model.
25. The system of claim 24, wherein the processor is further configured to:
responsive to receiving a query of the word embedding model during streaming of the streaming data and before updating the word embedding model to create the first updated word embedding model, generate results of the query based on the word embedding model;
responsive to receiving a query of the word embedding model during streaming of the streaming data, after updating the word embedding model to create a first updated word embedding model and before updating the word embedding model to create the second updated word embedding model, generate results of the query based on the first updated word embedding model; and
responsive to receiving a query of the word embedding model during streaming of the streaming data and after updating the word embedding model to create the second updated word embedding model, generate results of the query based on the second updated word embedding model.
|
The present invention generally relates to word embedding models, and more specifically, to dynamically updating a word embedding model.
Word embedding generally involves a set of language modeling and feature learning techniques in natural language processing (NLP) in which words and phrases from a vocabulary of words are mapped to vectors of real numbers (“word vectors”) comprising a word embedding model. Word embedding models may typically be generated by training a neural network using machine learning based on data from, for example, a relational database. This process requires a large number of computations and thus generally requires a large amount of processing resources and time to generate the resultant word embedding model. Once generated, the word embedding model may then be queried to reveal various relationships between data, such as, for example, determining similarity between entities.
Conventionally, when new data is added to the relational database that served as the basis for the word embedding model, the model must be recreated by repeating the process of training the neural network with all of the data from the relational database. Thus, using conventional methods, a great amount of processing time and resources is expended every time a word embedding model is recreated to incorporate data that was newly added to the underlying relational database that forms the basis of the model. For example, it may take days to retrain the neural network with the augmented data set. In addition to adding significant development time to model generation and increasing utilization of computing resources, such delays also decrease the usefulness of the word embedding models by preventing up-to-date queries from being run against the model. For example, in the time it takes to generate a new word embedding model that incorporates new data added to the underlying relational database, it is possible that more new data has since been added to the underlying relational database, which would mean the resultant word embedding model is not fully up-to-date. An inability to generate a new word embedding model with up-to-date data can limit word embedding model use in various applications, such as applications involving real-time or streaming data.
Embodiments of the present invention include methods, systems, and computer program products for updating a word embedding model. A non-limiting example of a computer-implemented method includes receiving a first data set comprising a relational database having a plurality of words. The method further includes generating a word embedding model comprising a plurality of word vectors by training a neural network using unsupervised machine learning based on the first data set. Each word vector of the plurality of word vectors corresponds to a unique word of the plurality of words. The method further includes storing the plurality of word vectors and a representation of a hidden layer of the neural network. The method further includes receiving a second data set comprising data that has been added to the relational database. The method further includes updating the word embedding model based on the second data set and the stored representation of the hidden layer of the neural network. Advantages can include enabling model-based queries without requiring the query target data to be pre-built into the model. Further advantages include dynamic updating of a word embedding model without incurring the large cost of allocating the processing resources required to train the original model, and avoidance of the significant time delay incurred by retraining the model.
In addition to one or more of the features described above or below, or as an alternative, further embodiments may include the relational database being a table comprising rows and columns and the second data set being a new row that has been added to the table. Advantages can also include providing support for updating of homogeneous database data.
In addition to one or more of the features described above or below, or as an alternative, further embodiments may include that training the neural network using unsupervised machine learning based on the first data set includes determining one or more weights and biases associated with one or more neurons of the hidden layer of the neural network and storing a representation of the hidden layer of the neural network includes storing the determined one or more weights and biases associated with the one or more neurons of the hidden layer. Advantages can also include providing a user with the ability to train on an existing model as the base, rather than generating a new model.
In addition to one or more of the features described above or below, or as an alternative, further embodiments may include that the second data set is a set of words and each word of the set of words is included in the plurality of words of the relational database, and that updating the word embedding model based on the second data set and the stored representation of the hidden layer of the neural network includes updating a set of word vectors that corresponds to the set of words, wherein the set of word vectors is a subset of the plurality of word vectors. Advantages can also include limiting the processing required to update the model by limiting the update to portions of the neural network that are associated with the words included in the additional relational database data.
In addition to one or more of the features described above or below, or as an alternative, further embodiments may include that the second data set is a set of words and one or more new words, wherein the one or more new words are not included in the plurality of words of the relational database, and updating the word embedding model based on the second data set and the stored representation of the hidden layer of the neural network includes updating a set of word vectors that corresponds to the set of words and generating one or more new word vectors that correspond to the one or more new words, wherein the set of word vectors is a subset of the plurality of word vectors. Advantages can also include limiting the processing required to update the model by limiting the update to portions of the neural network that are associated with the words included in the additional relational database data and portions required to add the new one or more words.
In addition to one or more of the features described above or below, or as an alternative, further embodiments may include updating the word embedding model by updating a portion of the neural network based on the second data set. Advantages can also include limiting the processing required to update the model by limiting the update to a portion of the neural network.
In addition to one or more of the features described above or below, or as an alternative, further embodiments may include that updating a portion of the neural network based on the second data set comprises updating the hidden layer to adjust weights and biases associated with neurons of the hidden layer based on the second data set, and that the method also includes storing a representation of the updated hidden layer. Advantages can also include providing the ability to incrementally update the word embedding model with further new data without retraining the model.
In addition to one or more of the features described above or below, or as an alternative, further embodiments may include that generating the word embedding model based on the first data set includes applying selected parameters to the first data set and a training of the neural network and that updating the word embedding model based on the second data set includes applying the selected parameters to the second data set and an incremental training of the neural network. Advantages can also include maintaining consistency of data integration during an update of the model.
Embodiments of the present invention include methods, systems, and computer program products for updating a word embedding model based on streaming data. A non-limiting example of a computer-implemented method includes receiving streaming data and continuously storing the streaming data as it is received. The method includes, responsive to storing a first set of streaming data and determining that the first set of streaming data includes an amount of data that exceeds a first predetermined threshold, generating a word embedding model by training a neural network using unsupervised machine learning based on the first set of streaming data. The word embedding model includes a plurality of word vectors, and each word vector of the plurality of word vectors corresponds to a unique word of a plurality of words of the first set of streaming data. The method includes storing the plurality of word vectors and a representation of a hidden layer of the neural network. In response to storing a second set of streaming data, the method includes determining that an amount of the second set of streaming data exceeds a second predetermined threshold. The second set of streaming data is received chronologically after the first set of streaming data. The method further includes updating the word embedding model to create a first updated word embedding model based on the second set of streaming data and the stored representation of the hidden layer of the neural network. Advantages can include allowing automatic word embedding model updating in near real time based on streaming data.
In addition to one or more of the features described above or below, or as an alternative, further embodiments may include that training the neural network using unsupervised machine learning based on the first set of streaming data includes determining one or more weights and biases associated with one or more neurons of the hidden layer of the neural network and that storing a representation of the hidden layer of the neural network includes storing the determined one or more weights and biases associated with the one or more neurons of the hidden layer. Advantages can also include providing the system with the ability to train on an existing model as the base, rather than generating a new model.
In addition to one or more of the features described above or below, or as an alternative, further embodiments may include that updating the word embedding model to create a first updated word embedding model includes incrementally training the neural network based on the second set of streaming data and the stored representation of the hidden layer of the neural network to adjust one or more weights and biases associated with one or more neurons of the hidden layer of the neural network. Advantages can also include limiting the processing required to update the model by limiting the update to portions of the neural network that are associated with the words included in the additional streaming data.
In addition to one or more of the features described above or below, or as an alternative, further embodiments may include that updating the word embedding model to create a first updated word embedding model comprises updating a portion of the neural network to adjust weights and biases associated with neurons of the hidden layer based on the second set of streaming data, and that the method further includes: storing a representation of an updated hidden layer; responsive to storing a third set of streaming data, determining that an amount of the third set of streaming data exceeds the second predetermined threshold; and updating, based on the third set of streaming data and the stored representation of the updated hidden layer of the neural network, the first updated word embedding model to create a second updated word embedding model. The third set of streaming data is received chronologically after the second set of streaming data. Advantages can also include automatic iterative updating of the word embedding model based on streaming data.
In addition to one or more of the features described above or below, or as an alternative, further embodiments may include: responsive to receiving a query of the word embedding model during streaming of the streaming data and before updating the word embedding model to create the first updated word embedding model, generating results of the query based on the word embedding model; responsive to receiving a query of the word embedding model during streaming of the streaming data, after updating the word embedding model to create the first updated word embedding model and before updating the word embedding model to create the second updated word embedding model, generating results of the query based on the first updated word embedding model; and responsive to receiving a query of the word embedding model during streaming of the streaming data and after updating the word embedding model to create the second updated word embedding model, generating results of the query based on the second updated word embedding model. Advantages can also include enabling a user to query a word embedding model derived from streaming data and to receive results that are based on an updated model that incorporates near real time data.
A system for updating a word embedding model includes a memory having computer readable instructions and a processor for executing the computer readable instructions to perform the steps of the computer-implemented method described above. A computer program product for updating a word embedding model includes a computer readable storage medium having program instructions embodied therewith, the program instructions executable to perform the steps of the computer-implemented method described above. A system for updating a word embedding model based on streaming data includes a memory having computer readable instructions and a processor for executing the computer readable instructions to perform the steps of the computer-implemented method described above. A computer program product for updating a word embedding model based on streaming data includes a computer readable storage medium having program instructions embodied therewith, the program instructions executable to perform the steps of the computer-implemented method described above.
Additional technical features and benefits are realized through the techniques of the present invention. Embodiments and aspects of the invention are described in detail herein and are considered a part of the claimed subject matter. For a better understanding, refer to the detailed description and to the drawings.
The specifics of the exclusive rights described herein are particularly pointed out and distinctly claimed in the claims at the conclusion of the specification. The foregoing and other features and advantages of the embodiments of the invention are apparent from the following detailed description taken in conjunction with the accompanying drawings in which:
The diagrams depicted herein are illustrative. There can be many variations to the diagram or the operations described therein without departing from the spirit of the invention. For instance, the actions can be performed in a differing order or actions can be added, deleted or modified. Also, the term “coupled” and variations thereof describes having a communications path between two elements and does not imply a direct connection between the elements with no intervening elements/connections between them. All of these variations are considered a part of the specification.
In the accompanying figures and following detailed description of the disclosed embodiments, the various elements illustrated in the figures are provided with two or three digit reference numbers. With minor exceptions, the leftmost digit(s) of each reference number correspond to the figure in which its element is first illustrated.
Various embodiments of the invention are described herein with reference to the related drawings. Alternative embodiments of the invention can be devised without departing from the scope of this invention. Various connections and positional relationships (e.g., over, below, adjacent, etc.) are set forth between elements in the following description and in the drawings. These connections and/or positional relationships, unless specified otherwise, can be direct or indirect, and the present invention is not intended to be limiting in this respect. Accordingly, a coupling of entities can refer to either a direct or an indirect coupling, and a positional relationship between entities can be a direct or indirect positional relationship. Moreover, the various tasks and process steps described herein can be incorporated into a more comprehensive procedure or process having additional steps or functionality not described in detail herein.
The following definitions and abbreviations are to be used for the interpretation of the claims and the specification. As used herein, the terms “comprises,” “comprising,” “includes,” “including,” “has,” “having,” “contains” or “containing,” or any other variation thereof, are intended to cover a non-exclusive inclusion. For example, a composition, a mixture, process, method, article, or apparatus that comprises a list of elements is not necessarily limited to only those elements but can include other elements not expressly listed or inherent to such composition, mixture, process, method, article, or apparatus.
Additionally, the term “exemplary” is used herein to mean “serving as an example, instance or illustration.” Any embodiment or design described herein as “exemplary” is not necessarily to be construed as preferred or advantageous over other embodiments or designs. The terms “at least one” and “one or more” may be understood to include any integer number greater than or equal to one, i.e. one, two, three, four, etc. The terms “a plurality” may be understood to include any integer number greater than or equal to two, i.e. two, three, four, five, etc. The term “connection” may include both an indirect “connection” and a direct “connection.”
The terms “about,” “substantially,” “approximately,” and variations thereof, are intended to include the degree of error associated with measurement of the particular quantity based upon the equipment available at the time of filing the application. For example, “about” can include a range of ±8% or 5%, or 2% of a given value.
For the sake of brevity, conventional techniques related to making and using aspects of the invention may or may not be described in detail herein. In particular, various aspects of computing systems and specific computer programs to implement the various technical features described herein are well known. Accordingly, in the interest of brevity, many conventional implementation details are only mentioned briefly herein or are omitted entirely without providing the well-known system and/or process details.
As described above, generating word embedding models by training a neural network based on relational databases generally requires a large amount of processing time and resources. For example, the process of training a neural network to generate a word embedding model may take days. Unsurprisingly, this can significantly inhibit the development and use of such models. Further, for real world applications, it is often expected that the source data used to generate the word embedding model is not static, but rather will be augmented over time. For example, new data entries may be added to a relational database that is used to train a word embedding model. It may be desirable to update the word embedding model to incorporate such newly added data; however, conventional techniques for doing so require completely retraining the word embedding model with the entirety of the relational database data (i.e., both the old and new relational database data), which incurs the high processing cost of training the model and introduces significant delay in the availability of the updated model for use in querying.
The present disclosure solves the problem of incorporating new data into a word embedding model without the need to entirely retrain the model by providing novel techniques for incrementally and/or dynamically updating a word embedding model by generating an update to the model based on the new data and a stored portion of the previously trained neural network. As disclosed herein, computer-implemented methods enable relational databases to capture and exploit semantic contextual similarities using standard SQL queries and a class of SQL-based queries known as Cognitive Intelligence (CI) queries. For the results of such CI queries to remain useful, when new data is introduced to the underlying relational database from which the word embedding model was derived, it is necessary to account for the new data in the model as well. Thus, the word embedding model updating techniques described herein can provide an updated word embedding model that takes newly added data into account so that users may query the updated model in a timely fashion, relative to when the new data was added to the underlying relational database. Further, a word embedding model can be iteratively and incrementally updated using the disclosed techniques, which enables the model to be continually updated without repeatedly incurring the cost of retraining the entire model.
The techniques described herein allow the process of generating a word embedding model that incorporates and reflects newly added data to be performed significantly faster and using less processing power than conventional methods. Accordingly, the disclosed solution provides technical advantages of significantly reducing the amount of computer processing resources and time needed to generate a word embedding model that is reflective of newly added data. These techniques also provide the additional advantage of reducing the overall development time of a word embedding model. For example, a developer may often have to “guess and check” when selecting parameters of a word embedding model that will generate meaningful results.
In some instances where a developer is trying to determine whether a given set of model parameters yields a model with meaningful results, it may be helpful for the developer to add new data to the model to assess the impact of the new data on the meaningfulness of the results in order to assess whether the model parameters should be changed. However, iteratively adding new data to a word embedding model during development may be cost prohibitive under conventional approaches, as the developer may not be able to afford to wait multiple days between each update of the model. Accordingly, the techniques disclosed herein also allow for improved word embedding model development.
The incremental updating of word embedding models may be particularly useful in the context of a model-based query that returns a similarity or dissimilarity result set, where the model does not contain the data that identifies what the user desires the result set to be similar or dissimilar to, such as, for example, querying a model of known criminals with a witness description of a particular suspect who was not previously part of the data of the relational database. In this case, the techniques described herein may allow the model to be quickly updated so that a potential list of matching suspects may be generated in a timely fashion without having to wait days for the model to be retrained to incorporate the new suspect data. The disclosed techniques can provide the further benefit of allowing new applications of word embedding models that were previously unrealistic, such as, for example, applications based on real-time, near real-time, or streaming data, which may require the word embedding model to be iteratively updated with new data in a short time frame.
For a given relational database, such as a database containing information about employees of a specific company, typical SQL queries only return a result if there is a match for the query. For example, if a query requests information for employee A, such as salary, title, etc., an answer is returned only if there is an employee A. However, using CI queries, an answer may be returned by examining the relationships among the words embedded in the database, that is, by querying a word embedding model developed from the database. For traditional SQL purposes, attributes such as name, age, gender, title, etc., are treated as independent, and the latent relationships among them are not exploited by the query.
Some embodiments of the present disclosure use word embedding, which is an unsupervised machine learning technique from natural language processing (NLP), to extract latent information. Disclosed techniques may also be applicable to other data models such as Multidimensional online analytical processing (MOLAP), JavaScript Object Notation (JSON), eXtensible Markup Language (XML), comma-separated value (CSV) files, spreadsheets, etc.
In word embedding, a d-dimensional vector space is fixed. Each word in a text corpus (e.g., a collection of documents) is associated with a d-dimensional vector of real numbers. The assignment of words to vectors should be such that the vectors encode the meaning of the words. Ideally, if two words are closely related (i.e., have similar meaning), their vectors should point in similar directions; in other words, the cosine similarity between their vectors should be relatively high. By closely related words we mean words that often appear together in the text corpus, and by appear together, we mean within close proximity. Conversely, if words are unrelated, the cosine similarity between their vectors should be relatively small. Some refinements of the calculation of closeness weigh the proximity and/or consider grammar rules.
Over the last few decades, a number of methods have been introduced for computing vector representations of words in a natural language, such as word2vec or GloVe. Recently, word2vec has gained prominence because the vectors it produces appear to capture syntactic (e.g., present-past, singular-plural) as well as semantic closeness of words. One application of word2vec-produced vectors was in solving analogy problems, such as “king is to man as what is to woman?” (answer: queen), by using vector algebra calculations.
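As a purely illustrative sketch of these vector algebra calculations (the toy vectors and vocabulary below are invented for the example and are not taken from any trained model):

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two word vectors; higher means more related."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 3-dimensional vectors invented for illustration; real models use
# vector spaces of hundreds of dimensions learned from a corpus.
vectors = {
    "king":  np.array([0.80, 0.65, 0.10]),
    "man":   np.array([0.70, 0.10, 0.20]),
    "woman": np.array([0.75, 0.15, 0.80]),
    "queen": np.array([0.85, 0.70, 0.70]),
    "apple": np.array([0.10, 0.90, 0.05]),
}

# Analogy by vector algebra: king - man + woman should land near queen.
target = vectors["king"] - vectors["man"] + vectors["woman"]
candidates = (w for w in vectors if w not in ("king", "man", "woman"))
print(max(candidates, key=lambda w: cosine_similarity(target, vectors[w])))  # queen
```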
Vectors may be produced by either learning on the database itself or using external text or vector sources. In the relational database context, one way of generating vectors is to apply the word embedding method to a token sequence generated from the database: each row would correspond to a sentence and a relation would correspond to a document. Thus, vectors enable a dual view of the data: relational and (meaningful) text. Word embedding then may extract latent semantic information in terms of word associations and co-occurrences and encode it in word vectors. Thus, the vectors first capture inter- and intra-attribute relationships within a row (sentence) and then aggregate these relationships across the document to compute the collective semantic relationships. The encoded semantic information then may be used in querying the database. Some embodiments of the present invention integrate word embedding techniques and capabilities into traditional database systems.
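The row-to-sentence mapping can be sketched as follows; the table, column names, and token scheme are hypothetical assumptions rather than the disclosed implementation:

```python
# Hypothetical textification: convert relational rows into token "sentences"
# suitable for word embedding training.
rows = [
    {"empNum": "119", "firstName": "John", "dept": "Multimedia", "salary": "120000"},
    {"empNum": "120", "firstName": "Jane", "dept": "Accounting", "salary": "95000"},
]

def textify(row: dict) -> list[str]:
    """Convert one relational row into one 'sentence' of tokens.

    Prefixing each value with its column name keeps tokens from different
    columns distinct (e.g., 119 as an employee number vs. 119 as a salary).
    """
    return [f"{col}_{val.replace(' ', '_')}" for col, val in row.items()]

corpus = [textify(row) for row in rows]
print(corpus[0])  # ['empNum_119', 'firstName_John', 'dept_Multimedia', 'salary_120000']
```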
Exemplary steps for enhancing a system 100 with the cognitive capabilities enabled by word vectors will be described below.
By way of introduction and overview only, the following example assumes that the fields of a relational database are populated with information, e.g., relating to employees of a company.
Referring now to an exemplary process for building such a model, at step 202, the relational database data is textified, i.e., converted into meaningful text comprising sequences of tokens.
Which rows or columns are textified (i.e., made into a sequence of tokens) may be controlled by defining a view using standard relational operations. The meaning of a word (i.e., token) can be inferred from its neighbors; the neighborhood context contributes to the overall meaning of the word. A meaning of a database token can be determined from other tokens in the row; the columns of a row, in turn, are determined by the schema of its view.
For example, meaningful data can be extracted and a model created by mapping, e.g., converting a relational row to a sentence.
At step 204, machine learning is used to produce word vectors for all words (tokens, items) in the text. For example, an algorithm can compute word vector representations for all words (optionally excluding header words) in the meaningful text. In some embodiments, an external source (or corpus) can also be used for model training.
At step 206, the word vectors are stored for usage in queries. In some embodiments, word vectors include a vector for each token in the meaningful text. At step 208, vectors produced from other text sources (see, e.g., step 204) can optionally be included for use in queries.
At step 210, cognitive intelligence (CI) queries are used to produce database relation results. In some embodiments, CI queries can be expressed using standard SQL. Some embodiments enable CI queries using the word vectors in the vector space as user-defined functions (UDFs). Upon completion of step 210, the process exits.
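As a rough illustration of the idea (not the disclosed implementation), a cosine-similarity UDF can be registered with an off-the-shelf database such as SQLite; the table, toy vectors, and similarity threshold below are assumptions:

```python
import sqlite3
import numpy as np

# Toy word vectors standing in for a trained word embedding model.
word_vectors = {
    "Multimedia": np.array([0.90, 0.10]),
    "Graphics":   np.array([0.85, 0.20]),
    "Accounting": np.array([0.10, 0.95]),
}

def similarity(word_a: str, word_b: str) -> float:
    u, v = word_vectors[word_a], word_vectors[word_b]
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

conn = sqlite3.connect(":memory:")
conn.create_function("similarity", 2, similarity)  # register the UDF
conn.execute("CREATE TABLE emp (name TEXT, dept TEXT)")
conn.executemany("INSERT INTO emp VALUES (?, ?)",
                 [("John", "Multimedia"), ("Jane", "Accounting"), ("Ken", "Graphics")])

# CI-style query: employees whose department is semantically close to 'Multimedia'.
for name, dept in conn.execute(
        "SELECT name, dept FROM emp WHERE similarity(dept, 'Multimedia') > 0.9"):
    print(name, dept)  # John Multimedia / Ken Graphics
```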
However, in many real life applications, relational databases that serve as the source data for a word embedding model may routinely be augmented with additional data. For example, one or more new rows may be added to the relational database after the word embedding model has been generated. Accordingly, an exemplary method 600 for updating a word embedding model begins with receiving a first data set comprising a relational database having a plurality of words (e.g., as shown at block 602).
Next, as shown at block 604, the method 600 includes generating a word embedding model comprising a plurality of word vectors. Each word vector of the plurality of word vectors corresponds to a unique word (i.e., an entity) of the plurality of words.
According to embodiments of the disclosure, a word embedding model may be generated from relational database data using an unsupervised approach based on the Word2Vec (W2V) implementation. As will be appreciated by those of skill in the art, unsupervised learning does not require a correct answer associated with each input pattern in the training data set, but rather explores the underlying structure in the data, or correlations between patterns in the data, and organizes patterns into categories from these correlations. The training approach may operate on the unstructured text corpus.
During the training process, the classical W2V implementation uses a simplified 3-layer shallow neural network that views the input text corpus as a sequence of sentences.
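For concreteness, a minimal training sketch using the off-the-shelf gensim Word2Vec implementation is shown below; the corpus and hyperparameter values are illustrative assumptions:

```python
from gensim.models import Word2Vec

# Textified rows (see the earlier sketch) serve as the training corpus.
corpus = [
    ["empNum_119", "firstName_John", "dept_Multimedia", "salary_120000"],
    ["empNum_120", "firstName_Jane", "dept_Accounting", "salary_95000"],
]

model = Word2Vec(
    sentences=corpus,
    vector_size=100,  # dimension d of the word vectors
    window=5,         # neighborhood window of nearby words
    min_count=1,      # keep every token, even if it appears only once
    sg=1,             # 1 = skip-gram; 0 = continuous bag of words
    epochs=10,
)

vector = model.wv["dept_Multimedia"]  # the learned d-dimensional word vector
```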
ANNs are often embodied as so-called “neuromorphic” systems of interconnected processor elements that act as simulated “neurons” and exchange “messages” between each other in the form of electronic signals. Similar to the so-called “plasticity” of synaptic neurotransmitter connections that carry messages between biological neurons, the connections in ANNs that carry electronic messages between simulated neurons are provided with numeric weights that correspond to the strength or weakness of a given connection. The weights can be adjusted and tuned based on experience, making ANNs adaptive to inputs and capable of learning. For example, an ANN for handwriting recognition is defined by a set of input neurons which can be activated by the pixels of an input image. After being weighted and transformed by a function determined by the network's designer, the activations of these input neurons are then passed to other downstream neurons, which are often referred to as “hidden” neurons. This process is repeated until an output neuron is activated. The activated output neuron determines which character was read.
In one or more examples, weight elements are stored in a weight storage element such as a capacitor and read by a weight reader, such as a field effect transistor (FET). Alternatively, or in addition, the weight storage elements can be digital counters (e.g. J-K flip-flop based counters), a memory storage device, or any other electronic circuit that can be used for storing the weight. Here, “weight” refers to a computational value being used during computations of an ANN as described further.
For each word in a sentence, the W2V code defines a neighborhood window to compute the contributions of nearby words. Unlike deep learning based classifiers, the output of W2V is a set of vectors of real values of dimension d, one for each unique token in the training set (the vector space dimension d is independent of the token vocabulary size). According to some embodiments, a text token in a training set can represent either text, numeric, or image data. Thus, a word embedding model generated in accordance with the disclosure may build a joint latent representation that integrates information across different modalities using untyped uniform feature (or meaning) vectors.
According to some embodiments, training the neural network using unsupervised machine learning based on the first data set can include determining one or more weights and/or biases associated with one or more neurons of the hidden layer of the neural network and storing a representation of the hidden layer of the neural network comprises storing the determined one or more weights and/or biases associated with the one or more neurons of the hidden layer. In some embodiments, the neural network may include multiple hidden layers, and thus in some embodiments storing the hidden layers may include storing data (e.g., weights, biases, etc.) associated with each of the hidden layers.
Next, as shown at block 606, the method 600 includes storing the plurality of word vectors. The stored word vectors represent the word embedding model, which may then be queried to determine, for example, entity similarity, dissimilarity, analogy, OLAP (online analytical processing) aggregations, and other such query types.
Next, as shown at block 608, the method 600 includes storing a representation of a hidden layer of the neural network. According to some embodiments, storing a representation of the hidden layer can include storing the input values to the hidden layer, the weights and biases associated with the hidden layer, and/or the outputs of the hidden layer. According to some embodiments, if there are multiple hidden layers in the artificial neural network, storing a representation of the hidden layer can include storing the inputs, weights, biases and outputs associated with each layer (e.g., respectively associated with each neuron of a layer) of the hidden layers, where the outputs of one hidden layer may be the inputs of a next hidden layer. In some embodiments, the activation/transformation functions associated with one or more neurons of the hidden layer(s) may also be stored. The hidden layer can be stored by, for example, a memory of a processing system, such as the processing system 1100 described below.
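A minimal sketch of persisting such a representation, assuming the weights and biases are available as NumPy arrays (the file name and matrix shapes are invented for the example):

```python
import numpy as np

def save_hidden_layer(path: str, w_in: np.ndarray, w_out: np.ndarray,
                      biases: np.ndarray) -> None:
    """Persist the hidden-layer representation for later incremental updates."""
    np.savez(path, w_in=w_in, w_out=w_out, biases=biases)

def load_hidden_layer(path: str) -> tuple[np.ndarray, np.ndarray, np.ndarray]:
    state = np.load(path)
    return state["w_in"], state["w_out"], state["biases"]

# Illustrative shapes: vocabulary of 5,000 tokens, hidden layer of d = 100.
w_in = np.random.rand(5000, 100)    # input-to-hidden weights (one row per token)
w_out = np.random.rand(100, 5000)   # hidden-to-output weights
biases = np.zeros(100)
save_hidden_layer("hidden_layer.npz", w_in, w_out, biases)
w_in, w_out, biases = load_hidden_layer("hidden_layer.npz")
```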
Next, as shown at block 610, the method 600 includes receiving a second data set. The second data set comprises data that has been added to the relational database. For example, in some embodiments, the second data set may be a new row that has been added to the relational database or table, such as, for example, a new row 502 added to an example relational database 402.
Next, as shown at block 612, the method 600 includes updating the word embedding model based on the second data set and the stored representation of the hidden layer of the neural network. In some embodiments, the second data set may be a set of words and each word of the set of words can be included in the plurality of words of the relational database. In other words, in some embodiments, every word of the second data set may already be represented in the word embedding model, in which case updating the word embedding model can include updating the set of word vectors that corresponds to the set of words, the set of word vectors being a subset of the plurality of word vectors.
According to some embodiments, the second data set may include a set of words and one or more new words, wherein the one or more new words are words that are not included in the plurality of words of the relational database. For example, a new row added to the relational database may include one or more values that do not appear elsewhere in the database; in such cases, updating the word embedding model can include updating the word vectors corresponding to previously seen words and generating one or more new word vectors that correspond to the one or more new words.
According to some embodiments, updating the word embedding model comprises updating a portion of the neural network based on the second data set. For example, in some embodiments a new neural network may be built based on the new data and the saved hidden layer(s) of the initial/previous neural network used to generate the initial/previous word embedding model. If new words are present in the new data, more nodes are added to the input layer of the neural network. The new neural network with the new words (if any) can be trained repeatedly and one or more word vectors are updated to reflect inclusion of the new data in the model. In some embodiments, the portion of the neural network may be a relatively small portion of the entire neural network. For example, in some embodiments, updating a portion of the neural network based on the second data set can include updating the hidden layer of the neural network to adjust weights and/or biases associated with neurons of the hidden layer based on the second data set. The method can include storing a representation of the updated hidden layer. This process can be performed iteratively so that every time a new set of data is added to the relational database, the relevant portion of the word embedding model (i.e., the word vectors including words overlapping with words of the new data and any other word vectors that are tangentially affected by changes to the hidden layer of the neural network) is updated as described above, and the updated hidden layer of the neural network is saved as the starting point for the next update upon receiving yet another new set of data to the relational database. In this manner, the word embedding model can be iteratively and incrementally updated to incorporate new data added to the relational database without having to retrain the model with all of the data of the updated relational database, thus saving large amounts of processing resources and computational time.
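One way to realize such an incremental update, sketched here with gensim's vocabulary-update mode under the assumption that the original model was saved to disk (file names and tokens are illustrative):

```python
from gensim.models import Word2Vec

# Load the model trained on the first data set (file name is illustrative).
model = Word2Vec.load("embedding_model.model")

# Textified new row added to the relational database (tokens are illustrative).
new_sentences = [["empNum_121", "firstName_Ken", "dept_Graphics", "salary_101000"]]

model.build_vocab(new_sentences, update=True)  # add input nodes for any new words
model.train(new_sentences,
            total_examples=len(new_sentences),
            epochs=model.epochs)               # reuse the original epoch count
model.save("embedding_model.model")            # starting point for the next update
```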
In some embodiments, generating the word embedding model based on the first data set may include applying selected parameters to the first data set and/or the training of the neural network, such as, for example, preprocessing applied to the data of the relational database or hyperparameters used in the generation of the word embedding model. Selected parameters may include, but are not limited to, for example, a selection of columns of the relational database for inclusion in generation of the word embedding model, a selection of algorithms for determining relationships between words (e.g., bag of words, skip gram, etc.), the number of iterations, debugging parameters, window of analysis parameters (e.g., words before and after a given word), and other such parameters used in word embedding model generation. Preprocessing parameters may include, for example, transforming data (e.g., transforming images, numbers, and/or other data formats to text) to a common format for comparison and clustering methods applied to the data of the relational database. For example, numeric values in the “Salary” category of example relational database 402 may be clustered into ranges so that similar salaries are treated as related tokens.
Hyperparameters can be parameters specified by a designer that impact aspects of the neural network training, such as for example, the number of layers, the size of each layer, the number of connections, how many iterations are used to generate the model, which algorithms are applied to determine relationships between words, debugging parameters, subsampling, window size, and the like. Generally, it is beneficial to perform an update to the word embedding model using the same preprocessing methods, parameters and/or hyperparameters used to generate the original model. Accordingly, in some embodiments, the method may include storing the preprocessing methods, parameters and/or hyperparameters used to generate the word embedding model based on the first data set and applying the stored preprocessing methods, parameters and/or hyperparameters to the new data (e.g., the second data set) and/or the incremental training of the neural network as applicable.
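A minimal sketch of persisting and reusing such settings, assuming they are serialized as JSON (the keys and values are illustrative):

```python
import json

# Illustrative parameter set; the keys and values are assumptions.
params = {
    "columns": ["firstName", "lastName", "dept", "salary"],
    "algorithm": "skip-gram",
    "vector_size": 100,
    "window": 5,
    "epochs": 10,
    "salary_cluster_edges": [0, 50000, 100000, 150000],  # numeric bucketing
}

with open("training_params.json", "w") as f:
    json.dump(params, f)

# Later, before an incremental update, reload and reapply the same settings.
with open("training_params.json") as f:
    params = json.load(f)
```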
Turning now to an exemplary method 1000 for updating a word embedding model based on streaming data, the method 1000 begins with receiving streaming data (e.g., as shown at block 1002). Next, as shown at block 1004, the method 1000 includes continuously storing the streaming data as it is received. For example, streaming data may be stored in a relational database as it is received. According to some embodiments, the streaming data may be converted and stored in a format for use in creating and/or updating a word embedding model. In some embodiments, streaming data can be stored as structured data in, for example, a relational database. In some embodiments, data can be stored as unstructured data (e.g., social media text).
Next, as shown at block 1006, the method 1000 includes, in response to storing a first set of streaming data and determining that the first set of streaming data comprises an amount of data that exceeds a first predetermined threshold, generating a word embedding model comprising a plurality of word vectors. Each word vector of the plurality of word vectors may correspond to a unique word of the plurality of words. The word embedding model can be generated by training a neural network using unsupervised machine learning based on the first set of streaming data, in a manner similar to that previously described above. According to some embodiments, a first predetermined threshold can represent a minimum amount of data specified for creating an initial word embedding model. In other words, a designer of the system may determine that a minimum amount of data is needed before an initial model may have any value, so the system may simply acquire and store streaming data until it reaches the threshold, at which point the system may then generate an initial word embedding model based on the first set of streaming data.
According to some embodiments, training the neural network using unsupervised machine learning based on the first set of streaming data may include determining one or more weights and/or biases associated with one or more neurons of the hidden layer of the neural network.
Next, as shown at blocks 1008 and 1010, the method 1000 includes storing the plurality of word vectors and storing a representation of a hidden layer of the neural network. The word vectors and representation of the hidden layer of the neural network may be stored in a manner similar to that described previously above. In some embodiments, storing a representation of the hidden layer of the neural network may include storing the one or more weights and/or biases associated with the one or more neurons of the hidden layer that are determined during the training of the neural network or that are updated during an incremental update of the neural network.
Next, as shown at block 1012, the method 1000 includes, in response to storing a second set of streaming data, determining that an amount of the second set of streaming data exceeds a second predetermined threshold. The second set of streaming data may be streaming data that is received chronologically after the first set of streaming data. For example, in some embodiments, after the system has stored enough streaming data for the first set of streaming data to serve as the basis for the word embedding model, the subsequently received and stored streaming data may be viewed as the second set of streaming data, up to the point that the size or amount of data of the second set exceeds the second threshold. According to some embodiments, the second threshold may be considerably smaller than the first threshold, as the second threshold may serve as a cut-off to perform an incremental update of the previously trained word embedding model and may thus require significantly less data in order for the updated model to yield meaningful results.
Next, as shown at block 1014, the method 1000 includes updating, based on the second set of streaming data and the stored representation of the hidden layer of the neural network, the word embedding model to create a first updated word embedding model. In some embodiments, updating the word embedding model may be performed in accordance with some or all of the method 600 described above.
According to some embodiments, updating the word embedding model to create a first updated word embedding model can include updating a portion of the neural network to adjust weights and/or biases associated with neurons of the hidden layer based on the second set of streaming data. Generally, updating a portion of the neural network may include updating data (e.g., inputs, outputs, weights, biases, etc.) associated with less than all of the neurons of the hidden layer. In other words, in some embodiments, the system may update only a fraction of the hidden layer to account for the impact of the newly added data on the model. In some cases, for example, if the newly added data is very large, it may be necessary to update the entire neural network. According to some embodiments, the system may update the entire neural network in response to detecting an anomaly in the newly added data, such as, for example, the system determining that clustering has become meaningless in view of the newly added data (e.g., because a large percentage of the data appears in the same cluster). In some embodiments, the method 1000 may further include: storing a representation of an updated hidden layer; responsive to storing a third set of streaming data, determining that an amount of the third set of streaming data exceeds the second predetermined threshold; and updating, based on the third set of streaming data and the stored representation of the updated hidden layer of the neural network, the first updated word embedding model to create a second updated word embedding model. The third set of streaming data may be streaming data that is received chronologically after the second set of streaming data. In this way, the word embedding model can be iteratively updated with each new set of data (e.g., as determined by sequential data sets exceeding or meeting the second predetermined threshold) to allow the system to continually update the word embedding model as more and more streaming data is received. Such continuous updating in real time would be impossible with conventional methods of adding new relational data to a word embedding model because it would require a very long time to retrain the word embedding model from the start (e.g., hours or days), whereas the techniques for updating the word embedding model disclosed herein may allow the word embedding model to be updated in a matter of seconds or minutes.
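The threshold-driven control flow described above might look roughly like the following sketch; the threshold values, record source, and train/update callables are all assumptions:

```python
# Thresholds, the record source, and the train/update callables are assumptions.
FIRST_THRESHOLD = 100_000   # records required before the initial model is built
SECOND_THRESHOLD = 1_000    # records accumulated per incremental update

def process_stream(stream, train_initial_model, update_model):
    """Buffer streaming records and train/update a word embedding model."""
    buffer, model = [], None
    for record in stream:
        buffer.append(record)                      # continuously store the data
        if model is None:
            if len(buffer) >= FIRST_THRESHOLD:     # first predetermined threshold
                model = train_initial_model(buffer)
                buffer = []
        elif len(buffer) >= SECOND_THRESHOLD:      # second predetermined threshold
            model = update_model(model, buffer)    # incremental update
            buffer = []
        # Queries received at any point are answered by the most recent `model`.
    return model
```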
Accordingly, using the techniques described herein, a user may be given the capability to query the word embedding model and receive results that incorporate recent streaming data. Thus, in some embodiments the method 1000 may further include: responsive to receiving a query of the word embedding model during streaming of the streaming data and before updating the word embedding model to create the first updated word embedding model, generating results of the query based on the word embedding model; responsive to receiving a query of the word embedding model during streaming of the streaming data, after updating the word embedding model to create the first updated word embedding model and before updating the word embedding model to create the second updated word embedding model, generating results of the query based on the first updated word embedding model; and responsive to receiving a query of the word embedding model during streaming of the streaming data and after updating the word embedding model to create the second updated word embedding model, generating results of the query based on the second updated word embedding model. Thus, due to the iteratively updating nature of the word embedding model based on the continuously received streaming data, a user may receive near up-to-date results to queries at any time the word embedding model is queried during streaming of the data.
Additional processes may also be included. It should be understood that the processes depicted in the figures represent illustrations, and that other processes may be added or existing processes may be removed, modified, or rearranged without departing from the scope of the present disclosure.
Referring now to the figures, an exemplary processing system 1100 for implementing the teachings herein is described in accordance with one or more embodiments.
In exemplary embodiments, the processing system 1100 includes a graphics processing unit 41. Graphics processing unit 41 is a specialized electronic circuit designed to manipulate and alter memory to accelerate the creation of images in a frame buffer intended for output to a display. In general, graphics processing unit 41 is very efficient at manipulating computer graphics and image processing and has a highly parallel structure that makes it more effective than general-purpose CPUs for algorithms where processing of large blocks of data is done in parallel.
Thus, as configured, the processing system 1100 includes processing capability, storage capability including system memory and mass storage, input means, and output capability that collectively enable the processing system 1100 to implement the techniques described herein.
The present invention may be a system, a method, and/or a computer program product at any possible technical detail level of integration. The computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the present invention.
The computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch-cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.
Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network. The network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.
Computer readable program instructions for carrying out operations of the present invention may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, configuration data for integrated circuitry, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++, or the like, and procedural programming languages, such as the “C” programming language or similar programming languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention.
Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer readable program instructions.
These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable storage medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus or other device to produce a computer implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the blocks may occur out of the order noted in the Figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.
The descriptions of the various embodiments of the present invention have been presented for purposes of illustration, but are not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application or technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments described herein.
Inventors: Harding, Christopher; Warren, Stephen; Bordawekar, Rajesh; Neves, Jose; Conti, Thomas