A method of clustering files, comprising, by a processing unit:
|
6. A method of clustering files, comprising, by a processing unit:
obtaining a plurality of data (Dsignal, 1, . . . , Dsignal, Z) representative of a plurality of files (Dfile, 1, . . . , Dfile, Z) to be clustered,
building a clustering structure comprising a plurality of nodes arranged in hierarchical levels li, with i from 1 to N, wherein each node is representative of a category of files, wherein said category is representative of files sharing similarities,
wherein said building comprises, based on said plurality of data (Dsignal, 1, . . . , Dsignal, Z):
wherein the method comprises, for data which does not meet an acceptance threshold of any node of the level li, creating a new node in the level li, wherein a centroid of said new node is determined based at least on said data,
wherein, during said building, at least part of said plurality of data (Dsignal, 1, . . . , Dsignal, Z) or of said plurality of files (Dfile, 1, . . . , Dfile, Z) is each associated with one or more of the nodes of the clustering structure, thereby reflecting that said node is representative of a category of files and facilitating future identification of a category of a file based on said clustering structure.
10. A system for clustering files, comprising a processing unit configured to:
obtain a clustering structure comprising a plurality of nodes arranged in hierarchical levels li, with i from 1 to N,
wherein each node is representative of a category of files, wherein said category is representative of files sharing similarities,
wherein each node of level li is linked to a parent node of level li−1, with i from 2 to N, thereby indicating that each data belonging to a category represented by said node also belongs to a category represented by said parent node,
wherein each node is associated with at least one acceptance threshold,
wherein each node is associated with at least one centroid representative of files belonging to a category represented by said node,
obtain at least one data (Dsignal) representative of a file (Dfile) to be assigned to a category;
(O1) compare said data to each centroid of each node of the first level,
(O2) when said comparison matches the acceptance threshold of one or more nodes, select a node among these nodes, and when said comparison does not meet an acceptance threshold of any node, create a new node in the first level, wherein a centroid of said new node is determined based at least on Dsignal;
(O3) compare Dsignal to each centroid of each node of a next level which is linked to said selected node,
(O4) when said comparison matches the acceptance threshold of one or more nodes, select a node among these nodes, and when said comparison does not meet an acceptance threshold of any node, create a new node,
wherein, for a level li with i>1, said new node is linked to said selected node,
wherein a centroid of said new node is determined based at least on Dsignal;
repeat O3 and O4 until a stopping condition is met, thereby indicating that said data Dsignal or said file Dfile belongs to a category of files represented by said selected node, wherein a centroid and an acceptance threshold of said selected node are updated based on said data Dsignal.
1. A method of clustering files, comprising, by a processing unit:
obtaining a clustering structure comprising a plurality of nodes arranged in hierarchical levels li, with i from 1 to N,
wherein each node is representative of a category of files, wherein said category is representative of files sharing similarities,
wherein each node of level li is linked to a parent node of level li−1, with i from 2 to N, thereby indicating that each data belonging to a category represented by said node also belongs to a category represented by said parent node,
wherein each node is associated with at least one acceptance threshold,
wherein each node is associated with at least one centroid representative of files belonging to a category represented by said node,
obtaining at least one data (Dsignal) representative of a file (Dfile) to be assigned to a category;
(O1) comparing said data to each centroid of each node of the first level,
(O2) when said comparison matches the acceptance threshold of one or more nodes, selecting a node among these nodes, and when said comparison does not meet an acceptance threshold of any node, creating a new node in the first level, wherein a centroid of said new node is determined based at least on Dsignal;
(O3) comparing Dsignal to each centroid of each node of a next level which is linked to said selected node,
(O4) when said comparison matches the acceptance threshold of one or more nodes, selecting a node among these nodes, and when said comparison does not meet an acceptance threshold of any node, creating a new node, wherein, for a level li with i>1, said new node is linked to said selected node, wherein a centroid of said new node is determined based at least on Dsignal;
repeating O3 and O4 until a stopping condition is met, thereby indicating that said data Dsignal or said file Dfile belongs to a category of files represented by said selected node, wherein a centroid and an acceptance threshold of said selected node are updated based on said data Dsignal.
18. A non-transitory storage device readable by a machine, tangibly embodying a program of instructions executable by the machine to perform a method of clustering files comprising:
obtaining a clustering structure comprising a plurality of nodes arranged in hierarchical levels li, with i from 1 to N,
wherein each node is representative of a category of files, wherein said category is representative of files sharing similarities,
wherein each node of level li is linked to a parent node of level li−1, with i from 2 to N, thereby indicating that each data belonging to a category represented by said node also belongs to a category represented by said parent node,
wherein each node is associated with at least one acceptance threshold,
wherein each node is associated with at least one centroid representative of files belonging to a category represented by said node,
obtaining at least one data (Dsignal) representative of a file (Dfile) to be assigned to a category;
(O1) comparing said data to each centroid of each node of the first level,
(O2) when said comparison matches the acceptance threshold of one or more nodes, selecting a node among these nodes, and when said comparison does not meet an acceptance threshold of any node, creating a new node in the first level, wherein a centroid of said new node is determined based at least on Dsignal;
(O3) comparing Dsignal to each centroid of each node of a next level which is linked to said selected node,
(O4) when said comparison matches the acceptance threshold of one or more nodes, selecting a node among these nodes, and when said comparison does not meet an acceptance threshold of any node, creating a new node,
wherein, for a level li with i>1, said new node is linked to said selected node,
wherein a centroid of said new node is determined based at least on Dsignal;
repeating O3 and O4 until a stopping condition is met, thereby indicating that said data Dsignal or said file Dfile belongs to a category of files represented by said selected node, wherein a centroid and an acceptance threshold of said selected node are updated based on said data Dsignal.
15. A system for clustering files, comprising a processing unit configured to:
obtain a plurality of data (Dsignal, 1, . . . , Dsignal, Z) representative of a plurality of files (Dfile, 1, . . . , Dfile, Z) to be clustered,
build a clustering structure comprising a plurality of nodes arranged in hierarchical levels li, with i from 1 to N, wherein each node is representative of a category of files, wherein said category is representative of files sharing similarities,
wherein said building comprises, based on said plurality of data (Dsignal, 1, . . . , Dsignal, Z):
obtaining one or more nodes of level l1, wherein each node is associated with an acceptance threshold and a centroid representative of files belonging to a category represented by said node,
performing repetitively, for i=2 to N:
building one or more nodes of level li, wherein each node of level li is linked to a parent node of level li−1, with i from 2 to N, thereby indicating that each file belonging to a category represented by said node also belongs to a category represented by said parent node,
wherein each node is associated with at least one acceptance threshold and at least one centroid representative of files belonging to a category represented by said node,
wherein, for each node, said acceptance threshold and said centroid are usable for defining which file belongs to the category represented by said node,
wherein a value of an acceptance threshold of a node is dynamically updated based on data which is associated with said node during building of the clustering structure, thereby allowing said data to influence said acceptance threshold,
wherein the system is configured to, for data which does not meet an acceptance threshold of any node of the level li, create a new node in the level li, wherein a centroid of said new node is determined based at least on said data,
wherein, during said building, at least part of said plurality of data (Dsignal, 1, . . . , Dsignal, Z), or of said plurality of files (Dfile, 1, . . . , Dfile, Z) is each associated with one or more of the nodes of the clustering structure, thereby reflecting that said node is representative of a category of files and facilitating future identification of a category of a file based on said clustering structure.
19. A non-transitory storage device readable by a machine, tangibly embodying a program of instructions executable by the machine to perform a method of clustering files comprising:
obtaining a plurality of data (Dsignal, 1, . . . , Dsignal, Z) representative of a plurality of files (Dfile, 1, . . . , Dfile, Z) to be clustered,
building a clustering structure comprising a plurality of nodes arranged in hierarchical levels li, with i from 1 to N, wherein each node is representative of a category of files, wherein said category is representative of files sharing similarities,
wherein said building comprises, based on said plurality of data (Dsignal, 1, . . . , Dsignal, Z):
obtaining one or more nodes of level l1, wherein each node is associated with an acceptance threshold and a centroid representative of files belonging to a category represented by said node,
performing repetitively, for i=2 to N:
building one or more nodes of level li, wherein each node of level li is linked to a parent node of level li−1, with i from 2 to N, thereby indicating that each file belonging to a category represented by said node also belongs to a category represented by said parent node,
wherein each node is associated with at least one acceptance threshold and at least one centroid representative of files belonging to a category represented by said node,
wherein, for each node, said acceptance threshold and said centroid are usable for defining which file belongs to the category represented by said node,
wherein a value of an acceptance threshold of a node is dynamically updated based on data which is associated with said node during building of the clustering structure, thereby allowing said data to influence said acceptance threshold,
wherein the method comprises, for data which does not meet an acceptance threshold of any node of the level li, creating a new node in the level li, wherein a centroid of said new node is determined based at least on said data,
wherein, during said building, at least part of said plurality of data (Dsignal, 1, . . . , Dsignal, Z) or of said plurality of files (Dfile, 1, . . . , Dfile, Z) is each associated with one or more of the nodes of the clustering structure, thereby reflecting that said node is representative of a category of files and facilitating future identification of a category of a file based on said clustering structure.
2. The method of
for data for which said comparison does not meet an acceptance threshold of any nodes, performing at least one of (a), (b), (c) and (d):
a) providing an output that Dsignal or Dfile does not belong to any category of the clustering structure,
b) providing an output that Dsignal or Dfile does not belong to any category of level li of the clustering structure,
c) providing an output that Dsignal or Dfile does not belong to any sub-category of a category represented by said selected node,
d) triggering an action representative of a reject of data Dsignal or Dfile.
3. The method of
said stopping condition is met when said selected node is not a parent node of any node in a next level; or
said stopping condition is met when said comparison meets an acceptance threshold of said selected node, wherein said acceptance threshold is above a predefined confidence value.
4. The method of
providing an access to Dsignal or Dfile which is limited depending at least on a category determined for Dsignal or Dfile, and
identifying or tagging Dsignal or Dfile based on characteristics of one or more files previously identified as being associated with said selected node.
5. The method of
7. The method of
8. The method of
for i=1, comparing each of one or more data (Dsignal, 1, . . . , Dsignal, Z) to each centroid of each node in level l1,
obtaining one or more nodes of level l1, wherein each node is associated with an acceptance threshold and a centroid representative of files belonging to a category represented by said node,
performing repetitively, for i=2 to N:
building one or more nodes of level li, wherein each node of level li is linked to a parent node of level li−1, with i from 2 to N, thereby indicating that each file belonging to a category represented by said node also belongs to a category represented by said parent node,
wherein each node is associated with at least one acceptance threshold and at least one centroid representative of files belonging to a category represented by said node,
wherein, for each node, said acceptance threshold and said centroid are usable for defining which file belongs to the category represented by said node,
wherein a value of an acceptance threshold of a node is dynamically updated based on data which is associated with said node during building of the clustering structure, thereby allowing said data to influence said acceptance threshold,
for i>1, for a parent node of level li−1 with which a subset of data (Dsignal, 1, . . . , Dsignal, Z) is associated, comparing each data of said subset to each centroid of each node of level li which is linked to said parent node,
for said data,
when said comparison meets an acceptance threshold of one or more nodes, associating said data with one of said nodes,
when said comparison does not meet an acceptance threshold of any node, creating a new node in level li, wherein, for i>1, said new node is linked to said parent node of level li−1.
9. The method of
(A) after building nodes of level l1, wherein after said building each node is associated with a centroid having a first value and reflecting data associated with said node, performing at least once a verification comprising:
attempting to associate each data of said plurality of data (Dsignal, 1, . . . , Dsignal, Z) with a node of level l1, by determining whether a comparison of said data with a centroid of said node matches an acceptance threshold of said node, and updating said first value of said centroid of each of one or more nodes based on data associated with said node,
(B) after building nodes of level li, linked with a parent node of level li for at least one value of i>1, wherein after said building each node is associated with a centroid having a first value and reflecting data associated with said node, performing at least once a verification comprising:
attempting to associate each data associated with said parent node of level li with a node of level li linked with said parent node, by determining whether a comparison of said data with a centroid of said node matches an acceptance threshold of said node, and
updating said first value of said centroid of each of one or more nodes based on data associated with said node.
11. The system of
for data for which said comparison does not meet an acceptance threshold of any nodes, perform at least one of (a), (b), (c), (d) and (e):
a) provide an output that Dsignal or Dfile does not belong to any category of the clustering structure,
b) provide an output that Dsignal or Dfile does not belong to any category of level li of the clustering structure,
c) provide an output that Dsignal or Dfile does not belong to any sub-category of a category represented by said selected node,
d) trigger an action representative of a reject of data Dsignal or Dfile,
e) create a new node, wherein, for a level li with i>1, said new node is linked to said selected node, wherein a centroid of said new node is determined based at least on Dsignal.
12. The system of
said stopping condition is met when said selected node is not a parent node of any node in a next level; or
said stopping condition is met when said comparison meets an acceptance threshold of said selected node, wherein said acceptance threshold is above a predefined confidence value.
13. The system of
providing an access to Dsignal or Dfile which is limited depending at least on a category determined for Dsignal or Dfile, and
identifying or tagging Dsignal or Dfile based on characteristics of one or more files previously identified as being associated with said selected node.
14. The system of
16. The system of
for i=1, comparing each of one or more data (Dsignal, 1, . . . , Dsignal, Z) to each centroid of each node in level l1,
for i>1, for a parent node of level li−1 with which a subset of data (Dsignal, 1, . . . , Dsignal, Z) is associated, comparing each data of said subset to each centroid of each node of level li which is linked to said parent node,
for said data,
when said comparison meets an acceptance threshold of one or more nodes, associating said data with one of said nodes,
when said comparison does not meet an acceptance threshold of any node, creating a new node in level li, wherein, for i>1, said new node is linked to said parent node of level li−1.
17. The system of
(A) after building nodes of level l1, wherein after said building each node is associated with a centroid having a first value and reflecting data associated with said node, performing at least once a verification comprising:
attempting to associate each data of said plurality of data (Dsignal, 1, . . . , Dsignal, Z) with a node of level l1, by determining whether a comparison of said data with a centroid of said node matches an acceptance threshold of said node, and updating said first value of said centroid of each of one or more nodes based on data associated with said node,
(B) after building nodes of level li linked with a parent node of level li for at least one value of i>1, wherein after said building each node is associated with a centroid having a first value and reflecting data associated with said node, performing at least once a verification comprising:
attempting to associate each data associated with said parent node of level li with a node of level li linked with said parent node, by determining whether a comparison of said data with a centroid of said node matches an acceptance threshold of said node, and
updating said first value of said centroid of each of one or more nodes based on data associated with said node.
|
The presently disclosed subject matter relates to methods and systems for clustering data, such as files.
Systems and methods for clustering/classifying data are used in various technical fields. For example, a firm can store huge amounts of files in various servers, and it can be required to classify these files based on their nature or category.
Assume an example in which a data set is classified by this clustering method into three different clusters or categories (cluster 100, schematically represented by squares, cluster 110, schematically represented by circles, and cluster 120, schematically represented by crosses).
This clustering method suffers from several drawbacks.
Firstly, this clustering method is a supervised method, in which an operator has to define a priori the number of clusters. For example, in
Secondly, an operator has to provide “initial conditions”, that is to say, the operator has to perform some initialization of the clustering method. The quality of the clustering strongly depends on these initial conditions.
Thirdly, when new data is to be classified, this clustering method attempts to identify the closest cluster. For example, as shown in
This approach is not optimal, since, although the closest cluster is identified, this cluster can be in fact very far from the content of the new data (as shown for example in
Fourthly, when new data is associated with a cluster, all the other clusters need to be redefined (this drawback is known as “refactoring” in the art). This approach is therefore time consuming and requires high processing capability.
Lastly, when it is attempted to cluster new data, this new data needs to be compared to all existing clusters (this clustering method is thus an O(N) method, wherein N is the number of clusters). This approach is therefore time consuming and requires high processing capability.
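By way of non-limiting illustration only, the third and last drawbacks can be sketched with a generic nearest-centroid assignment. The function name `nearest_cluster`, the Euclidean distance and the sample coordinates are assumptions introduced for this sketch and are not taken from the clustering method discussed above.

```python
import math

def nearest_cluster(point, centroids):
    """Return (index, distance) of the closest centroid.

    A cluster is always returned, however far away it is, and every
    centroid is examined once, hence one comparison per cluster (O(N)).
    """
    best_index, best_distance = -1, math.inf
    for index, centroid in enumerate(centroids):
        distance = math.dist(point, centroid)  # Euclidean distance (assumed metric)
        if distance < best_distance:
            best_index, best_distance = index, distance
    return best_index, best_distance

# Three hypothetical cluster centers and a new point far from all of them.
centroids = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
index, distance = nearest_cluster((100.0, 100.0), centroids)
```

In this sketch the distant point is still attached to one of the three clusters, since no acceptance criterion exists that would allow it to be rejected or to seed a new cluster.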
There is now a need to provide new methods and systems for clustering data, and in particular files.
In accordance with certain aspects of the presently disclosed subject matter, there is provided a method of clustering files, comprising, by a processing unit:
In addition to the above features, the method according to this aspect of the presently disclosed subject matter can optionally comprise one or more of features (i) to (v) below, in any technically possible combination or permutation:
According to another aspect of the presently disclosed subject matter there is provided a method of clustering files, comprising, by a processing unit:
wherein, for each node, said acceptance threshold and said centroid are usable for defining which file belongs to the category represented by said node,
wherein, during said building, at least part of said plurality of data (Dsignal, 1, . . . , Dsignal, Z) or of said plurality of files (Dfile, 1, . . . , Dfile, Z) is each associated with one or more of the nodes of the clustering structure, thereby reflecting that said node is representative of a category of files and facilitating future identification of a category of a file based on said clustering structure.
In addition to the above features, the method according to this aspect of the presently disclosed subject matter can optionally comprise one or more of features (vi) to (x) below, in any technically possible combination or permutation:
According to another aspect of the presently disclosed subject matter there is provided a system for clustering files, comprising a processing unit configured to:
In addition to the above features, the system according to this aspect of the presently disclosed subject matter can optionally comprise one or more of features (xi) to (xiv) below, in any technically possible combination or permutation:
According to another aspect of the presently disclosed subject matter there is provided a system for clustering files, comprising a processing unit configured to:
wherein, for each node, said acceptance threshold and said centroid are usable for defining which file belongs to the category represented by said node,
wherein, during said building, at least part of said plurality of data (Dsignal, 1, . . . , Dsignal, Z), or of said plurality of files (Dfile, 1, . . . , Dfile, Z) is each associated with one or more of the nodes of the clustering structure, thereby reflecting that said node is representative of a category of files and facilitating future identification of a category of a file based on said clustering structure.
In addition to the above features, the system according to this aspect of the presently disclosed subject matter can optionally comprise one or more of features (xv) to (xviii) below, in any technically possible combination or permutation:
According to another aspect of the presently disclosed subject matter there is provided a non-transitory storage device readable by a machine, tangibly embodying a program of instructions executable by the machine to perform a method of clustering files comprising:
In addition to the above features, the non-transitory storage device according to this aspect of the presently disclosed subject matter can optionally perform a method comprising one or more of features (i) to (v) above, in any technically possible combination or permutation.
According to another aspect of the presently disclosed subject matter there is provided a non-transitory storage device readable by a machine, tangibly embodying a program of instructions executable by the machine to perform a method of clustering files comprising:
wherein, for each node, said acceptance threshold and said centroid are usable for defining which file belongs to the category represented by said node,
wherein, during said building, at least part of said plurality of data (Dsignal, 1, . . . , Dsignal, Z) or of said plurality of files (Dfile, 1, . . . , Dfile, Z) is each associated with one or more of the nodes of the clustering structure, thereby reflecting that said node is representative of a category of files and facilitating future identification of a category of a file based on said clustering structure.
In addition to the above features, the non-transitory storage device according to this aspect of the presently disclosed subject matter can optionally perform a method comprising one or more of features (vi) to (x) above, in any technically possible combination or permutation.
According to some embodiments, the proposed solution is able to classify huge numbers of files into categories of files sharing similarities.
In particular, according to some embodiments, the proposed solution can identify different versions of a file (e.g. a file which is updated over time by one or more users) and classify these versions into a single category.
According to some embodiments, the proposed solution is able to classify huge numbers of files into categories of files sharing similarities, thereby allowing handling access to these files based on profiles of users.
According to some embodiments, the proposed solution reduces time and processing required for classifying data such as files.
According to some embodiments, the proposed solution is unsupervised, and does not require an operator to define a priori the number of clusters/categories.
According to some embodiments, the proposed solution is unsupervised, and does not require an operator to provide a priori knowledge of the content of the data.
According to some embodiments, the proposed solution does not strongly depend on initial conditions provided by an operator.
According to some embodiments, when new data is to be clustered, the proposed solution does not require refactoring.
According to some embodiments, the proposed solution is adaptive to new data such as new files, and in particular, is able to create new clusters, and/or to reject new data/new files that do not fit with existing clusters/categories.
According to some embodiments, the proposed solution lets data/files dictate parameters of the clusters/categories, thereby proposing a customized and adaptive clustering.
In order to understand the invention and to see how it can be carried out in practice, embodiments will be described, by way of non-limiting examples, with reference to the accompanying drawings, in which:
In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of the invention. However, it will be understood by those skilled in the art that the presently disclosed subject matter may be practiced without these specific details. In other instances, well-known methods have not been described in detail so as not to obscure the presently disclosed subject matter.
Unless specifically stated otherwise, as apparent from the following discussions, it is appreciated that throughout the specification discussions utilizing terms such as “obtaining”, “comparing”, “selecting”, “associating”, “creating”, “identifying”, “tagging” or the like, refer to the action(s) and/or process(es) of a processing unit that manipulates and/or transforms data into other data, said data represented as physical, such as electronic, quantities and/or said data representing the physical objects.
The term “processing unit” covers any computing unit or electronic unit with data processing circuitry that may perform tasks based on instructions stored in a memory, such as a computer, a server, a chip, a processor, a hardware processor, etc. It encompasses a single processor or multiple processors, which may be located in the same geographical zone or may, at least partially, be located in different zones and may be able to communicate together.
The term “memory” as used herein should be expansively construed to cover any volatile or non-volatile computer memory suitable to the presently disclosed subject matter.
Embodiments of the presently disclosed subject matter are not described with reference to any particular programming language. It will be appreciated that a variety of programming languages may be used to implement the teachings of the presently disclosed subject matter as described herein.
The invention contemplates a computer program being readable by a computer for executing one or more methods of the invention. The invention further contemplates a machine-readable memory tangibly embodying a program of instructions executable by the machine for executing one or more methods of the invention.
For example, system 200 can have a read-access authorization to a server in order to extract content of the data 240 to be clustered.
The system 200 can comprise at least one processing unit 210 and at least one memory 220. According to some embodiments, the memory 220 is not part of the system 200 but can communicate with system 200 using a known wireless or wired communication network.
As explained hereinafter in the specification, memory 220 can store at least:
According to some embodiments, the system 200 can comprise, or can communicate with, a user interface 230. User interface 230 can comprise e.g. a display allowing a user to visualize output of the clustering. According to some embodiments, user interface 230 can comprise an interface (graphical interface, or physical interface such as a keyboard) allowing the user to perform requests and/or provide data to the system 200.
The system 200 (alone or in combination with another processing unit) can be used to perform one or more embodiments of the various methods described hereinafter.
Attention is now drawn to
The method can comprise operation 300, in which a plurality of data (Dsignal, 1, . . . , Dsignal, Z) representative of a plurality of pieces of data (Draw, 1, . . . , Draw, Z) to be clustered is obtained.
For example, assume a plurality of files (Draw, 1, . . . , Draw, Z) or (Dfile, 1, . . . , Dfile, Z) stored in various servers of a firm need to be clustered. Specific examples will be provided hereinafter.
Data (Dsignal, 1, . . . , Dsignal, Z) representative of this data (Draw, 1, . . . , Draw, Z) can include e.g. a binary content of these files, a vector and/or matrix representative of this binary content, a mathematical encoding of this binary content, etc.
According to some embodiments, data (Dsignal, 1, . . . , Dsignal, Z) can be obtained by performing a conversion of each data (Draw, 1, . . . , Draw, Z) into a corresponding signal using the method described in patent application U.S. Ser. No. 15/360,612.
According to some embodiments, (Dsignal, 1, . . . , Dsignal, Z) and (Draw, 1, . . . , Draw, Z) are equal, depending on the type of data to be clustered.
According to some embodiments, each data (Draw, 1, . . . , Draw, Z) can be an image, and each data (Dsignal, 1, . . . , Dsignal, Z) can comprise a vector or a matrix representative of pixels of each image.
According to some embodiments, each data (Draw, 1, . . . , Draw, Z) can be results of medical tests, and each data (Dsignal, 1, . . . , Dsignal, Z) can comprise a vector or a matrix representative of these results.
These examples are not limitative and various other data can be used.
(Dsignal, 1, . . . , Dsignal, Z) can comprise a mathematical representation of (Draw, 1, . . . , Draw, Z) that can be processed in the clustering method.
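By way of non-limitative illustration only, one possible way of turning the binary content of a file into a vector that a clustering method can process is sketched below. The byte-histogram encoding and the names `bytes_to_vector` and `file_to_signal` are assumptions made for this sketch; it is not the conversion method of patent application U.S. Ser. No. 15/360,612.

```python
from pathlib import Path


def bytes_to_vector(content):
    """Hypothetical encoding: normalized histogram of the 256 byte values."""
    counts = [0] * 256
    for b in content:
        counts[b] += 1
    total = len(content) or 1  # avoid division by zero for empty content
    return [c / total for c in counts]


def file_to_signal(path):
    """Read a file (Draw) and return its vector representation (Dsignal)."""
    return bytes_to_vector(Path(path).read_bytes())
```

A fixed-length vector of this kind lets files of different sizes be compared with the same similarity measure.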
The method can further comprise (operation 310), based at least on data (Dsignal, 1, . . . , Dsignal, Z), building a clustering structure comprising a plurality of nodes Nj,L
As explained hereinafter, each node Nj,L
The higher the index “i” of the level, the finer the division into categories, and the higher the resolution and differentiation between data/files (in other words, nodes of levels with a low index “i” represent general categories, whereas nodes of levels with a higher index “i” represent sub-categories of these general categories).
In addition, once the clustering structure is built according to the various embodiments described hereinafter, the acceptance threshold of a children node is generally stricter than the acceptance threshold of its parent node (stricter means that a higher correspondence with the centroid of the node is required to be associated with the node, and generally this implies that the acceptance threshold has a higher value).
This reflects the fact that the higher the index “i” of the level, the finer the division into categories (and therefore the higher the similarities between the files of a same node). This can be obtained in particular using a method in which the acceptance threshold of a node is influenced by data associated with this node (see e.g. an example of such a method in
A category of files is to be understood as a group of files sharing similarities (in other words, the system detects that the content of Dsignal representative of each of these files has some similarities, such as similar bytes stored in Dsignal, etc.—methods for detecting these similarities will be provided hereinafter).
A category of files is not necessarily an “explicit” category (that is to say that it is not always possible to give a name to the category which would summarize the common features of these files and would be of interest for the user, such as “invoice”, “receipt”), but in any case, when files are associated to a node, the system has identified that these files share similarities in their data Dsignal representative thereof, and therefore can be classified into a common category. The system can then perform various post-processing actions based on this knowledge, as explained hereinafter.
According to some embodiments, an analysis of these categories can be performed in order to understand the “name” or the “nature” of the category, in particular in the end nodes of the clustering structure, for which the differentiation is the highest. Examples of names of categories of files can include e.g.: files of the same nature (invoice category, receipt category, legal documents category, etc.), different releases of the same file (e.g. file X version 1, file X version 1.1, etc. will correspond to a common category “file X”), files sharing a common extension (executable files, PDF files, etc.), files storing similar content (files about history, files about geography, files about politics, etc.). These examples are not limitative and various other categories of files can be identified based on the use case.
In the intermediate nodes of the clustering structure, the nodes are also built to comprise files sharing similarities (that is to say, a category of files). However, these nodes do not necessarily correspond to explicit or useful categories which can be used by a user to classify these files. This is due to the fact that the acceptance threshold in these intermediate nodes is more flexible. In any case, although in the intermediate nodes an explicit name of the category is not necessarily identifiable, the system has identified that these files share similarities, and this is useful to build the next nodes up to the end nodes, which are more specific.
As explained hereinafter, each node Nj,L
The association of the node with the data can be stored temporarily (e.g. during at least some operations performed during the building of the clustering structure) for some nodes, and for other nodes, can be stored even after completion of the building of the clustering structure (e.g. for future use).
Specific examples will be provided hereinafter.
In addition, association of data with nodes can evolve over time, for example because new data are received, and/or because operations are performed to fine tune the definition of the categories represented by the nodes, as explained hereinafter.
Possible operations that can be performed for building a clustering structure will now be described.
The method can comprise building at least one, or a plurality of nodes Nj,L
Initially, if level L1 does not comprise any node, when first data (for instance, Dsignal, 1) is obtained, the method can comprise creating a first node N1,L
As explained hereinafter in the specification, each node Nj,L
In addition, each node Nj,L
Embodiments of methods of computing centroid CN
In some embodiments, acceptance threshold TN
In some embodiments, acceptance threshold TN
In other words, data itself can influence the acceptance threshold of the node to which it belongs.
In particular, in some embodiments, a user does not need to set any threshold for at least some of (or all) the nodes, and the data itself dictates the threshold which is dynamically updated based on data associated with the node.
Concerning the centroid CN
In some embodiments, CN
For example, an average of the data associated with a category represented by a node Nj,L
This is not limitative and other functions can be used to determine the centroid based on data associated with this node, such as: average Mahalanobis distance relative to a center of mass, sample that minimizes the distance to all others (Cross-distance matrix), etc. These examples are not limitative.
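By way of non-limitative illustration only, two of the centroid functions mentioned above can be sketched as follows: an average of the data of a node, and the sample that minimizes the distance to all others. The use of Euclidean distance and the function names are assumptions made for this sketch.

```python
import math


def mean_centroid(data):
    """Component-wise average of the vectors associated with a node."""
    n = len(data)
    return [sum(column) / n for column in zip(*data)]


def medoid_centroid(data):
    """Sample whose summed Euclidean distance to all other samples is smallest."""
    def total_distance(candidate):
        return sum(math.dist(candidate, other) for other in data)
    return min(data, key=total_distance)


samples = [[0.0, 0.0], [2.0, 0.0], [10.0, 0.0]]
print(mean_centroid(samples))    # [4.0, 0.0]
print(medoid_centroid(samples))  # [2.0, 0.0]
```

The average can be pulled away by an outlier, whereas the second function always returns an actual member of the node.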
In the example of
Assume a simple example in which the acceptance threshold is predefined for all nodes of level L1. As mentioned above, this is not limitative.
Assume in this example that centroid CN
The method can comprise obtaining further data (e.g. Dsignal, 2 to Dsignal, Z) and attempting to cluster this data.
For the first level L1, this can comprise (operation 400) comparing each data (Dsignal, 2, . . . , Dsignal, Z) to each centroid CN
If a comparison between this data and a centroid CN
For a given data, if this comparison matches the acceptance threshold TN
For example, the selected node can be the node for which the comparison matches the best the acceptance threshold TN
For example, assume the acceptance threshold requires a matching of TN
Since the data is now associated with this selected node, the method can comprise updating (operation 430) the centroid of this selected node. This update takes into account the new data that has been associated to this selected node at this stage.
For example, if the centroid is determined based on an average of the data associated to this node, then the updated centroid can be determined by taking into account this data in the computation of the new average.
If another function F is used to determine the centroid based on the data associated with the node, then this function F can be used to take into account this new data for updating the centroid. For example, assume that data Dold associated with a node has been used to calculate the centroid, and that new data Dnew is now associated with this node, then the new centroid can be calculated with a function F(Dold, Dnew). In some embodiments, and as described in
If the comparison did not meet an acceptance threshold TN
The method can comprise storing in a memory, at least temporarily, information representative of the fact that this data is now associated with this new node.
The centroid of this new node can be calculated based on this data.
If the centroid of a node is calculated based on an average of data corresponding to this node, then since this new node is at this stage only associated with this new data, the centroid of this new node can be set equal to this new data.
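By way of non-limitative illustration only, the building of the nodes of one level described above (comparison with each centroid, selection of the best matching node, update of its centroid, or creation of a new node) can be sketched as follows. Cosine similarity as the comparison, a single predefined threshold for the level, an average-based centroid, and the names `Node` and `build_level` are all assumptions made for this sketch.

```python
import math
from dataclasses import dataclass, field


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass
class Node:
    centroid: list
    threshold: float
    members: list = field(default_factory=list)


def build_level(data, threshold):
    """Process data one by one, in order of arrival, and return the nodes created."""
    nodes = []
    for d in data:
        # Select the node whose centroid matches the data best (if any node exists).
        best = max(nodes, key=lambda n: cosine(d, n.centroid), default=None)
        if best is not None and cosine(d, best.centroid) >= best.threshold:
            best.members.append(d)
            k = len(best.members)
            # Average-based centroid recomputed over all members of the node.
            best.centroid = [sum(col) / k for col in zip(*best.members)]
        else:
            # No node accepts the data: create a new node whose centroid is the data.
            nodes.append(Node(centroid=list(d), threshold=threshold, members=[d]))
    return nodes
```

Because data is processed in order of arrival, the result can depend on that order, which is what the verification method described hereinafter is meant to correct.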
In the example of
When data Dsignal, 2 is processed, only node N1,L
Assume that a comparison of Dsignal, 2 with centroid CN
The method can comprise storing in a memory, at least temporarily, an information indicating that Dsignal, 2 is associated with node N1,L
As a consequence, centroid CN
After data Dsignal, 2 has been processed, Dsignal, 3 can be processed.
Assume that a comparison of Dsignal, 3 with centroid CN
The method can comprise creating a new node N2,L
Centroid CN
If the centroid of a node is calculated based on an average of data corresponding to this node, then since this new node N2,L
When data Dsignal, 4 is processed, two nodes N1,L
If the centroid of a node is calculated based on an average of data corresponding to this node, then since this new node N3,L
After all data has been processed, a plurality of nodes can be created in level L1 (in some embodiments, only one node can be created—this is however not limitative).
In the example of
Each data of the data set is associated with one node of level L1. Each node is associated with a centroid reflecting data that has been associated with this node. In addition, each node is associated with an acceptance threshold.
Attention is now drawn to
As explained above, the centroid of a node can be calculated based on data associated with this node. Assume a function F(data) is used to determine the centroid of a node.
Assume that the centroid CN
It is now desired to update CN
According to some embodiments, the centroid CN
In a computer, calculation of current data generally relies on data stored in the random access memory (RAM).
This method avoids the need of importing each time the whole set of old data Dold in the RAM of the computer.
A simple example will now be provided when function F is an average function. However, this is not limitative, and the method can be used for other functions. Assume Dold comprises data D1 to DK and Dnew is data DK+1. Therefore,
The following relationship can be established:
In light of the foregoing, new centroid CN
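By way of non-limitative illustration only, when function F is an average function, the standard running-mean identity C_new = (K·C_old + D_{K+1}) / (K+1) allows the centroid to be updated from the stored centroid and the count K alone, without reloading D1 to DK. The function name `update_mean_centroid` is an assumption made for this sketch.

```python
def update_mean_centroid(old_centroid, old_count, new_data):
    """Return (new centroid, new count) after adding one vector to an average.

    Only the previous centroid and the number of vectors it summarizes are
    needed; the old vectors themselves do not have to be in memory.
    """
    new_count = old_count + 1
    new_centroid = [
        (old_count * c + x) / new_count
        for c, x in zip(old_centroid, new_data)
    ]
    return new_centroid, new_count


# Centroid [2.0, 4.0] summarizes 2 vectors; a third vector [5.0, 1.0] arrives.
print(update_mean_centroid([2.0, 4.0], 2, [5.0, 1.0]))  # ([3.0, 3.0], 3)
```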
According to some embodiments, the building method can comprise a verification method. This verification method can comprise operations to improve the precision of the clustering of the data into a plurality of nodes in a level (this method can be used for the first level, and/or also for other levels). Indeed, it may occur that given data is associated with a node due to its time of arrival (that is to say the time at which it was processed) but in fact, this data should be associated with another node (which e.g. was not yet created at the time this data was processed), or should be associated with a new node.
A possible embodiment of such a verification method is described in
After all nodes of level L1 have been created (and each data Dsignal, 1, . . . , Dsignal, Z has been associated with a node), a given number of nodes Nj,L
The verification method can comprise attempting to associate each data of said plurality of data (Dsignal, 1, . . . , Dsignal, Z) with a node Nj,L
In other words, the verification method comprises performing again a process of assigning data to the nodes, using the nodes that were created during the building process. The verification method differs from the previous iteration (building method described in
The verification method comprises (operation 705) comparing each data of said plurality of data (Dsignal, 1, . . . , Dsignal, Z) with the centroid CN
Similarly to the process described in
For a given data, if this comparison matches the acceptance threshold TN
If the comparison did not meet an acceptance threshold TN
The centroid of this new node can be calculated based on this data.
According to some embodiments, operation 720 (performed during the verification method) can differ from operation 420 in that following operation 720, the centroid of the node is not updated (however, see hereinafter that the centroid can be updated after completion of one iteration of the whole verification method), whereas following operation 420, the centroid of the node is generally updated accordingly (as shown in operation 730).
According to some embodiments:
This is however not mandatory and in some embodiments, the centroid of all nodes can be updated progressively following operation 720.
Operations 705, 710, 720 or 740 can be repeated until all data Dsignal, 1, . . . , Dsignal, Z has been processed and (possibly) associated with a node.
Following one iteration of the verification method, the method can comprise updating the centroid of the nodes (see operation 810 in
Following one iteration of the verification method, the method can comprise updating the threshold of the nodes (see operation 800 in
A possible embodiment of updating threshold of the nodes is described with reference to
It has to be noted that this method can be used after the verification method, but can also be used at different stages of the building process of the clustering structure, or at different stages of the update of the clustering structure when new data are received, and for any level of the clustering structure.
In addition, it is possible to omit the verification method and to update directly the acceptance threshold of the nodes (for example once all relevant data has been assigned to a node using the method of
Assume a threshold TN
The method can comprise, at a given time t, determining (operation 900) data (hereinafter Dlow) associated with a node Nj,L
In other words, this data Dlow was identified as sufficiently matching the category of the node (assume the comparison of this data Dlow with the centroid of this node provided a matching equal to Tlow, with Tlow matching threshold TN
The method can comprise (operation 910) updating the threshold of the node based on Dlow. In particular, the threshold TN
Following one iteration of the verification method, different scenarios can generally occur.
In some cases (scenario 1), the number of nodes following iteration of the verification method is not the same as the number of nodes obtained following the method of
Concerning the nodes that already existed following the method of
In some cases (scenario 2), the number of nodes following iteration of the verification method is the same as the number of nodes obtained following the method of
In some cases (scenario 3), the number of nodes following iteration of the verification method is the same as the number of nodes obtained following the method of
In some cases (scenario 4), the number of nodes following the iteration of the verification method is lower than the number of nodes obtained following the method of
In some cases (scenario 5), following the verification method, the nodes and their parameters (data associated with the nodes, centroid and threshold) are the same as following the method of
In the example of
According to some embodiments, the verification process can be performed more than one time.
In particular, in at least one of scenarios 1, 2, 3 and 4, the verification process can be repeated. Concerning scenario 5, since the verification process did not change any of the nodes, it is not useful to repeat again the verification process (indeed, this can indicate that the verification process has already converged).
In some embodiments, the verification process can be repeated (operations 705, 710, 720 or 740) until a convergence is obtained, that is to say that between two iterations, nodes and parameters of the nodes remain the same. This is however not mandatory.
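By way of non-limitative illustration only, a repeated verification pass can be sketched as follows: all data is re-assigned against centroids that are kept fixed during the pass, centroids are recomputed once the pass is complete, and the process stops when two successive iterations give the same centroids. Cosine similarity, a single threshold for the level, the removal of nodes left without data, and the name `verify` are assumptions made for this sketch.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mean(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def verify(data, centroids, threshold, max_iterations=20):
    """Re-assign all data to nodes until the centroids stop changing."""
    for _ in range(max_iterations):
        fixed = [list(c) for c in centroids]  # not updated during the pass
        groups = [[] for _ in fixed]
        for d in data:
            scores = [cosine(d, c) for c in fixed]
            best = max(range(len(scores)), key=scores.__getitem__, default=None)
            if best is None or scores[best] < threshold:
                # No node accepts the data: create a new node for it.
                fixed.append(list(d))
                groups.append([])
                best = len(fixed) - 1
            groups[best].append(d)
        # Centroids are updated only after the whole pass; empty nodes are dropped.
        updated = [mean(g) for g in groups if g]
        if updated == centroids:
            break  # convergence: nothing changed between two iterations
        centroids = updated
    return centroids
```

The `max_iterations` bound reflects the fact that repeating until convergence is optional.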
It has been described that a plurality of nodes can be created for the first level. It has to be noted that according to some embodiments, it is not necessary to predefine a threshold for the nodes of the first level (for example, this could be equal to zero). Indeed, as mentioned e.g. with reference to
Once first level L1 has been created, additional level(s)/layer(s) can be created.
Attention is drawn to
The method can comprise applying a method similar to the method of
Assume nodes of level Li−1 were already created, and that node(s) of level Li need to be created (for example L1 was created, and L2 needs to be created).
Assume level Li−1 comprises nodes Nj,L
Assume Nk,L
Based on node Nk,L
Data associated to parent node Nk,L
When first data Dsignal, P associated to parent node Nk,L
Operation 1001 is similar to operation 401.
Centroid of new node N1,L
Concerning the acceptance threshold of nodes Nj,L
According to some embodiments, this acceptance threshold can be predefined by a user, or pre-stored in a memory.
The method can further comprise processing other data (Dsignal, P+1, . . . , Dsignal, P+M) associated with parent node Nk,L
For each of this data, the method can comprise comparing (operation 1005) each data (Dsignal, P+1, . . . , Dsignal, P+M) to each centroid CN
If a comparison between this data and a centroid CN
For a given data, if this comparison matches the acceptance threshold TN
For example, the selected node can be the node for which the comparison matches the best the acceptance threshold TN
Since the data is now associated with this selected node, the method can comprise updating (operation 1030) the centroid of this selected node. This update takes into account the new data that has been associated to this selected node at this stage.
For example, if the centroid is determined based on an average of the data associated to this node, then the updated centroid can be determined by taking into account this data in the computation of the new average.
If another function F is used to determine the centroid based on the data associated to the node, then this function F can be used to take into account this new data for updating the centroid.
If the comparison did not meet an acceptance threshold TN
The method can comprise storing in a memory, at least temporarily, information representative of the fact that this data is now associated with this new node.
In addition, the method can comprise storing in a memory the link between this new node and parent node Nk,L
The centroid of this new node can be calculated based on this data.
If the centroid of a node is calculated based on an average of data corresponding to this node, then since this new node is at this stage only associated with this new data, the centroid of this new node can be set equal to this new data.
Operations 1005, 1010, 1020, 1030 (or 1040) can be repeated for each data associated with parent node Nk,L
As a consequence, children nodes associated with the parent node can be obtained in level Li. This can be performed for each parent node Nk,L
A non-limitative example of the method of
As shown, it is attempted to create children nodes for parent node N3,L
At the beginning, when data Dsignal, 2 is processed, there is no children node associated to parent node N3,L
When data Dsignal, 4 is processed, node N1,L
Similarly to what was described in
This verification method is similar to the method described above with reference to
One main difference is that in
This can be seen for example in
The verification method can comprise (after a first estimation of nodes Nj,L
Operation 1205 is generally similar to operation 705 and one can refer to the description of operation 705.
If a comparison between the data and a centroid CN
For a given data, if this comparison matches the acceptance threshold TN
Operation 1210 is similar to operation 710 and one can refer to the description of operation 710.
Operation 1220 is similar to operation 720 and one can refer to the description of operation 720.
If the comparison did not meet an acceptance threshold TN
The centroid of this new node can be calculated based on this data.
As already mentioned with reference to
Concerning the acceptance threshold, as already mentioned with reference to
As already mentioned above, the verification method can be repeated more than once.
In the example of
The method of
For example, in
In some cases, some of the parent nodes of level Li−1 will not provide additional children nodes in level Li (this can indicate that for this path the differentiation between the data is already precise enough in level Li−1) whereas some parent nodes will still provide additional children nodes (this indicates that data can be further differentiated).
Various methods can be used to indicate at which stage the building of the clustering structure can be stopped.
According to some embodiments, it can be defined (e.g. by a user, or as a pre-stored rule in memory 220) that building of the structure is stopped when one or more of the following condition(s) is/are met.
For all nodes for which all data belonging to this node matches the centroid of this node with a level of matching which complies with a stopping threshold TS, then it is not attempted to build any more children nodes for these nodes using the method of
For example, it can be defined that when data of nodes meets the corresponding centroid of their node with a level of matching which is equal to or higher than a stopping threshold of TS=0.9 or 0.99, then building of the clustering structure can be stopped (meaning that at this step it is not attempted to create further nodes in additional levels). These values are however not limitative.
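By way of non-limitative illustration only, the stopping condition based on a stopping threshold TS can be sketched as a check that every data of a node matches the centroid of this node at least as well as TS. Cosine similarity as the level of matching and the name `node_is_final` are assumptions made for this sketch.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def node_is_final(members, centroid, stopping_threshold=0.9):
    """True if no further children nodes need to be built for this node."""
    return all(cosine(m, centroid) >= stopping_threshold for m in members)
```

A node for which this check returns True can be treated as an end node of its path.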
It has been mentioned above (see
In other embodiments, it can be defined (e.g. by a user, or as a pre-stored rule in memory 220) that building of the clustering structure is stopped when the number of levels meets a threshold.
In some embodiments, the acceptance threshold of all nodes of a level can be pre-set (and is not necessarily updated based on the data). For example, it can be set that for level L1, the acceptance threshold is K1, for level L2, the acceptance threshold is K2, etc. (with Ki+1 being stricter than Ki).
In this case, it can be decided that the building of the clustering structure is stopped when a minimal number M of levels has been created (for all paths, or for at least some of the paths). This indicates that the data of level M all meets the centroid of their node with a level of matching which complies with the predefined acceptance threshold KM. If the user indicates a value for KM, then the system can automatically calculate M and can instruct when the building of the clustering structure should be stopped.
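By way of non-limitative illustration only, building a clustering structure with pre-set acceptance thresholds K1, K2, etc., stopped once the last pre-set level has been created, can be sketched as follows. Cosine similarity, an average-based centroid, the dictionary layout of a node and the names `cluster` and `build_tree` are assumptions made for this sketch.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def cluster(data, threshold):
    """Cluster data into the groups of one level, in order of arrival."""
    groups, centroids = [], []
    for d in data:
        scores = [cosine(d, c) for c in centroids]
        best = max(range(len(scores)), key=scores.__getitem__, default=None)
        if best is None or scores[best] < threshold:
            groups.append([d])
            centroids.append(list(d))
        else:
            groups[best].append(d)
            n = len(groups[best])
            centroids[best] = [sum(col) / n for col in zip(*groups[best])]
    return groups


def build_tree(data, level_thresholds):
    """Build one level per threshold; children only see data of their parent."""
    if not level_thresholds:
        return []  # the last pre-set level has been created: stop
    first, rest = level_thresholds[0], level_thresholds[1:]
    return [
        {"members": group, "children": build_tree(group, rest)}
        for group in cluster(data, first)
    ]


tree = build_tree([[1.0, 0.0], [0.9, 0.1], [0.6, 0.4], [0.0, 1.0]], [0.5, 0.99])
print(len(tree), [len(node["children"]) for node in tree])  # 2 [2, 1]
```

In this example the flexible first-level threshold groups three vectors together, and the stricter second-level threshold separates them into two children nodes.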
A non-limitative example of a clustering structure is provided in
As explained above, it is not necessary to define a priori the number of categories and their content. Once the nodes have been created, each node represents a category, but the system does not necessarily know at this stage the content of this category. For example, assume a plurality of files of a firm have been clustered using the methods described above. The end nodes of the clustering structure will automatically each represent a different category (for example, a first end node will comprise “receipts”, a second end node will comprise “invoices”, etc., but the system does not necessarily have an a priori knowledge of the name of each category). In some embodiments, the nature/name of each category can be deduced e.g. by the system from the content of the data stored in a given node. Assume that at least some of the files of an end node have a tag indicating that they belong to a receipt or to an invoice file. Then if the tag “receipt” is detected in one file or in a plurality of files of the node, this indicates that this node represents “receipts”. If the tag “invoice” is detected in one file of the node, this indicates that this node represents “invoices”, etc. In other words, the system can deduce the nature of a node based on characteristics of data associated with this node.
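By way of non-limitative illustration only, deducing the name of a category from tags carried by at least some files of a node can be sketched as follows. Representing each file by a set of tags, choosing the most frequent tag, and the name `infer_category_name` are assumptions made for this sketch.

```python
from collections import Counter


def infer_category_name(file_tags):
    """Return the most frequent tag among the files of a node, or None.

    file_tags holds one set of tags per file; untagged files are empty sets.
    """
    counts = Counter(tag for tags in file_tags for tag in tags)
    if not counts:
        return None  # no file of the node carries a tag
    return counts.most_common(1)[0][0]


print(infer_category_name([{"receipt"}, {"receipt"}, set(), {"invoice"}]))  # receipt
```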
Attention is now drawn to
Assume a clustering structure has been built using the various methods described above (see reference 1400). As already explained, this clustering structure comprises a plurality of nodes Nj,L
The method comprises obtaining (operation 1401) data Dsignal representative of a piece of data Draw to be assigned to a category. Various examples have been provided above for Dsignal and Draw. For example, Draw is a file and Dsignal is a vector or matrix representative of the binary content of this file. This is however not limitative.
It is now desired to cluster this data using the clustering structure. This data is typically new data that was not used in the data set from which the clustering structure was built using the methods described above. Indeed, if this data was already processed during the building of the clustering structure, then the system can detect that similar data is already associated e.g. with an end node of the clustering structure, and can output the corresponding category.
The method comprises, for i=1, comparing Dsignal to each centroid CN
In other words, it is attempted to identify which node of the first level L1 best matches data Dsignal.
If this comparison meets an acceptance threshold TN
If the comparison meets an acceptance threshold TN
According to some embodiments, centroid CN
According to some embodiments, acceptance threshold TN
At this stage, it has been identified that data Dsignal belongs to a category represented by Np,L
However, if the comparison (operation 1404) does not meet an acceptance threshold TN
According to some embodiments, the method can comprise “rejecting” the data (operation 1410). This can comprise providing an output that Dsignal or Draw does not belong to any of the categories of the clustering structure. This output can be e.g. provided to a user through user interface 230.
According to some embodiments, the method can comprise creating (operation 1409) a new node in the level and associating data Dsignal or Draw with this new node.
In some embodiments, an output can be provided (e.g. to the user) indicating that Dsignal or Draw belongs to a category represented by a new node.
According to some embodiments, a centroid can be calculated for this new node based on Dsignal. Embodiments for calculating the centroid of a node have been provided above.
In addition, an acceptance threshold can be assigned to this new node. This acceptance threshold can be set by a user, or can be predefined for all nodes of this level.
When data is associated to an existing node of the clustering structure of the first level, it can be attempted to identify which nodes of the subsequent level(s) Li, with i>1 (and which are linked to the node identified in the previous level), best match data Dsignal.
The method can comprise increasing i by one (see reference 1420—therefore i>1) and comparing Dsignal to each centroid CN
If this comparison meets an acceptance threshold TN
If the comparison meets an acceptance threshold TN
In some embodiments, parameters of node Np,L
According to some embodiments, centroid CN
According to some embodiments, acceptance threshold TN
At this stage, it has been identified that data Dsignal belongs to a category represented by Np,L
The method can be repeated iteratively, by reverting to operation 1403.
However, if the comparison does not meet an acceptance threshold TN
According to some embodiments, the method can comprise “rejecting” the data (operation 1410).
This can comprise providing an output that Dsignal or Draw does not belong to any of the categories of this level of the clustering structure. This output can be e.g. provided to a user through user interface 230.
This can comprise providing an output that Dsignal or Draw does not belong to any of the sub-categories represented by node Np,L
If the user is interested only in the categories represented by the end nodes (nodes which do not have children nodes in subsequent levels and which represent the narrowest categories), and data Dsignal could not be assigned to any of these end nodes, the method can comprise providing an output that Dsignal or Draw does not belong to any of the relevant categories of the clustering structure. This output can be e.g. provided to a user through user interface 230.
According to some embodiments, the method can comprise creating (operation 1409) a new node in level Li and associating data Dsignal or Draw with this new node. This new node is linked to parent node Np,L
According to some embodiments, a centroid can be calculated for this new node based on Dsignal. Embodiments for calculating the centroid of a node have been provided above.
In addition, an acceptance threshold can be assigned to this new node. This acceptance threshold can be set by a user, or can be predefined for all nodes of this level. In some embodiments, it can be set equal to the acceptance threshold of the parent node Np,L
In some embodiments, an output can be provided (e.g. to the user) indicating that Dsignal or Draw belongs to a category represented by a new node.
The method described in
According to some embodiments, when i=imax, the method can be stopped (reference 1406).
According to some embodiments, imax is reached when an end node of the clustering structure has been reached. In other words, this means that node Np,L
According to some embodiments, imax is reached when the comparison of operation 1404 meets an acceptance threshold TN
In
In
In
Once a relevant node (assume the relevant node is Np,L
Various methods for clustering data Dsignal based on the clustering structure have been described.
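By way of non-limitative illustration only, assigning data Dsignal to a category by descending the clustering structure level by level (best matching node, check against its acceptance threshold, then its children nodes) can be sketched as follows. Cosine similarity, the embodiment in which non-matching data is rejected rather than given a new node, and the names `Node` and `classify` are assumptions made for this sketch.

```python
import math
from dataclasses import dataclass, field


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass
class Node:
    name: str
    centroid: list
    threshold: float
    children: list = field(default_factory=list)


def classify(signal, first_level_nodes):
    """Return the names of the nodes accepting the data, from level 1 downwards.

    An empty list means the data was rejected at the first level; a list that
    stops before an end node means no child node of the last entry accepted it.
    """
    path = []
    candidates = first_level_nodes
    while candidates:
        best = max(candidates, key=lambda n: cosine(signal, n.centroid))
        if cosine(signal, best.centroid) < best.threshold:
            break  # no node of this level accepts the data
        path.append(best.name)
        candidates = best.children  # only children of the selected node
    return path


structure = [
    Node("A", [1.0, 0.0], 0.5, [
        Node("A1", [1.0, 0.1], 0.99),
        Node("A2", [0.7, 0.7], 0.99),
    ]),
    Node("B", [0.0, 1.0], 0.5),
]
print(classify([1.0, 0.05], structure))  # ['A', 'A1']
print(classify([-1.0, 0.0], structure))  # []
```

Returning the whole path, rather than only the last node, also exposes the intermediate categories to which the data belongs.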
If a plurality of data Dsignal is obtained (e.g. Dsignal,1, . . . Dsignal,Z′) and need to be clustered, according to some embodiments, each data can be processed individually according to the various methods described above (see
According to other embodiments, if a plurality of data Dsignal is obtained (e.g. Dsignal,1, . . . Dsignal,Z′), this data can be processed similarly to what was performed for building the clustering structure. In other words, instead of processing each data individually until it reaches an end node of the clustering structure, the data set can be processed together at each level, similarly to the building process described in
This method can comprise, for each data of data set (Dsignal,1, . . . Dsignal,Z′):
During these operations, centroid and threshold of the nodes can be updated as already explained in the various embodiments above.
Once each data has been associated to nodes of the first level, the method can then attempt to identify nodes of the subsequent levels which match each data. This can comprise, for each data of data set (Dsignal,1, . . . Dsignal,Z′), and for each level Li, with i>1:
During these operations, centroid and threshold of the nodes can be updated as already explained in the various embodiments above.
When data reaches an end node, the method can be stopped for this data, since this indicates that the relevant category has been obtained. Other criteria can be used to assess when the method can be stopped, as already explained in the various embodiments above.
It thus appears that this method combines clustering of data and training/update of the clustering structure using a plurality of (new) data.
Assume that the clustering structure was built using a data set Dold, and that a new data set Dnew is then received. If necessary, the same method can be applied to an aggregated data set comprising Dold and Dnew. In other words, the old data is re-clustered together with the new data.
Generally, the clustering structure is stable enough, since it was trained using a data set which is large relative to the new data set, and it is therefore sufficient to cluster each new data individually. This is however not limitative.
The clustering method described can be used for various applications. In some embodiments, it is attempted to cluster files (e.g. text files, executable files, presentations, etc.). Assume a clustering structure was built using a large data set obtained from a scan of one or more servers of a firm. Then, periodically, the method can comprise scanning the server to get new files. If data Dsignal which is obtained was already clustered in the past (this can be detected by comparing the obtained data with the data already associated with the end nodes of the clustering structure), then a corresponding output can be produced, indicating that the category of this data is known.
If data Dsignal which is obtained is unknown to the clustering structure, then the clustering method can be applied, and a corresponding output (e.g. category, or rejection) can be produced. This periodic scan of the content of the servers of the firm can be performed e.g. every day, or every week, but this is not limitative.
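The periodic scan described above can be sketched as follows. This is only an illustration under stated assumptions: the comparison with already clustered data is approximated here by an exact SHA-256 match on the file content, and the names signature, process_scan and cluster_fn are hypothetical. The callable cluster_fn stands for any of the clustering methods described above and is assumed to return a category, or None for a rejection.

```python
import hashlib


def signature(content: bytes) -> str:
    """Exact-match fingerprint of a file (assumption: SHA-256 of the bytes)."""
    return hashlib.sha256(content).hexdigest()


def process_scan(files, known, cluster_fn):
    """Handle one scan.

    files:      dict mapping file name -> file content (bytes)
    known:      dict mapping signature -> category, updated in place
    cluster_fn: callable returning a category, or None for a rejection
    """
    report = {}
    for name, content in files.items():
        sig = signature(content)
        if sig in known:  # already clustered in a previous scan
            report[name] = ("known", known[sig])
            continue
        category = cluster_fn(content)
        if category is None:
            report[name] = ("rejected", None)
        else:
            known[sig] = category
            report[name] = ("clustered", category)
    return report
```

Calling process_scan once per day or once per week with the same known dictionary reproduces the behavior described above: a file seen in an earlier scan is reported as known without being clustered again.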
Attention is drawn to
According to some embodiments, the method can comprise identifying or tagging data Dsignal or Draw/Dfile (see operations 1800-1810). In particular, this can be performed based on data belonging to a category represented by node Np,L
Indeed, a memory (e.g. memory 220) can store, for node Np,L
Assume this data has some characteristics CT. Since Dsignal or Draw/Dfile has been identified as belonging to Np,L
Assume for example that node Np,L
In some embodiments, if at least some of these files have a certain common tag (e.g. a tag which represents an invoice, salaries of employees, budget of the firm, internal report, specific policies associated with this file, such as list of persons who should receive this file, etc.—this list is not limitative), then the method can comprise tagging Dsignal or Draw/Dfile with the same tag. Therefore, a powerful tool is provided to automatically tag Dsignal or Draw/Dfile.
Operations which can be performed for automatically tagging data (such as a file) are described in
This can comprise obtaining (operation 1820) a clustering structure in which each end node (i.e. each node which is not linked with "children" nodes in the subsequent levels) of the clustering structure is associated with one or more data. Assume that for each end node, at least one of these data is associated with at least one tag.
When a new data is received and has to be clustered, it is attempted to identify an end node of the clustering structure (operation 1830) which best matches this new data (various methods have been described above for clustering new data based on the clustering structure).
Assume end node N has been identified, which is associated with one or more data Dold. Assume that at least one data Dold is tagged with tag T.
The new data can be tagged (operation 1840) with the same at least one tag T.
Each time a new data is received and associated with an end node, a corresponding tag of data associated with this end node can be determined and can be used to automatically tag this new data.
In some cases, after building of the clustering structure, each end node can be associated with one or more tags (based on tagged data that was clustered in this end node during building of the clustering structure and/or during update of the clustering structure). Then, each time new data is associated with this end node, it can be automatically tagged accordingly.
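Operations 1820 to 1840 can be sketched as follows, assuming that the end node matching the new data has already been identified. The data layout (a dictionary mapping each data identifier to its set of tags) and the names tags_of_node and tag_new_data are hypothetical. The sketch propagates every tag found on any data of the end node, which is one possible policy among others.

```python
def tags_of_node(members):
    """Union of the tags carried by the data associated with an end node.

    members: dict mapping data identifier -> set of tags (possibly empty)
    """
    tags = set()
    for tagset in members.values():
        tags |= tagset
    return tags


def tag_new_data(members, new_id):
    """Associate new data with the end node and tag it with the node's tags."""
    inherited = tags_of_node(members)
    members[new_id] = set(inherited)
    return inherited


# Usage: one old file of the end node is tagged "invoice".
end_node = {"old_1": {"invoice"}, "old_2": set()}
new_tags = tag_new_data(end_node, "new_file")
```

A stricter variant could propagate a tag only when a given share of the data of the end node carries it.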
Attention is drawn to
According to some embodiments, the method can comprise managing an access to Dsignal and/or Draw based on the category or node identified for Dsignal/Draw.
In particular, this can comprise e.g. providing an access to Dsignal and/or Draw which is limited depending on a profile of a user.
An example can be that Draw is a file. Files identified as salaries of employees can be opened only by the management of the firm and by the employee concerned. Files identified as secret documents can be opened only by users with the relevant authorization. Files identified as general documents of the firm can be opened by any employee of the company but not by persons who are external to the company. This example is not limitative.
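The example above can be sketched as a category-based access check. The category names, the fields of the user profile, and the names POLICY and can_open are hypothetical. Denying access to files of an unknown category is an assumption of this sketch, not a requirement stated above.

```python
# One rule per category; each rule receives the user profile and the file record.
POLICY = {
    # Salary files: management, or the employee the file relates to.
    "salary": lambda user, f: user.get("role") == "management"
    or user.get("id") == f.get("employee_id"),
    # Secret documents: users holding the relevant authorization.
    "secret": lambda user, f: "secret" in user.get("clearances", ()),
    # General documents: any employee, but no external person.
    "general": lambda user, f: user.get("employee", False),
}


def can_open(user, file_info):
    """Return True if the user may open a file of the identified category."""
    rule = POLICY.get(file_info.get("category"))
    return bool(rule and rule(user, file_info))  # unknown category: deny


manager = {"id": 1, "role": "management", "employee": True}
alice = {"id": 2, "role": "staff", "employee": True}
guest = {"id": 3, "role": "guest", "employee": False}
payslip = {"category": "salary", "employee_id": 2}

print(can_open(manager, payslip), can_open(alice, payslip), can_open(guest, payslip))
```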
Another application of the clustering method can be an identification of similar releases of a file. Generally, in a company, a given file is created and then updated over time. For example, a file is created as file F0 at time t0, then updated to file F1 at time t1, file F2 at time t2, etc.
The clustering structure can be trained to comprise end nodes which reflect files which correspond to different releases of the same original file.
For example, the clustering structure can be built based on file F0 and therefore an end node corresponding to this file can be built.
Then, when the system receives files F1 and F2, it can detect that they belong to the same category as file F0. The user can thus receive an output indicating that files F1 and F2 belong to the same category as file F0, and therefore, are different releases of the same file.
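Once each file has been associated with an end node, the output described above can be produced by grouping the files per end node. The following sketch assumes that each file comes with a timestamp and with the identifier of its end node; the tuple layout and the name group_releases are hypothetical.

```python
from collections import defaultdict


def group_releases(assignments):
    """Group files by end node and order each group by time.

    assignments: iterable of (file_name, timestamp, end_node_id)
    Returns a dict mapping end_node_id -> file names, oldest release first.
    """
    families = defaultdict(list)
    for name, timestamp, node_id in assignments:
        families[node_id].append((timestamp, name))
    return {node: [name for _, name in sorted(items)]
            for node, items in families.items()}


# Usage: F0, F1 and F2 were associated with the same end node "n7".
releases = group_releases([("F1", 1, "n7"), ("F0", 0, "n7"),
                           ("G0", 0, "n2"), ("F2", 2, "n7")])
```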
Attention is drawn to
As shown, the first level comprises two nodes. The acceptance threshold of the first node is equal to 0.31 and the acceptance threshold of the second node is equal to 0.35.
The second level comprises four nodes. The acceptance threshold of the first node (linked to the first node of the first level) is equal to 0.55, the acceptance threshold of the second node (linked to the first node of the first level) is equal to 0.65, the acceptance threshold of the third node (linked to the second node of the first level) is equal to 0.95 and the acceptance threshold of the fourth node (linked to the second node of the first level) is equal to 0.95.
The third level comprises four nodes. The acceptance threshold of the first node (linked to the first node of the second level) is equal to 0.9, the acceptance threshold of the second node (linked to the first node of the second level) is equal to 0.9, the acceptance threshold of the third node (linked to the second node of the second level) is equal to 0.99 and the acceptance threshold of the fourth node (linked to the second node of the second level) is equal to 0.65 (this is due to the fact that the acceptance threshold of this node was set equal to its parent node, and since only one file was associated with this node, this acceptance threshold was not updated).
One can see that the higher the level in the clustering structure, the higher the acceptance threshold (since finer clustering is obtained).
If, for example, in the second node of the third level, one of the files is tagged as an invoice of suppliers, it can be deduced that all files of this node are invoices of suppliers, and these files can be tagged, handled or categorized accordingly. This is however not limitative.
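The example structure above can be written down and checked with a short sketch. The node labels (such as "L2-3") and the function names are hypothetical; the thresholds and parent links are those given above. The check shows that the acceptance threshold never decreases from a parent node to its child node, which is the sense in which the threshold grows with the level: the comparison holds along each branch, not between arbitrary nodes of different levels (0.65 in the third level is lower than 0.95 in the second level, but these nodes belong to different branches).

```python
# label -> (parent label or None, acceptance threshold)
NODES = {
    "L1-1": (None, 0.31),
    "L1-2": (None, 0.35),
    "L2-1": ("L1-1", 0.55),
    "L2-2": ("L1-1", 0.65),
    "L2-3": ("L1-2", 0.95),
    "L2-4": ("L1-2", 0.95),
    "L3-1": ("L2-1", 0.90),
    "L3-2": ("L2-1", 0.90),
    "L3-3": ("L2-2", 0.99),
    "L3-4": ("L2-2", 0.65),  # threshold inherited from its parent, never updated
}


def thresholds_non_decreasing(nodes):
    """True if no child has a lower threshold than its parent."""
    return all(thr >= nodes[parent][1]
               for parent, thr in nodes.values() if parent is not None)


def path_thresholds(nodes, label):
    """Thresholds met from the first level down to the given node."""
    path = []
    while label is not None:
        parent, thr = nodes[label]
        path.append(thr)
        label = parent
    return path[::-1]
```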
Attention is now drawn to
Assume that a first clustering structure was built for files of server A, and that a second clustering structure was built for files of server B. Assume that for security reasons, files of server A should be separated from files of server B. In other words, files of server A should not be accessed by server B (and in some embodiments conversely).
Assume a new file is received by server B, which is in fact a file which is authorized only to server A. The clustering method applied at server B will indicate that this file belongs to a category which is unknown to server B, and therefore, should be rejected.
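The rejection at server B can be sketched as a check against the first level of the clustering structure of server B. This is an illustration only: the representation of a file as a numeric feature vector, the cosine similarity, the restriction to first-level nodes, and the name screen_file are assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors (assumed similarity measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def screen_file(vec, first_level):
    """Accept the file only if some first-level node of this server accepts it.

    first_level: list of (centroid, acceptance_threshold) pairs
    """
    for centroid, threshold in first_level:
        if cosine(vec, centroid) >= threshold:
            return "accept"
    return "reject"  # category unknown to this server


# Usage: server B knows a single category, centered on [1.0, 0.0].
server_b = [([1.0, 0.0], 0.9)]
print(screen_file([0.9, 0.1], server_b))  # similar to the known category
print(screen_file([0.0, 1.0], server_b))  # unrelated to the known category
```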
It is to be noted that the various features described in the various embodiments may be combined according to all possible technical combinations.
It is to be understood that the invention is not limited in its application to the details set forth in the description contained herein or illustrated in the drawings. The invention is capable of other embodiments and of being practiced and carried out in various ways. Hence, it is to be understood that the phraseology and terminology employed herein are for the purpose of description and should not be regarded as limiting. As such, those skilled in the art will appreciate that the conception upon which this disclosure is based may readily be utilized as a basis for designing other structures, methods, and systems for carrying out the several purposes of the presently disclosed subject matter.
Those skilled in the art will readily appreciate that various modifications and changes can be applied to the embodiments of the invention as hereinbefore described without departing from its scope, defined in and by the appended claims.
Executed on | Assignor | Assignee | Conveyance | Frame | Reel | Doc |
Oct 08 2018 | MINEREYE LTD. | (assignment on the face of the patent) | / | |||
Nov 21 2018 | ATIAS, AVNER | MINEREYE LTD | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 052026 | /0902 | |
Nov 21 2018 | AVIDAN, YANIV | MINEREYE LTD | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 052026 | /0902 | |
Aug 01 2024 | MINEREYE LTD | MINEREYE TECHNOLOGIES LTD | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 068987 | /0868 |
Date | Maintenance Fee Events |
Oct 08 2018 | BIG: Entity status set to Undiscounted (note the period is included in the code). |
Oct 29 2018 | SMAL: Entity status set to Small. |
Aug 16 2024 | M2551: Payment of Maintenance Fee, 4th Yr, Small Entity. |