A computer processing method includes receiving network data; filtering a node with a degree higher than a predefined threshold in the network data; storing the filtered node and its neighborhood relationship; clustering the filtered network data to obtain primary group(s); and obtaining a final group based on the filtered node and its neighborhood relationship and the primary group(s). The computer processing method and a corresponding system are applicable to large-scale network data, greatly reduce the processing time for clustering such data, and lend themselves to parallel implementation.
1. A computer processing method for network data, comprising:
receiving network data;
filtering a node with a degree higher than a predefined threshold from the network data;
storing the filtered node and its neighborhood relationship;
clustering the filtered network data to obtain at least one primary group excluding the filtered node; and
obtaining at least one final group by adding the filtered node to the at least one primary group based on the stored neighborhood relationship.
10. A computer system for processing network data, comprising:
memory; and
at least one processor coupled to said memory, the at least one processor being configured:
to receive network data;
to filter a node with a degree higher than a predefined threshold in the network data;
to store the filtered node and its neighborhood relationship;
to cluster the filtered network data to obtain at least one primary group excluding the filtered node; and
to obtain a final group by adding the filtered node to the at least one primary group based on the stored neighborhood relationship.
19. A computer program product comprising a non-transitory computer readable storage medium having computer readable program code embodied therewith, said computer readable program code comprising:
computer readable program code configured to receive network data;
computer readable program code configured to filter a node with a degree higher than a predefined threshold in the network data;
computer readable program code configured to store the filtered node and its neighborhood relationship;
computer readable program code configured to cluster the filtered network data to obtain at least one primary group excluding the filtered node; and
computer readable program code configured to obtain a final group by adding the filtered node to the at least one primary group based on the stored neighborhood relationship.
2. The method according to
based on the stored neighborhood relationship, establishing a mapping between the filtered node and the at least one primary group;
determining whether the filtered node belongs to the at least one primary group; and
in response to determining that the filtered node belongs to the at least one primary group, merging the filtered node into the at least one primary group.
3. The method according to
in response to merging all the filtered nodes into their corresponding primary groups, regarding the primary groups as the final group.
4. The method according to
clustering subnetwork data composed by the filtered nodes to form a new group; and
incorporating the new group into the final group.
5. The method according to
calculating degree distribution of all the nodes in the network data; and
selecting a degree of any node from a certain percentage range of nodes with high degrees in all the nodes, as the predefined threshold.
6. The method according to
7. The method according to
determining the primary groups including at least one node in the neighborhood relationship of the filtered node; and
associating the filtered node with the determined primary groups.
8. The method according to
based on the stored neighborhood relationship, establishing a mapping between the filtered node and the at least one primary group;
determining whether the filtered node belongs to the at least one primary group; and
in response to determining that the filtered node belongs to the at least one primary group, merging the filtered node into the at least one primary group;
and wherein the determining whether the filtered node belongs to the at least one primary group includes:
calculating an average degree of the nodes in the at least one primary group;
calculating an actual association degree of the filtered node with the nodes in the at least one primary group;
determining whether the actual association degree is larger than the average degree; and
in response to determining the actual association degree is larger than the average degree, determining the filtered node belongs to the at least one primary group.
9. The method according to
11. The computer system according to
to, based on the stored neighborhood relationship, establish a mapping between the filtered node and the at least one primary group;
to determine whether the filtered node belongs to the at least one primary group; and
to, in response to determining that the filtered node belongs to the at least one primary group, merge the filtered node into the at least one primary group.
12. The computer system according to
to, in response to merging all the filtered nodes into their corresponding primary groups, regard the primary groups as the final group.
13. The computer system according to
to cluster subnetwork data composed by the filtered nodes to form a new group; and
to incorporate the new group into the final group.
14. The computer system according to
to statistically calculate degree distribution of all the nodes in the network data; and
to select a degree of any node from a certain percentage range of nodes with high degrees in all the nodes, as the predefined threshold.
15. The computer system according to
16. The computer system according to
to determine the primary groups including at least one node in the neighborhood relationship of the filtered node; and
to associate the filtered node with the determined primary groups.
17. The computer system according to
to calculate an average degree of the nodes in the at least one primary group;
to calculate an actual association degree of the filtered node with the nodes in the at least one primary group;
to determine whether the actual association degree is larger than the average degree; and
to, in response to determining the actual association degree is larger than the average degree, determine the filtered node belongs to the at least one primary group.
18. The computer system according to
This application claims foreign priority to P.R. China Patent application 201110076719.X filed 29 Mar. 2011, the complete disclosure of which is expressly incorporated herein by reference in its entirety for all purposes.
The present invention generally relates to the information processing technology field, and in particular, to a computer processing method and system for network data.
Nowadays, as information technology, and network technology in particular, develops, information is transferred between information nodes, and a large amount of network data reflecting the relations between those nodes exists on the network. For such large volumes of large-scale network data, there are many analysis requirements, namely how to find the relationships between these information nodes, for example to detect nodes with abnormal behavior in the network or to filter junk e-mail.
However, when processing large-scale network data that includes many nodes, for example when the number of nodes in the network data to be processed reaches 10^5 or more, the existing technology is inadequate, or even helpless.
Thus, it is desirable to provide a computer processing method and system for network data.
One aspect of the invention provides a computer processing method for network data, comprising: receiving network data; filtering a node with a degree higher than a predefined threshold in the network data; storing the filtered node and its neighborhood relationship; clustering the filtered network data to obtain primary group(s); and obtaining a final group based on the filtered node and its neighborhood relationship and the primary group(s).
Another aspect of the invention provides a computer system for processing network data, comprising: a receiving means, configured to receive network data; a filtering means, configured to filter a node with a degree higher than a predefined threshold in the network data; a storing means, configured to store the filtered node and its neighborhood relationship; a clustering means, configured to cluster the filtered network data to obtain primary group(s); and a final grouping means, configured to obtain a final group based on the filtered node and its neighborhood relationship and the primary group(s).
The computer processing method and system provided by the invention can accelerate network data processing and are applicable to large-scale network data, for which the clustering time is greatly reduced. The invention can also be parallelized, which facilitates its implementation.
The features and advantages of the embodiments of the invention will be particularly explained with reference to the appended drawings. If possible, the same or like reference number denotes the same or like component in the drawings and the description. In the drawings:
Below, the exemplary embodiments of the invention will be described in detail with reference to the drawings in which the embodiments of the invention are illustrated, and like reference numbers always indicate the same element. It should be understood that the invention is not limited to the disclosed exemplary embodiments. It should also be understood that not every feature of the method and apparatus is necessary for implementing the invention to be protected by any claim. In addition, throughout the disclosure, when a process or method is shown or described, the steps of the method can be executed in any order or simultaneously, unless it is clear from the context that one step depends on another previously executed step. In addition, there may be a significant time interval between the steps.
Generally, the extent of association between nodes in network data is referred to as a degree by a person skilled in the art. For example, if a node V1 is associated with 5 other nodes, the node V1 has a degree of 5 in the network data. If each node in the network data is considered a point and lines are drawn between associated nodes, a graph (also referred to interchangeably as a map) is formed. Embodiments of the invention are applicable to both directed and undirected network data. The inventor particularly noted during study and practice that, in large-scale network data, the associations between nodes are usually not uniform: some nodes are tightly associated with many other nodes, but most nodes are associated with only a few. The invention is based on this natural non-uniformity.
In step 203, a node with a degree higher than a predefined threshold in the network data is filtered. The person skilled in the art can set a different predefined threshold according to the particular dataset, and the predefined threshold can be an absolute value of the degree. Alternatively, a certain percentage of nodes can be filtered. In particular, the degree distribution of all the nodes in the network data is calculated and, preferably, the degrees of all the nodes are sorted in ascending or descending order. The degree of any node from a certain percentage range (preferably, the first 5.5%-1%) of the highest-degree nodes is then selected as the predefined threshold.
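The percentage-based threshold selection of step 203 can be sketched in Python as follows. This is only an illustrative sketch: the function names `node_degrees` and `degree_threshold`, the parameter `top_fraction`, and the choice of taking the degree of the first node just outside the top fraction are assumptions, not details given in the description.

```python
from collections import defaultdict


def node_degrees(edges):
    """Return {node: degree} for an undirected edge list."""
    degree = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return dict(degree)


def degree_threshold(degrees, top_fraction):
    """Pick a threshold so that roughly the top `top_fraction` of nodes exceed it.

    Assumption: the threshold is the degree of the first node that falls
    outside the top fraction when degrees are sorted in descending order.
    """
    ordered = sorted(degrees.values(), reverse=True)
    k = max(1, int(len(ordered) * top_fraction))
    return ordered[k] if k < len(ordered) else ordered[-1]


# Toy graph: one hub "h" linked to six nodes, plus one extra edge.
edges = [("h", i) for i in range(6)] + [(0, 1)]
degrees = node_degrees(edges)
threshold = degree_threshold(degrees, top_fraction=0.15)
to_filter = [n for n, d in degrees.items() if d > threshold]
```

With this rule, nodes whose degree merely equals the threshold are kept, which matches the "higher than a predefined threshold" wording of the step.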
In step 205, the filtered node and its neighborhood relationship are stored. In this step, the neighborhood relationship is represented by a set of nodes adjacent to the filtered node. For example, if a node V16 is adjacent to nodes V15, V18, V19, V17 and V12 and the node V16 is filtered, the node V16 and its neighborhood relationship V15, V18, V19, V17 and V12 can be stored. They can be stored in a memory or in a non-volatile memory medium.
In step 207, the filtered network data is clustered to obtain primary group(s). In this step, the network data, which is represented by the nodes and the lines, is clustered into groups. The person skilled in the art can select any suitable clustering algorithm according to the particular data to obtain the primary group(s). For example, for community discovery, the methods proposed in reference document [1], or in reference document [2], Fábio Protti, Felipe M. G. Franca, Jayme Luiz Szwarcfiter, On Computing All Maximal Cliques Distributedly, Proceedings of the 4th International Symposium on Solving Irregularly Structured Problems in Parallel, 1997 (expressly incorporated herein by reference in its entirety for all purposes), can be used.
In step 209, a final group is obtained based on the filtered node and its neighborhood relationship and the primary group(s). In this step, the primary group(s) associated with the filtered node are determined based on the neighborhood relationship of the filtered node, and it is then further determined whether the filtered node belongs to one or more of those primary group(s), to finally obtain the final group.
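Steps 203 and 205 together can be sketched as a single pass over an adjacency structure. The sketch below assumes an undirected graph held in memory as a dictionary of neighbor sets; the function name `filter_high_degree` and the threshold value 4 used in the toy data are hypothetical and chosen only for illustration.

```python
def filter_high_degree(adjacency, threshold):
    """Split a graph into (filtered graph, stored neighborhoods).

    `stored` maps each removed high-degree node to the set of its
    neighbors, i.e. the neighborhood relationship kept for later use.
    """
    stored = {
        node: set(neighbors)
        for node, neighbors in adjacency.items()
        if len(neighbors) > threshold
    }
    remaining = {
        node: {m for m in neighbors if m not in stored}
        for node, neighbors in adjacency.items()
        if node not in stored
    }
    return remaining, stored


# Toy data following the V16 example in the text; the V15-V18 edge is made up.
adjacency = {
    "V16": {"V15", "V18", "V19", "V17", "V12"},
    "V15": {"V16", "V18"},
    "V18": {"V16", "V15"},
    "V19": {"V16"},
    "V17": {"V16"},
    "V12": {"V16"},
}
remaining, stored = filter_high_degree(adjacency, threshold=4)
```

Here `stored` holds V16 with its five neighbors, and `remaining` is the filtered network data passed on to clustering.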
In step 303, it is determined whether the filtered node belongs to the primary group(s). Preferably, an average degree of the nodes in the primary group(s) is calculated, where the average degree is the sum of the degrees of all the nodes in the primary group(s) divided by the number of nodes in the primary group(s). An actual association degree of the filtered node with respect to the nodes in the primary group(s) is also calculated, where the actual association degree is the number of lines between the filtered node and the nodes in the primary group(s). It is then determined whether the actual association degree is larger than the average degree, and in response to determining that the actual association degree is larger than the average degree, it is determined that the filtered node belongs to the primary group(s). Of course, the person skilled in the art may conceive other embodiments for determining whether the filtered node belongs to the primary group(s), depending on the application.
In step 305, in response to determining that the filtered node belongs to the primary group(s), the filtered node is merged into the primary group(s).
In step 307, it is judged whether all the filtered nodes have been traversed, and if any filtered node has not been processed, steps 303-305 are repeated.
In step 309, in response to merging all the filtered nodes into their corresponding primary group(s), the primary group(s) are regarded as the final group(s).
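The loop of steps 303 to 309 can be sketched as follows. This is a minimal sketch under stated assumptions: the helper names are hypothetical, node degrees are taken from the filtered graph (the description does not say which graph the degrees are measured in), and the comparison always uses the original primary groups so that the result does not depend on the order in which filtered nodes are processed.

```python
def average_degree(group, filtered_adjacency):
    """Sum of member degrees divided by the number of members."""
    return sum(len(filtered_adjacency[n]) for n in group) / len(group)


def association_degree(neighbors, group):
    """Number of lines between the filtered node and members of the group."""
    return len(neighbors & group)


def merge_filtered_nodes(primary_groups, stored, filtered_adjacency):
    """Return final groups after merging each filtered node where it belongs."""
    final = {gid: set(members) for gid, members in primary_groups.items()}
    for node, neighbors in stored.items():            # step 307: visit every filtered node
        for gid, members in primary_groups.items():
            if not neighbors & members:               # group not mapped to this node
                continue
            actual = association_degree(neighbors, members)
            if actual > average_degree(members, filtered_adjacency):  # step 303
                final[gid].add(node)                  # step 305
    return final                                      # step 309


# Toy data: a path a-b-c forms g1, an edge d-e forms g2, "h" was filtered.
filtered_adjacency = {
    "a": {"b"}, "b": {"a", "c"}, "c": {"b"},
    "d": {"e"}, "e": {"d"},
}
primary_groups = {"g1": {"a", "b", "c"}, "g2": {"d", "e"}}
stored = {"h": {"a", "b", "d"}}
final_groups = merge_filtered_nodes(primary_groups, stored, filtered_adjacency)
```

In this toy case "h" has two lines into g1, whose average degree is 4/3, so it is merged into g1; it has one line into g2, whose average degree is 1, so it is not merged there.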
1) calculating a predefined threshold for filtering: statistically calculating the degree of each node, ordering the degrees, and using the first 1% of them to determine the predefined threshold for filtering, the predefined threshold of the graph (map) being 5;
2) discovering that the degree of the node V16 in the graph (map) is larger than 5 (the degree of V16 being 6), and thus saving the node V16 and its neighborhood relationship {V15, V18, V19, V17, V12 and V17};
3) performing community discovery on all the nodes except the node V16, by using the method described in the reference document [2]. Its basic concept is that, in each round of iteration, the similarity between every two points within two hops (jumps) of each other is determined; two points which are similar but not connected by a line are connected with a line, and two points which are not similar but are connected by a line are disconnected; when the variation of the network topology is less than a certain threshold, the iteration ends; otherwise, the iteration proceeds to the next round. Only a brief description of the method of the reference document [2] is given here, and the details can be found in the reference document itself. The network as shown in
4) using the results stored in 2), according to the neighborhood of V16, it is found that the above 3 primary groups G1, G2 and G3 all include nodes adjacent to V16, so the node V16 could belong to the three primary groups G1, G2 and G3; and
5) calculating the average degrees of G1, G2 and G3 respectively. The average degrees of G1, G2 and G3 are 1.5, 1.6 and 0.7, while the actual association degrees of the node V16 with G1, G2 and G3 are 1, 3 and 2 respectively. Since the actual association degrees of the node V16 with G2 and G3 are larger than the average degrees of G2 and G3, the node V16 is merged into G2 and G3, to form the final group result as shown in
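The comparison made in item 5) can be checked with a few lines of Python. Only the numbers stated in the example are used; the variable names are illustrative.

```python
# Average degree of each primary group, as stated in the example.
average = {"G1": 1.5, "G2": 1.6, "G3": 0.7}
# Actual association degree of V16 with each primary group.
association = {"G1": 1, "G2": 3, "G3": 2}

# V16 is merged into every group where its association degree
# exceeds the group's average degree.
merged_into = [g for g in average if association[g] > average[g]]
print(merged_into)  # ['G2', 'G3']
```

V16 is not merged into G1 because 1 is not larger than 1.5.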
Each particular embodiment of the invention is applicable to various implementation platforms, such as network data clustering realized on a single machine, or network data clustering realized on parallel computing platforms such as MapReduce and MPI.
To realize community discovery, the basic data structure of the network in MapReduce is a “two hop adjacency list”: each row uses a node as its key, and the value holds the adjacency table of that node together with the adjacency table of each node in that adjacency table. The similarities of the node with respect to all the nodes in the two hop adjacency list are also stored in the value, and a value field is reserved for storing information such as marks. For example, the two hop adjacency list of a node A is A-C (A, B, D), B (A, C), in which the one-hop (one-jump) neighbors of A are B and C, the one-hop (one-jump) neighbors of B are A and C, and the one-hop (one-jump) neighbors of C are A, B and D. This data structure facilitates realization of the main clustering method described in the reference document [1].
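The two hop adjacency list can be sketched as a nested dictionary built from an ordinary adjacency table. The function name `two_hop_adjacency` is hypothetical, and the similarity values and the reserved mark field mentioned in the text are omitted here for brevity.

```python
def two_hop_adjacency(adjacency):
    """Map each node to {one-hop neighbor: sorted neighbors of that neighbor}."""
    return {
        node: {nbr: sorted(adjacency[nbr]) for nbr in sorted(neighbors)}
        for node, neighbors in adjacency.items()
    }


# The node A example from the text, with D added so the table is closed.
adjacency = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}
rows = two_hop_adjacency(adjacency)
print(rows["A"])  # {'B': ['A', 'C'], 'C': ['A', 'B', 'D']}
```

The row for A corresponds to the text's notation A-C (A, B, D), B (A, C): the key is A, and the value lists each one-hop neighbor with its own neighbors.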
During a preprocessing stage, one MapReduce job marks the nodes with degrees larger than a designated threshold (computing the degree is easily realized by one Map task, since each node stores an adjacency table and the degree is the number of members in the adjacency table), and the marked data is used as the input to a “filter” and a “large degree node collector.”
During the main algorithm stage, the input is the set of two hop adjacency lists (two-jump adjacency matrix) of the nodes output by the filter, i.e., the nodes with degrees less than the designated threshold. According to the main clustering method in the reference document [1], several rounds of iteration are performed to update the topology; each round uses a similarity calculator to obtain the similarities between nodes and a topology updater to update the topology; when the topology variation is less than the designated threshold, the iteration ends and the main algorithm in the reference document [1] is completed.
During a post-processing stage, after the main algorithm is completed, a Connected Component Calculator is called to obtain the community corresponding to each node. In this regard, reference is made to X-RIME: Hadoop based large scale social network analysis, project available from SourceForge, expressly incorporated herein by reference in its entirety for all purposes, and in particular to a Weakly Connected Component implemented in X-RIME. At this time, a “group degree calculator” is called to calculate the average degree of each group. The input key of the “group degree calculator” is the node and the input value is the group number; the output key is the group, and the output value is the average degree of the group together with the set of included nodes. Both the output (output 1) of the group degree calculator and the output (output 2) of the “large degree node collector” are used as the input of a “group selector,” and the output of the “group selector” is the potential group(s) of the filtered node. During a Map stage, the “group selector” sends a {group, filtered node} key-value pair message for each neighbor of the filtered node according to the adjacency table of the filtered node. For example, if a node V has neighbors V1, V2, V3, V4 and V5, where V1 and V2 are grouped into g1 and V3, V4 and V5 are grouped into g2, the “group selector” sends two <g1, V> messages to a reducer with g1 as a key and three <g2, V> messages to a reducer with g2 as a key. The number of messages corresponding to V received in each group thus indicates the number of neighbors of the node in that group, and this number is recorded as a label L. Further, a group clustering device may use the label L and the previously calculated group average degree to determine whether V really belongs to this group, and to finally obtain the final group result.
It should be understood that the above embodiments have been discussed with respect to a large-scale network, but embodiments of the invention are also applicable to networks of normal scale, with corresponding gains. The person skilled in the art may also extend the method of the invention to other physical network data (such as sensor network(s) and so on) according to his or her professional knowledge, and adaptively modify the various embodiments of the invention accordingly.
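The message counting performed by the “group selector” can be simulated in a single Python process. This sketch does not use the Hadoop API; the function names `group_selector_map`, `group_selector_reduce` and `belongs` are hypothetical, and the group average degree passed to `belongs` is assumed to come from the “group degree calculator.”

```python
from collections import Counter


def group_selector_map(filtered_node, neighbors, group_of):
    """Map stage: emit one (group, filtered node) message per grouped neighbor."""
    for neighbor in neighbors:
        if neighbor in group_of:
            yield (group_of[neighbor], filtered_node)


def group_selector_reduce(messages):
    """Reduce stage: count messages per (group, filtered node); the count is label L."""
    return Counter(messages)


def belongs(label, group_average_degree):
    """Decision used by the group clustering device in this sketch."""
    return label > group_average_degree


# The example from the text: V1, V2 are in g1 and V3, V4, V5 are in g2.
group_of = {"V1": "g1", "V2": "g1", "V3": "g2", "V4": "g2", "V5": "g2"}
messages = list(group_selector_map("V", ["V1", "V2", "V3", "V4", "V5"], group_of))
labels = group_selector_reduce(messages)
print(labels[("g1", "V")], labels[("g2", "V")])  # 2 3
```

In a real MapReduce job the shuffle phase would route all messages with the same group key to one reducer; the `Counter` here stands in for that grouping and counting.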
Preferably, the final grouping means 809 includes: a mapping means, configured to, based on the stored neighborhood relationship, establish a mapping between the filtered node and the primary group(s); a judging means, configured to determine whether the filtered node belongs to the primary group(s); and a merging means, configured to, in response to determining the filtered node belongs to the primary group(s), merge the filtered node into the primary group(s).
Preferably, the final grouping means 809 further includes: a final group determining means, configured to, in response to merging all the filtered nodes into their corresponding primary group(s), regard the primary group(s) as the final group.
Preferably, the computer system 800 further comprises: a new grouping means, configured to cluster subnetwork data composed by the filtered nodes to form a new group; and an incorporating means, configured to incorporate the new group into the final group.
Preferably, the computer system 800 further comprises: a statistically-calculating means, configured to statistically calculate degree distribution of all the nodes in the network data; and a predefined threshold determining means, configured to select a degree of any node from a certain percentage range (preferably, the first 5.5%-1%) of nodes with high degrees in all the nodes, as the predefined threshold.
Preferably, the neighborhood relationship is represented by a set of nodes adjacent to the filtered node.
Preferably, the mapping means includes: a primary group determining means, configured to determine the primary group(s) including at least one node in the neighborhood relationship of the filtered node; and an associating means, configured to associate the filtered node with the determined primary group(s).
Preferably, the judging means includes: an average degree calculating means, configured to calculate an average degree of the nodes in the primary group(s); an actual association degree calculating means, configured to calculate an actual association degree of the filtered node with the nodes in the primary group(s); a comparing means, configured to determine whether the actual association degree is larger than the average degree; and a determining means, configured to, in response to determining that the actual association degree is larger than the average degree, determine that the filtered node belongs to the primary group(s).
Preferably, the computer system 800 is configured on a MapReduce computing platform.
The function of each component in
Although the computer system described in
The invention can also be realized as a computer program product used by the computer system in
In view of the discussion of
Although the invention is described with reference to the preferred embodiments of the invention, it will be obvious to the person skilled in the art that, without departing from the spirit and scope of the invention defined by the appended claims, various modifications in form and detail can be made to the invention.
Yang, Bo, Shi, Ju Wei, Xue, Wei, Wang, Wen Jie
Patent | Priority | Assignee | Title |
7466663, | Oct 29 2002 | Inrotis Technology, Limited | Method and apparatus for identifying components of a network having high importance for network integrity |
7818272, | Jul 31 2006 | MICRO FOCUS LLC | Method for discovery of clusters of objects in an arbitrary undirected graph using a difference between a fraction of internal connections and maximum fraction of connections by an outside object |
20050021531, | |||
20090315890, | |||
20100022752, | |||
20100063973, | |||
20100309206, | |||
20100313205, | |||
20120143882, | |||
CN101661482, | |||
CN101944045, |
Executed on | Assignor | Assignee | Conveyance | Frame | Reel | Doc |
Mar 29 2012 | International Business Machines Corporation | (assignment on the face of the patent) | / | |||
Apr 02 2012 | SHI, JU WEI | International Business Machines Corporation | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 028582 | /0802 | |
Apr 21 2012 | WANG, WEN JIE | International Business Machines Corporation | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 028582 | /0802 | |
Apr 23 2012 | XUE, WEI | International Business Machines Corporation | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 028582 | /0802 | |
Jun 08 2012 | YANG, BO | International Business Machines Corporation | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 028582 | /0802 |