A computer-implemented method for determining a relevance of a node in a network. A digital representation of a local neighborhood structure of the node in the network is obtained in a computer-readable non-volatile memory. A numerical value characteristic of the node's relevance is determined, and output to a user. The numerical value is determined based on the neighborhood structure of the node.

Patent: 10579684
Priority: Jan 31 2014
Filed: Jan 30 2015
Issued: Mar 03 2020
Expiry: Feb 02 2037
Extension: 734 days
Entity status (original assignee): Small
1. A computer-implemented method for determining a relevance of a particular node in a computer network, the method comprising:
obtaining a digital representation of a local neighborhood structure of the particular node in the computer network in a computer-readable non-volatile memory;
determining a numerical value characteristic of the particular node's relevance, and
outputting the numerical value to a user,
characterized in that the numerical value is determined based on the neighborhood structure of the particular node, and wherein
the numerical value is determined based on degrees of all sets of nodes in the computer network reachable from the particular node using at most k edges, where k is a fixed preset positive integer number, and based on a degree determined for each enumerated set, wherein the degree of a set is determined as a number of edges connecting nodes inside the set to nodes outside.
7. A method for monitoring a computer network, the method comprising:
obtaining a digital representation of a local neighborhood structure of a particular node in the computer network in a computer-readable non-volatile memory;
determining a numerical value characteristic of the particular node's relevance;
monitoring the particular node, if said particular node is relevant according to a given measure; and
outputting the numerical value to a user,
characterized in that the numerical value is determined based on the neighborhood structure of the particular node, wherein
the numerical value is determined based on degrees of all sets of nodes in the computer network reachable from the particular node using at most k edges, where k is a fixed preset positive integer number, and based on a degree determined for each enumerated set, wherein the degree of a set is determined as a number of edges connecting nodes inside the set to nodes outside.
9. A method for controlling an element in a computer network, the method comprising:
selecting a network element in a computer network;
obtaining a digital representation of a local neighborhood structure of the selected network element in the computer network in a computer-readable non-volatile memory;
determining a relevance of the selected network element;
generating a control signal, based on the determined relevance; and
sending the control signal to the selected network element,
characterized in that the relevance is determined based on said local neighborhood structure of the selected network element is determined as a numerical value determined based on degrees of all sets of nodes in the computer network reachable from the network element using at most k edges, where k is a fixed preset positive integer number, and based on a degree determined for each enumerated set, wherein the degree of a set is determined as a number of edges connecting nodes inside the set to nodes outside.
8. A method for searching a computer network, the method comprising:
preparing an index which associates one or more keywords with a page reference (URL);
assigning a numerical relevance value to each page;
receiving one or more keywords from a user;
using the index to select a list of page references, based on the keywords;
ordering the list by the numerical relevance value of the page references, at least in part; and
outputting the list to a user,
wherein said numerical relevance value for each particular page is determined by:
obtaining a digital representation of a local neighborhood structure of the particular page in the computer network in a computer-readable non-volatile memory;
determining a numerical value characteristic of the particular page's relevance, and
outputting the numerical value to a user,
characterized in that the numerical value is determined based on the neighborhood structure of the particular page, and wherein the numerical value is determined based on degrees of all sets of pages in the computer network reachable from the particular page using at most k edges, where k is a fixed preset positive integer number, and based on a degree determined for each enumerated set, wherein the degree of a set is determined as a number of edges connecting pages inside the set to pages outside.
2. The method according to claim 1, wherein the number of edges k is equal to 2.
3. The method according to claim 1, wherein the number of edges k is equal to 3.
4. The method of claim 1, wherein the numerical value is obtained as a real number.
5. The method of claim 2, wherein the degree is combined with a logarithm of the degree.
6. The method of claim 2, wherein the numerical value is scaled by a logarithm of twice the particular node's degree.
10. The method according to claim 9, wherein the control signal is sent to the selected network element.
11. The method according to claim 9, wherein the control signal is sent to at least one neighbour element of the selected element.
12. The method according to claim 9, wherein the selected network element is a neighbour of the network element to which the control signal is sent.
13. The method according to claim 10, wherein the control signal is generated, based on the determined relevance.
14. The method according to claim 9, further comprising comparing the relevance with a current or projected load and changing connectivity/capacity of the network element, if a certain threshold is surpassed.
15. A network controller, adapted to execute a method according to claim 9.

This application is related to and claims priority from: (1) U.S. Provisional Patent Application No. 61/933,938, filed Jan. 31, 2014; and (2) European Patent Application No. EP 14 153 465.1, filed Jan. 31, 2014, the entire contents of both of which are hereby fully incorporated herein by reference for all purposes.

News and rumors spreading on social media, the spread of political opinions, the uptake of business and social innovations, the impact of disease outbreaks: society is increasingly characterized by diffusive/epidemic processes on networks, and the infrastructure networks which enable such communication are also of increasing importance. However, one cannot easily or reliably measure the relevance of individual network nodes to these processes. The larger the network is (and many of today's networks comprise more than 1 billion nodes), the more important and the more difficult this task becomes.

The current state of the art is limited to measures designed to identify the most highly influential network nodes, the so-called centrality indicators such as degree, eigenvalue centrality, or k-shell. For example, Google uses the PageRank centrality to identify the most relevant web page for a given search term. But such measures are strongly limited: they are only informative for the top 1/10th of 1 percent of nodes; they only rank, but do not quantify, node relevance; and their accuracy depends on the topology of the network, how this topology is sampled, and the type of diffusive process.

It is therefore an object of the invention to provide an automatic method for determining a network node's individual relevance that is easy to determine and correlates strongly with a node's actual relevance in the above sense. It is a further object of the invention to apply the method in the context of network monitoring, search and control.

These objects are achieved by the computer-implemented methods and a device according to the independent claims. Advantageous embodiments are defined in the dependent claims.

The relevance or expected force (ExF) of a node in a network determined by the invention measures the contribution of a node to overall network flows. As the relevance is estimated based on a local neighborhood structure, i.e. the nodes and edges in the neighborhood of the node, it can be determined very efficiently.

Alternatively, the relevance of a node can also be characterized in terms of outcomes of a spreading process starting from that node. An outcome of a spreading process may be determined as a set of network nodes infected by the spreading process, i.e. a set of nodes reachable from the node of interest using k edges (transmission clusters of size k), where k is a fixed preset number.

The estimation of the relevance may further be based on the number of edges of the network connecting a node infected by the spreading process and a non-infected node, i.e. the numerical value may be based on the degree of each cluster in the above described enumeration. The estimation may further be based on individual weights assigned to the edges. The spreading process may comprise two or three transmission events, i.e. k may be equal to 2 or 3.

The distribution of cluster degree values may be summarized by their entropy, i.e. the (possibly normalized) number of edges may be combined with its logarithm. The estimate may be combined with or scaled by the logarithm of twice the node's degree or some other constant factor.

The invention also comprises a method for modifying a network or at least a representation of a network, comprising the steps of obtaining a computer-readable representation of the network; determining a relevance of one or several nodes of the network; and modifying the network, based on the determined relevance.

The invention also comprises a network, comprising nodes and connections, wherein at least one node is adapted to determine its own relevance according to the above-described methods.

A network according to the invention may be an electricity grid, a mobile phone network, a telecom and internet routing network, a public wireless network, a road-transportation network or the like, or computer-readable representations of networks, like a social graph of an online communications network.

As the inventors found when applying the inventive method to real-world test cases, the determined numerical value strongly correlates with a node's importance, while it does not involve complex computations. A specific diffusion process (e.g. for different diseases) does not need to be specified. The outcome of the diffusion process is not important; the advantage is that the importance of each single node may be quantified. A ranking method building on this ability, according to an embodiment of the invention, provides a way to identify those nodes or elements of a technical structure that are most important in view of diffusion processes or have no importance at all. Queries on the overall influence of an individual node can be answered in real time, due to the simplicity of the determination.

The invention's dependence only on local topology is an invaluable asset. In the above cases, the actual underlying network is only known through incomplete and biased observation. Close observation can improve the accuracy of the network representation, but such observation is costly and must be rationed. The invention allows, firstly, better prioritization of such resources and, further, since closer observation is expected to give a more accurate picture of the section of the network investigated, better estimates of the true importance of the nodes investigated.

The measure is more accurate and more stable than any existing measure. Comparisons were made to the eigenvalue centrality, the k-shell, and the accessibility, representing both established and cutting-edge state-of-the-art metrics. Accuracy is measured in terms of linear correlation between the metric and the outcome of a spreading process, and is assessed for three types of spreading processes in continuous and discrete time over five families of random networks and twenty-four real-world networks. The expected force has correlations exceeding 0.85 in almost all cases, outperforming the other measures by a wide and statistically significant margin. Stability is observed in that the variation/standard error for the expected force is smaller than for the other measures, and that the correlations are equally strong regardless of the structure of the network. The performance of the remaining measures varies with network structure and with the type of epidemic process simulated.

In addition to high predictive accuracy, the expected force is rapidly computable. Benchmarking tests suggest that computational times are near linear in the number of nodes when run on a single processor. More importantly, since the expected force relies only on local information, it can be computed in a massively parallel fashion, and is robust against incomplete sub sampling of the network.

Finally, as the expected force again relies only on local topology, it is suitable for dynamic and/or unknown networks.

The eigenvalue centrality is known to be highly unstable to network perturbations, as is the k-shell. As path counting metrics are expressed in terms of the adjacency matrix, they cannot be computed when the adjacency matrix is not fully specified. As no real-world network is fully known or specified, their practical value is limited.

These and further aspects of the invention are described in more detail in the following description of various embodiments, in connection with the drawing, in which

FIG. 1 shows a schematic flowchart 100 of a method for determining the relevance of a node in a network according to an embodiment of the invention.

FIG. 2 is a schematic illustration of how the expected force is determined from the possible outcomes of two transmissions.

FIG. 3 shows a correlation of spreading power metrics to epidemic outcomes on simulated networks.

FIG. 4 shows a correlation of spreading power metrics to epidemic outcomes on real networks.

FIG. 5 illustrates how spreading power is a factor of a node's first and second order degree.

FIG. 6 shows an image of a graph comprising nodes and edges.

FIG. 7 shows a diagram wherein the (logarithm of the) time to saturation of a spreading process starting from a node is plotted against the spreading power/relevance/expected force of that node.

FIG. 8 shows a diagram wherein the per-round infection probability is plotted against the expected force of network nodes, e.g. airports.

FIG. 9 shows a schematic diagram of a system 900 for monitoring a network according to another embodiment of the invention.

FIG. 10 shows a schematic flowchart 1000 of a method for searching a network based on a relevance of a node in the network according to a different embodiment of the invention.

FIG. 11 shows a network with a network controller used for controlling a network element, based on a relevance score according to an embodiment of the invention.

FIG. 1 shows a schematic flowchart 100 of a method for determining the relevance of a node in a network according to an embodiment of the invention.

In step 110, a computer-readable representation of the network is obtained. In step 120, a node is selected. In step 130, all possible clusters 1, . . . J of infected nodes after X transmission events are enumerated, assuming no recovery.

For X=2, these include all combinations of i plus two nodes at distance one from i, and i plus one node at distance one and one at distance two (within the limits of the local network topology). Each cluster is counted once for each way it can form. For example, a cluster of two nodes connected to i but not each other could form in two ways. If they are connected to each other, the cluster could form in four ways. In step 140, the degree of a cluster of nodes is determined as the number of edges connecting nodes inside the cluster to nodes outside.

In step 150, the normalized sequence

$$(\bar{d}_1, \ldots, \bar{d}_J) = \frac{1}{\sum_{j=1}^{J} d_j}\,(d_1, \ldots, d_J)$$

is formed, where $d_j$ is the degree of cluster j. Then the expected force of node i is determined in step 160 as

$$\mathrm{ExF}(i) = -\sum_{j=1}^{J} \bar{d}_j \log(\bar{d}_j) \qquad (1)$$

Finally, the expected force is output in step 170.
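Steps 130 to 160 can be illustrated with a short sketch. It is not the inventors' implementation: it assumes an unweighted, undirected graph stored as a dictionary mapping each node to the set of its neighbors, it uses the natural logarithm, and the function names `cluster_degree` and `expected_force` are chosen here for illustration only. Each ordered sequence of transmissions contributes one cluster degree, so a cluster is counted once for each way it can form.

```python
import math


def cluster_degree(adj, cluster):
    """Number of edges joining a node inside `cluster` to a node outside it."""
    return sum(1 for u in cluster for v in adj[u] if v not in cluster)


def expected_force(adj, seed, transmissions=2):
    """Entropy of the normalized cluster degrees after `transmissions` events.

    `adj` maps each node to the set of its neighbors (assumed undirected and
    unweighted). Every ordered sequence of transmissions is enumerated, so a
    cluster appears once for each way it can form.
    """
    degrees = []

    def spread(cluster, remaining):
        if remaining == 0:
            degrees.append(cluster_degree(adj, cluster))
            return
        # Each edge from an infected node to a susceptible node is one
        # possible next transmission event.
        for u in cluster:
            for v in adj[u]:
                if v not in cluster:
                    spread(cluster | {v}, remaining - 1)

    spread(frozenset([seed]), transmissions)
    total = sum(degrees)
    if total == 0:
        # No cluster has an outgoing edge (or none could form at all).
        return 0.0
    return -sum((d / total) * math.log(d / total) for d in degrees if d > 0)


# Example graph: a triangle a-b-c with a pendant node d attached to c.
example = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
exf_a = expected_force(example, "a")
```

In the example graph, two transmissions from seed a can occur in five ways. Four of them end in the triangle {a, b, c}, which has cluster degree 1, and one ends in {a, c, d}, which has cluster degree 2. The expected force of a is the entropy of the normalized sequence (1, 1, 1, 1, 2)/6.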

Preliminary investigations found that X=2 is already sufficient for predictive purposes.

One modification may be in order for SIS/SIR processes, inspired by the following. Imagine a node with degree one connected to a hub. While such a node will have a high expected force, its chance of realizing this force depends on transmitting to the hub before recovering. In networks where such nodes are common, it may be helpful to account for this factor by scaling node ExF by the log of twice the node's degree,
ExFM(i)=log(2 deg(i))ExFX(i)  (2)
multiplication by two being necessary since the log of one is zero.
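The scaling in equation (2) can be written as a one-line helper. This is a minimal sketch that assumes the natural logarithm and takes an already computed expected-force value as input; the name `exf_m` is illustrative and does not come from the source.

```python
import math


def exf_m(exf, degree):
    """Scale an expected-force value by log(2 * degree), as in equation (2)."""
    if degree < 1:
        raise ValueError("degree must be at least 1")
    return math.log(2 * degree) * exf
```

For a node of degree one, the factor is log(2), which is about 0.69, so the score is damped but not reduced to zero.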

FIG. 2 is a schematic illustration of how the expected force is determined from the possible outcomes of two transmissions. In the example (sub) network above, the network will be in one of eight possible states after two transmissions from the seed node (red). Two are illustrated above, where the seed has transmitted to the two orange nodes along the solid black edges.

Each given state has an associated number of (dashed orange) edges to susceptible nodes (blue), the cluster degree. States containing two neighbors of the seed (panel a) can form in two ways or, if they are part of a triangle, four ways. In this example, the two transmissions can occur in thirteen possible ways. The expected force is the entropy of the (normalized) cluster degree.

FIG. 3 shows a correlation of spreading power metrics to epidemic outcomes on simulated networks. Violin plots show the distribution of observed correlation values for each spreading process outcome in each network family. The expected force and ExFM (orange shades) are consistently strong, with mean correlations greater than 0.85 and small variance. The other measures (k-shell, eigenvalue centrality, and accessibility, blue-green shades) show both lower mean values and higher variance, as seen in the position and vertical spread of their violins. Each violin summarizes correlations computed on 100 simulated networks. Spreading processes (x axis) are suffixed to indicate simulations in continuous (−C) or discrete (−D) time. The epidemic outcome for SI processes is the time until half the network is infected. For SIS and SIR processes it is the probability that an epidemic is observed. Table 1 (below) shows the numbers for the graph in FIG. 3.

Table 1 (numbers for FIG. 3): Mean correlations between node spreading power metrics and epidemic outcomes on each type of spreading process on the simulated networks, by network model. Shown are the mean and standard error in correlations measured on one hundred networks from each family. Spreading processes are suffixed to indicate simulations in continuous (−C) or discrete (−D) time. Epidemic outcomes are time to half coverage for SI processes and epidemic potential in the remaining processes.

TABLE 1
Process   Expected force   Accessibility    Eigenvalue centrality   k-shell
Pareto
SI        0.84 +/− 0.04    0.66 +/− 0.05    0.38 +/− 0.06           0.76 +/− 0.05
SIS-C     0.93 +/− 0.02    0.78 +/− 0.05    0.53 +/− 0.08           0.77 +/− 0.05
SIS-D     0.94 +/− 0.02    0.78 +/− 0.06    0.51 +/− 0.09           0.79 +/− 0.05
SIR-C     0.91 +/− 0.02    0.71 +/− 0.05    0.43 +/− 0.08           0.82 +/− 0.03
SIR-D     0.87 +/− 0.14    0.68 +/− 0.12    0.40 +/− 0.09           0.82 +/− 0.14
Amazon
SI        0.87 +/− 0.02    0.84 +/− 0.02    0.44 +/− 0.05           0.88 +/− 0.02
SIS-C     0.95 +/− 0.01    0.91 +/− 0.02    0.63 +/− 0.06           0.72 +/− 0.05
SIS-D     0.95 +/− 0.01    0.92 +/− 0.03    0.59 +/− 0.06           0.74 +/− 0.06
SIR-C     0.92 +/− 0.02    0.88 +/− 0.03    0.53 +/− 0.05           0.82 +/− 0.04
SIR-D     0.90 +/− 0.02    0.87 +/− 0.04    0.46 +/− 0.06           0.87 +/− 0.04
Internet
SI        0.82 +/− 0.03    0.77 +/− 0.08    0.37 +/− 0.04           0.73 +/− 0.04
SIS-C     0.92 +/− 0.03    0.61 +/− 0.09    0.65 +/− 0.04           0.95 +/− 0.01
SIS-D     0.85 +/− 0.03    0.45 +/− 0.08    0.82 +/− 0.04           0.89 +/− 0.03
SIR-C     0.92 +/− 0.02    0.62 +/− 0.09    0.66 +/− 0.04           0.95 +/− 0.01
SIR-D     0.89 +/− 0.03    0.60 +/− 0.09    0.60 +/− 0.04           0.98 +/− 0.01
Astrophysics
SI        0.81 +/− 0.02    0.51 +/− 0.07    0.36 +/− 0.03           0.6 +/− 0.04
SIS-C     0.92 +/− 0.01    0.31 +/− 0.05    0.71 +/− 0.02           0.95 +/− 0.01
SIS-D     0.85 +/− 0.02    0.2 +/− 0.04     0.86 +/− 0.03           0.96 +/− 0.01
SIR-C     0.92 +/− 0.01    0.31 +/− 0.05    0.71 +/− 0.02           0.95 +/− 0.01
SIR-D     0.89 +/− 0.01    0.29 +/− 0.05    0.67 +/− 0.03           0.97 +/− 0.01
Facebook
SI        0.83 +/− 0.02    0.43 +/− 0.1     0.38 +/− 0.02           0.61 +/− 0.04
SIS-C     0.9 +/− 0.02     0.22 +/− 0.05    0.73 +/− 0.02           0.95 +/− 0.01
SIS-D     0.82 +/− 0.02    0.14 +/− 0.04    0.87 +/− 0.02           0.97 +/− 0.01
SIR-C     0.9 +/− 0.02     0.22 +/− 0.05    0.73 +/− 0.02           0.95 +/− 0.01
SIR-D     0.87 +/− 0.02    0.2 +/− 0.05     0.7 +/− 0.03            0.97 +/− 0.01

FIG. 4 shows a correlation of spreading power metrics to epidemic outcomes on real networks. Point and error bar plots show the observed correlation and 95% confidence interval between each measure and spreading process outcome on the 24 real networks. The expected force and ExFM (orange shades) show strong performance, consistently outperforming the other metrics (k-shell, eigenvalue centrality, and accessibility when computed, blue-green shades). The epidemic outcome for SI processes is the time until half the network is infected. For SIS and SIR processes it is the probability that an epidemic is observed.

The suffix "-D" indicates spreading processes simulated in discrete time. Individual panels are given as separate (larger) figures in Supplementary Figures FOO to BAR.

Tables 2 and 3 (below) show numbers for FIG. 4.

Table 2 (numbers for FIG. 4): Correlation between spreading power metrics and time to half coverage in real-world networks. Shown is the estimated correlation from 1,000 nodes on the given network, along with the 95% confidence bounds of the estimate. Accessibility is not measured for networks with more than 25,000 nodes.

TABLE 2
Expected force accessibility eigenvalue centrality k-shell
PGPgiantcompo 0.69 +/− 0.03 0.58 +/− 0.04 0.19 +/− 0.06 0.43 +/− 0.05
amazon0302 0.54 +/− 0.04 0.15 +/− 0.06 0.30 +/− 0.06
amazon0601 0.74 +/− 0.03 0.09 +/− 0.06 0.63 +/− 0.04
ca-AstroPh 0.84 +/− 0.02 0.49 +/− 0.05 0.29 +/− 0.06 0.58 +/− 0.04
ca-CondMat 0.84 +/− 0.02 0.53 +/− 0.04 0.26 +/− 0.06 0.65 +/− 0.04
ca-GrQc 0.78 +/− 0.02 0.58 +/− 0.04 0.16 +/− 0.06 0.36 +/− 0.05
ca-HepPh 0.82 +/− 0.02 0.54 +/− 0.04 0.20 +/− 0.06 0.39 +/− 0.05
ca-HepTh 0.78 +/− 0.02 0.56 +/− 0.04 0.05 +/− 0.06 0.47 +/− 0.05
cit-HepPh 0.82 +/− 0.02 0.28 +/− 0.06 0.68 +/− 0.03
cit-HepTh 0.84 +/− 0.02 0.57 +/− 0.04 0.38 +/− 0.05 0.64 +/− 0.04
com-dblp 0.79 +/− 0.02 0.05 +/− 0.06 0.36 +/− 0.05
email-EuAll 0.41 +/− 0.05 0.34 +/− 0.05 0.50 +/− 0.05
email-Uni 0.92 +/− 0.01 0.61 +/− 0.04 0.56 +/− 0.04 0.84 +/− 0.02
facebooklcc 0.86 +/− 0.02 0.19 +/− 0.06 0.59 +/− 0.04
loc-brightkite 0.79 +/− 0.02 0.13 +/− 0.06 0.54 +/− 0.04
loc-gowalla 0.66 +/− 0.03 0.25 +/− 0.06 0.53 +/− 0.04
p2p-Gnutella31 0.94 +/− 0.01 0.72 +/− 0.03 0.53 +/− 0.04 0.92 +/− 0.01
soc-Epinions1 0.80 +/− 0.02 0.33 +/− 0.06 0.47 +/− 0.05
soc-Slashdot0902 0.84 +/− 0.02 0.42 +/− 0.05 0.60 +/− 0.04
soc-sign-epinions 0.81 +/− 0.02 0.29 +/− 0.06 0.47 +/− 0.05
web-Google 0.69 +/− 0.02 0.07 +/− 0.04 0.59 +/− 0.02
web-NotreDame 0.43 +/− 0.05 0.18 +/− 0.06 0.26 +/− 0.06
web-Stanford 0.25 +/− 0.06 0.06 +/− 0.06 0.12 +/− 0.06
wiki-Vote 0.86 +/− 0.02 0.50 +/− 0.05 0.50 +/− 0.05 0.72 +/− 0.03

Table 3 (also numbers for FIG. 4): Correlation between spreading power metrics and epidemic potential in discrete time SIS processes on real world networks. Shown is the estimated correlation from 1,000 nodes on the given network, along with the 95% confidence bounds of the estimate. Accessibility is not measured for networks with more than 25,000 nodes.

TABLE 3
Expected force accessibility eigenvalue centrality k-shell
PGPgiantcompo 0.87 +/− 0.02 0.62 +/− 0.04 0.33 +/− 0.06 0.77 +/− 0.03
amazon0302 0.79 +/− 0.02 0.12 +/− 0.06 0.51 +/− 0.05
amazon0601 0.77 +/− 0.03 −0.01 +/− 0.06  0.68 +/− 0.03
ca-AstroPh 0.94 +/− 0.01 0.51 +/− 0.05 0.41 +/− 0.05 0.83 +/− 0.02
ca-CondMat 0.93 +/− 0.01 0.59 +/− 0.04 0.35 +/− 0.05 0.85 +/− 0.02
ca-GrQc 0.92 +/− 0.01 0.52 +/− 0.05 0.25 +/− 0.06  0.7 +/− 0.03
ca-HepPh 0.92 +/− 0.01 0.51 +/− 0.05 0.28 +/− 0.06 0.55 +/− 0.04
ca-HepTh 0.92 +/− 0.01 0.64 +/− 0.04 0.10 +/− 0.06 0.72 +/− 0.03
cit-HepPh 0.93 +/− 0.01 0.55 +/− 0.04 0.38 +/− 0.05 0.93 +/− 0.01
cit-HepTh 0.93 +/− 0.01 0.71 +/− 0.03 0.57 +/− 0.04 0.90 +/− 0.01
com-dblp 0.90 +/− 0.01 0.08 +/− 0.04 0.55 +/− 0.03
email-EuAll 0.36 +/− 0.05 0.64 +/− 0.04 0.85 +/− 0.02
email-Uni 0.95 +/− 0.01 0.61 +/− 0.04 0.75 +/− 0.03 0.97 +/− 0.00
facebooklcc 0.93 +/− 0.01 0.31 +/− 0.06 0.88 +/− 0.01
loc-brightkite 0.85 +/− 0.02 0.58 +/− 0.04 0.29 +/− 0.06 0.85 +/− 0.02
loc-gowalla 0.68 +/− 0.03 0.51 +/− 0.05 0.89 +/− 0.01
p2p-Gnutella31 0.95 +/− 0.01 0.83 +/− 0.02 0.68 +/− 0.03 0.92 +/− 0.01
soc-Epinions1 0.77 +/− 0.03 0.63 +/− 0.04 0.85 +/− 0.02
soc-Slashdot0902 0.80 +/− 0.02 0.71 +/− 0.03 0.93 +/− 0.01
soc-sign-epinions 0.76 +/− 0.03 0.54 +/− 0.04 0.81 +/− 0.02
web-Google 0.79 +/− 0.02 0.10 +/− 0.06 0.91 +/− 0.01
web-NotreDame 0.73 +/− 0.03 0.34 +/− 0.06 0.49 +/− 0.05
web-Stanford 0.70 +/− 0.03 0.38 +/− 0.05 0.77 +/− 0.03
wiki-Vote 0.94 +/− 0.01 0.48 +/− 0.05 0.71 +/− 0.03 0.95 +/− 0.01

FIG. 5 illustrates how spreading power is a factor of a node's first and second order degree.

Plotting expected force (x-axis) versus node degree (orange), the sum of the degree of all neighbors (blue), and the sum of the degree of all neighbors at distance 2 (green) (y-axis is log scale) shows that for nodes with low ExF, the neighbors' degree correlates strongly with ExF, while for nodes with high ExF their own degree is more closely correlated. The result is accentuated in the denser collaboration network in comparison to the more diffuse Pareto network.

FIG. 6 shows an image of a graph comprising nodes and edges, wherein the nodes are, e.g., airports and wherein their size is scaled according to the node's relevance within the graph. Such an image of a graph, including the relevance scores, may be generated automatically and displayed to a user, who can immediately discern nodes of greater and lesser relevance without actually knowing the numerical scores. This permits an intuitive preliminary selection of nodes of interest for a particular analytical task.

If the network nodes are filtered, based on their relevance scores, the information required to be analyzed by a user may be reduced; a network may be viewed on a more abstract level.

FIG. 7 shows a diagram wherein the (logarithm of the) time to saturation of a spreading process starting from a node is plotted against the spreading power/relevance/expected force of that node. Saturation is defined as the time when half of the network's nodes are reached by the spreading process. As the diagram shows, the correlation between these two quantities is very striking, confirming the predictive power of the expected force.

FIG. 8 shows a diagram wherein the per-round infection probability is plotted against the expected force of network nodes, e.g. airports. At the start of a round, every infected airport/node tries to infect each of its neighbors with a certain probability. If a node is infected in a round, it will start as un-infected on the next round. The per-round infection probability in this model (the standard susceptible-infected-susceptible model) corresponds to the frequency with which a node is infected in a given number of rounds, e.g. 1000. This number can be interpreted, for example, as the frequency with which a certain rumor is heard by a particular node.

As the diagram shows, there is a strong nonlinear correlation between this quantity and the expected force.

FIG. 9 shows a schematic diagram of a system 900 for monitoring a network according to another embodiment of the invention. Graphs 901, 902 and 903 are examples of networks. The center column shows the processing stream.

In step 910, a representation of the network is obtained in a computer-readable format. The representation may be full or partial, and may represent physical connectivity (e.g. the wires connecting equipment in a data center, or the wires/transformers connecting electricity producing and consuming nodes in an electricity network), a mix of physical and virtual connectivity (e.g. the wiring structure and assigned IP addresses of all computers used by a certain organization), or inferred connectivity (e.g. created from who follows whom on Twitter, or who has worked on a common project in an organization).

The representation as shown is an edgelist which may additionally contain more information about the network connectivity (the strength of a given edge, a time stamp for when the edge is observed, . . . ). This representation is fed to one or more computer processing unit(s), to compute the expected force of some or all nodes in the network, where the “network” is defined from the computer readable representation.
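As an illustration of feeding an edge list to a processing unit, the following sketch turns text lines into an adjacency map. The line format (`source target [weight]`, separated by whitespace, with `#` comment lines) and the function name `read_edgelist` are assumptions made for this example; the source does not prescribe a file format.

```python
from collections import defaultdict


def read_edgelist(lines):
    """Build an undirected, weighted adjacency map from edge-list lines.

    Assumed (hypothetical) line format: `source target [weight]`.
    Blank lines and lines starting with '#' are skipped, self-loops are
    ignored, and a missing weight defaults to 1.0.
    """
    adj = defaultdict(dict)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 2:
            raise ValueError(f"malformed edge line: {line!r}")
        u, v = fields[0], fields[1]
        weight = float(fields[2]) if len(fields) > 2 else 1.0
        if u == v:
            continue  # a self-loop never leaves a cluster
        adj[u][v] = weight
        adj[v][u] = weight
    return dict(adj)
```

A partial representation is handled naturally: nodes that never appear in an observed edge are simply absent from the map.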

Additionally, the network may first be pre-processed, or may otherwise include node and/or edge annotations. For example, nodes could be initially labeled by function or class (e.g. in a power grid, either generator or load; in a corporate structure, by department; in a social network, by relationship) or by membership in some type of community (possibly defined via graph-theoretic methods in the pre-processing step), or by some other approach which classifies nodes and/or increases/reduces the granularity of the network structure.

In step 920, one or more computer processors are instructed to apply the algorithm to the network representation so as to compute the ExF of some or all of the nodes in the network. The calculation of the ExF can be modified to account for features determined in the pre-processing stage. For example, the ExF could be calculated at the level of community rather than individual node, or using different weights for nodes or connections to nodes of certain classes. The results of the analysis are used to modify the network structure as above and/or stored and/or displayed.

In step 930, the user is presented with a list of nodes to monitor (e.g. the top XX most influential nodes in the network, the top XX most influential nodes in each category/department, or the top XX most influential contacts who work at company Z). In an organization-wide computer security application, where nodes are e.g. individual computing devices or user accounts, this could be used to determine the amount of monitoring resources to allocate to each node.
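Selecting the top nodes per category can be sketched as follows. The inputs (a dictionary of ExF scores and a dictionary of category labels), the fallback label "uncategorized" and the function name `top_nodes_by_category` are assumptions for illustration.

```python
import heapq
from collections import defaultdict


def top_nodes_by_category(scores, categories, n):
    """Return the n highest-scoring nodes within each category.

    `scores` maps node -> ExF value; `categories` maps node -> label.
    Nodes without a label are grouped under "uncategorized".
    """
    groups = defaultdict(list)
    for node, score in scores.items():
        groups[categories.get(node, "uncategorized")].append((score, node))
    return {
        label: [node for _, node in heapq.nlargest(n, members)]
        for label, members in groups.items()
    }
```

Passing the same label for every node yields the network-wide top list.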

Alternatively, the results can also be presented as a histogram of values and a single value for a node of interest; as a set of histograms by device type, e.g. mobile devices, workstations, and servers connected in a corporate intranet; or as a pictorial representation of the network with nodes sized by expected force.

They can additionally be stored in a database in step 940. The database is used to support queries, such as all of my contacts at company X, sorted by influence; pandemic outbreak risk for African airports; or, by calculating the expected force at different timesteps, the temporal connectivity profile of each server in a corporate intranet.

Additionally, domain/business logic may be applied to either the raw ExF score, or in reference to historical values from the database, to produce alerts in optional step 950. The domain logic may combine the ExF and an observed signal to identify nodes of interest. A user may be presented with a list of nodes to monitor and why they were flagged as interesting.

In a preferred embodiment, the inventive method can be applied to monitor a dynamic network. By repeating it at regular intervals, historical ExF values may be established for each node. The monitoring reports nodes whose ExF value deviates from historic levels. In an organization-wide computer security application, this could be used to identify nodes that suddenly change their connectivity pattern.
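One possible way to report deviations from historic levels is a z-score test, sketched below. The source does not specify a deviation criterion; the z-score, the default threshold of 3 standard deviations and the function name `flag_deviations` are assumptions chosen for this example.

```python
import statistics


def flag_deviations(history, current, threshold=3.0):
    """Return {node: z_score} for nodes whose current ExF is unusual.

    `history` maps node -> list of past ExF values; `current` maps
    node -> latest ExF value. The z-score criterion is an assumption,
    not a rule taken from the source.
    """
    flagged = {}
    for node, value in current.items():
        past = history.get(node, [])
        if len(past) < 2:
            continue  # not enough history to estimate a spread
        mean = statistics.fmean(past)
        spread = statistics.stdev(past)
        if spread == 0:
            continue  # constant history: z-score undefined, skipped here
        z = (value - mean) / spread
        if abs(z) > threshold:
            flagged[node] = z
    return flagged
```

A production system would probably also want a rule for nodes with constant or very short histories, which this sketch silently skips.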

In an organization-wide computer security application, the business logic could stratify nodes by expected force and compare predicted with actual output. Output could, for example, be measured by volume, triggering an alarm when a highly central node goes silent. Alternately, it could be by content—an alert from a highly central node may be deemed more relevant than one from a peripheral node. Alternately, it could be access: identifying peripheral nodes (which presumably are less monitored) which have access to highly valuable information or highly sensitive equipment.
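The volume-based alarm mentioned above might be sketched as follows. The stratification by a quantile of the ExF distribution, the minimum expected volume and the function name silent_central_nodes are hypothetical choices for this example.

```python
def silent_central_nodes(exf, observed_volume, exf_quantile=0.9, min_volume=1):
    """Return highly central nodes whose observed output volume is too low.

    exf:             dict node -> ExF score
    observed_volume: dict node -> observed output (e.g. messages per interval)
    Assumption: "highly central" means at or above the given quantile of
    the ExF scores; nodes without an observation count as silent.
    """
    if not exf:
        return []
    ranked = sorted(exf.values())
    index = min(len(ranked) - 1, int(exf_quantile * len(ranked)))
    cutoff = ranked[index]
    return sorted(
        node
        for node, score in exf.items()
        if score >= cutoff and observed_volume.get(node, 0) < min_volume
    )
```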

In a further embodiment, the network structure itself may also be dynamic. The dynamics may occur on various time-scales; the structure can be stable over the time-frame of the decision process, or its dynamics may be part of the decision process. An example of the first is an evaluation of the impact of some intervention by measuring the ExF (or distribution of ExF values) before and after the intervention. An example of the second is a load-balancing scheme which attempts to minimize fluctuations in node ExF, or which continually monitors the network for nodes/regions whose ExF is growing/shrinking at interesting rates.

The system is further enabled to allow awareness and monitoring of changes to the network representation. This can be by uploading the full current network configuration at fixed or irregular intervals, or partial updates again at fixed or irregular intervals. For example, the full network representation could be uploaded to the system every evening, or each action taken on the network could update the representation. At updates, the system calculates the ExF of some or all nodes as above. The values are stored in some database along with a time-stamp.
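Storing the values along with a time-stamp could, for instance, be realized as in the following sketch, which uses the sqlite3 module of the Python standard library. The table name exf_history, its columns and the helper function names are assumptions made for illustration; any database may be used.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a store of time-stamped ExF values."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS exf_history ("
        "node TEXT NOT NULL, computed_at TEXT NOT NULL, exf REAL NOT NULL)"
    )
    return conn


def record_scores(conn, scores, computed_at):
    """Store one ExF value per node for the given update.

    computed_at is assumed to be an ISO 8601 string, so that
    lexicographic order equals chronological order.
    """
    conn.executemany(
        "INSERT INTO exf_history (node, computed_at, exf) VALUES (?, ?, ?)",
        [(node, computed_at, value) for node, value in scores.items()],
    )
    conn.commit()


def node_history(conn, node):
    """Return the (time-stamp, ExF) pairs of one node in time order."""
    rows = conn.execute(
        "SELECT computed_at, exf FROM exf_history "
        "WHERE node = ? ORDER BY computed_at",
        (node,),
    )
    return rows.fetchall()
```

The history returned by node_history can then serve as input to the further temporal analysis.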

Further analysis may then be applied to evaluate temporal structure of the ExF measures, on all or part of the network.

The inventive ExF measure can also be employed to provide better search methods for document databases.

Node rankings underpin search engines such as Google. The invention gives more meaningful and more stable results than those produced by e.g. Google's PageRank algorithm. This can power better search results in a number of domains:

Internet/WWW: A better Internet search engine.

It is now expected that large entities, be they corporate or government, maintain an expansive web-based interface to their inner workings. These intranets are the primary manner in which people (both employees and customers/citizens) interface with the entity. Custom search engines for these intranets are a growing need.

Specialized knowledge databases: Likewise, knowledge is increasingly stored and accessed via computerized databases. Two especially relevant domains are legal and medical. An example is IBM Watson. A key feature of Watson is its claimed ability to measure its confidence in its answers. Better metrics of the relevance of search results would lead to more accurate confidence scores.

With an increasingly large portion of our productivity, both personal and professional, in digital form, search is becoming increasingly important to organizing and retrieving our documents. Personal search engines can, e.g., organize digital photos and link them with other relevant information (e.g. contemporary emails).

Search is also becoming social. One does not want to just find “Chinese restaurant Munich”; what is required are results ranked by the credibility of the reviewers or by which locations are popular with one's friends. The invention can be used to accurately quantify the significance of the participants; these weights are then used to re-weight the relevance-based search results so that the finally returned list of search items is a “friends” ordering of the relevant search hits.
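The re-weighting of relevance-based search results by participant significance might be sketched as follows. The multiplicative combination of the base relevance with the summed influence of the endorsing participants, and all names used (friends_ordering, hits, endorsements, influence), are assumptions for illustration only.

```python
def friends_ordering(hits, endorsements, influence):
    """Order search hits by relevance re-weighted with participant influence.

    hits:         dict item -> base relevance of the search hit
    endorsements: dict item -> participants (e.g. reviewers or friends)
                  associated with the item
    influence:    dict participant -> significance weight (e.g. an ExF score)
    Assumption: the weights are combined as base * (1 + sum of influences).
    """
    def weighted(item):
        boost = sum(influence.get(p, 0.0) for p in endorsements.get(item, ()))
        return hits[item] * (1.0 + boost)

    # Highest re-weighted relevance first; ties are broken by item name.
    return sorted(hits, key=lambda item: (-weighted(item), item))
```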

FIG. 10 shows a schematic flowchart 1000 of a method for searching a network based on a relevance of a node in the network according to a different embodiment of the invention. Here, the expected force values are stored in a database, along with other information regarding the nodes. Expected force values are used to order the results of search queries placed to the database.

In step 1010, an index which associates one or more key words with page references is prepared, wherein the page references may be URLs referencing documents in the World Wide Web. In step 1020, a relevance score is assigned to each page using the expected force metric. In step 1030, one or more keywords are received from a user. In step 1040, the index is used to select a list of page references, based on the keywords. These selected page references are then ordered by their relevance score in step 1050, and returned to the user in step 1060.
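Steps 1010 to 1060 can be illustrated by the following minimal sketch. It assumes a simple in-memory inverted index over whitespace-separated words, a precomputed relevance score per page (step 1020) and a conjunctive match over all keywords; the function names build_index and search are hypothetical.

```python
def build_index(pages):
    """Step 1010: associate key words with page references.

    pages: dict page reference (e.g. a URL) -> page text
    """
    index = {}
    for url, text in pages.items():
        for word in set(text.lower().split()):
            index.setdefault(word, set()).add(url)
    return index


def search(index, relevance, keywords):
    """Steps 1030-1060: select pages by keyword and order them by relevance.

    relevance: dict page reference -> relevance score from step 1020
               (assumed to be precomputed, e.g. with the expected force)
    Assumption: a page is selected only if it contains every keyword.
    """
    matches = None
    for word in keywords:
        hits = index.get(word.lower(), set())
        matches = hits if matches is None else matches & hits
    if not matches:
        return []
    # Step 1050: highest relevance first; ties are broken by page reference.
    return sorted(matches, key=lambda url: (-relevance.get(url, 0.0), url))
```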

This process is similar in overall structure to that used by the Google search engine, but differs in the determination of the relevance score. The scientific literature has demonstrated several weaknesses in the PageRank score which is the basis of the original Google relevance score. The PageRank algorithm includes a damping factor, and the choice of damping factor has a strong influence on the resultant relevance scores. Bressan et al. (Marco Bressan and Enoch Peserico, Choose the damping, choose the ranking?, Journal of Discrete Algorithms, 2010, vol. 8, no. 2, pp. 199-213) proved that at least on some graphs, the top k nodes assume all possible k! orderings as the damping factor varies, even if it varies within an arbitrarily small interval (e.g. [0.84999, 0.85001]). Son et al. (Son, S-W. and Christensen, C. and Grassberger, P. and Paczuski, M., PageRank and rank-reversal dependence on the damping factor, Phys Rev E Stat Nonlin Soft Matter Phys, 2012, vol. 86, p. 066104) investigated PageRank scores of internet web pages as a function of different choices of the damping factor, finding that rank reversal occurs frequently over a broad range after even slight changes to the damping factor. PageRank is also sensitive to how the network is observed. Ghoshal et al. (Ghoshal, Gourab and Barabási, Albert László, Ranking stability and super-stable nodes in complex networks, Nat Commun, 2011, vol. 2, p. 394) show that for random networks the ranking provided by PageRank is sensitive to perturbations in the network topology, making it unreliable for incomplete or noisy systems. Pei et al. (Sen Pei and Lev Muchnik and Jose S. Andrade and Zhiming Zheng and Hernan A. Makse, Searching for superspreaders of information in real-world social media, Scientific Reports, 2014, vol. 4) reach an even stronger conclusion by following the real spreading dynamics in a wide range of networks, finding that PageRank fails in ranking users' influence. Ghosh and Lerman (Rumi Ghosh and Kristina Lerman, Rethinking Centrality: The Role of Dynamical Processes in Social Network Analysis, Discrete and Continuous Dynamical Systems Series B, 2014, vol. 19, no. 5, pp. 1355-1372) echo this finding, noting that the random walk model which underlies the PageRank algorithm is not appropriate for social phenomena, and that PageRank-based rankings do not show good agreement with empirical influence rankings.
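To make the role of the damping factor concrete, a textbook power-iteration form of PageRank is sketched below for comparison purposes. It is not part of the inventive method; the uniform handling of nodes without outgoing links and the fixed number of iterations are simplifying assumptions. The damping factor enters every update as an explicit parameter, so that rankings obtained with different values of this parameter can be compared on a given graph.

```python
def pagerank(out_links, damping, iterations=100):
    """Textbook power iteration for PageRank (shown for comparison only).

    out_links: dict node -> list of nodes it links to; every linked node
               must itself be a key of the dictionary
    damping:   the damping factor discussed above, e.g. 0.85
    Assumption: a node without outgoing links spreads its rank uniformly.
    """
    nodes = list(out_links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Teleportation term: the same for every node.
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            targets = out_links[v] or nodes
            share = damping * rank[v] / len(targets)
            for t in targets:
                new_rank[t] += share
        rank = new_rank
    return rank
```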

The expected force metric overcomes all of these shortcomings. It is parameter-free, thus its results are not dependent on the choice of some arbitrary parameter. It depends only on local information, making it robust for incompletely observed or noisy networks. The underlying model is derived directly from the mathematics of spreading processes, allowing it to accurately match real spreading dynamics and empirical influence rankings.

The expected force presents an additional advantage over PageRank in that its relevance score quantifies node influence. PageRank, in contrast, is designed to provide a ranking which identifies the most influential nodes, but does not provide quantitative differences between the different ranks. PageRank (when its results are correct) can tell you that node A is more relevant than node B, but not by how much. The expected force is designed explicitly to provide such information, and its strong correlations to epidemic outcomes show that it succeeds in this task.

Modern industrial infrastructure is made by networking components. The invention can help optimize the design of such infrastructure, be it existing infrastructure which must be built out in a better way or the installation from scratch of a new infrastructure project. Here, the contribution of a node may be interpreted in terms of insight into how (altering) node connectivity impacts network capacity, better estimates of the quality/capacity of the physical equipment requirements, load balancing and routing, expected level of fluctuations, expected impact of failures/vulnerability analysis, and pricing inputs to the system.

Specific areas of application include the electricity grid. In a smart electricity grid, power production is decentralized to many small producers with erratic power creation, while the large power plants retain responsibility for only a base level of the total electricity in the system.

Load balancing and routing remain challenging problems in mobile phone networks, as individual phones move in and out of the range of various cell towers. The invention can be used to quantify each tower's current capacity as a function of its connections to the rest of the network and the number of phones connected to that tower; alternatively, the assignment of phones to the towers within range can be further scaled by the contribution of each tower to allow maximal traffic speeds across the entire network.

Up to 4% of current electricity generation goes towards the internet. Likewise, as companies move information to the cloud, the routing of information between user and cloud storage server locations becomes increasingly important. Better routing of traffic flow can cut costs and/or signal times.

Large-scale public WiFi installations are becoming the norm, not only for venues which expect substantial crowds (Olympic villages, large convention centers, . . . ) but increasingly for cities (Luxembourg has a city-wide public WiFi network) and vacation destinations (El Hierro has installed an island-wide public WiFi network). These systems require complex physical infrastructure in the form of a network of routers, repeaters, and antennas.

Road/transportation: Optimal routing is a major concern of delivery/courier services, logistics companies, taxi firms, Ubercar, and even personal GPS devices (also Google/Apple maps).

FIG. 11 shows a network 1100 with a network controller 1110 used for controlling a network element, based on a relevance score according to an embodiment of the invention. For example, the network can be a computer network with several routers, but also a telecommunications or electricity network.

In a routing application, the method comprises selecting a network element, such as a router 1120. The router 1120 can be virtual, as in a datacenter management system which has pre-allocated a certain amount of memory and/or processing time and/or other system resources. It can also be a network relay or a WiFi router. The network controller 1110 can monitor the network elements, e.g. their current load. To that purpose, the method determines an ExF score of a network element, e.g. a network element wherein a certain load has occurred. In the application, this may be the ExF score of the selected router, representing its connectivity/capacity, i.e. how much the router contributes to routing the network traffic. In a different application, the ExF score is determined not for the selected network element but, for example, for one of its neighboring elements, as an approximation.

In a further step, the method comprises generating a control signal, based on the determined ExF score. In the present application, the control signal is a signal representing a message according to a router communication protocol. It may indicate that the router is to increase its capacity, for example if the load of the router has increased. By taking the ExF score of the router into account, network congestion can be avoided very effectively, because the score pinpoints those nodes where capacity increases are most effective in terms of cost. Once the control signal has been generated, it may be sent to the selected network element.

Alternatively, the control signal may be sent to at least one neighbor element of the selected element, for example to the router 1130. If the selected element cannot itself be controlled, the neighbors can be shut off or told to reconnect/reorganize. Using the ExF score allows choosing the most influential neighbors for modification first. In a further embodiment, the score of a neighbor element may be determined, e.g. for selecting the highest score neighbor to be used for sending a message to the whole network.

In a further embodiment, the method also comprises comparing the determined ExF score with a current or projected load of a network node and changing connectivity/capacity of the network element, if a certain threshold is surpassed.
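The threshold comparison of this embodiment might look as sketched below. The description does not fix how the ExF score and the load are compared; the ratio of load to ExF score used here, the ControlSignal record and the action label "increase_capacity" are therefore assumptions made for illustration and do not represent any particular router communication protocol.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ControlSignal:
    """Hypothetical control message for a network element."""
    target: str
    action: str


def control_signal(element, exf_score, load, threshold=1.0) -> Optional[ControlSignal]:
    """Generate a control signal if the load is high relative to the ExF score.

    Assumption: the current or projected load is compared with the ExF
    score by their ratio, and a signal is generated when the ratio
    surpasses the threshold.
    """
    if exf_score <= 0:
        return None  # no meaningful comparison is possible
    if load / exf_score > threshold:
        return ControlSignal(target=element, action="increase_capacity")
    return None
```

The returned record would then be translated into a message of the protocol in use and sent to the selected element or, alternatively, to one of its neighbors.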

Applications

Social media also create explicit networks of friends and followers. Here, the value of a node is its ability to spread information (giving economic value via advertising/promotion) or its relevance as a passive source of information (i.e. the target of a search).

Given a social network, the invention can measure the spreading power of each person in the network. Users agree to advertise products, and are paid a rate proportional to their ranking, thereby allowing more rational pricing. In comparison to “Klout” (www.klout.com), the invention provides a more accurate measure of a person's influence.

Node quantification can be made more fine-grained by tying it dynamically to user content. Rather than search top-down for structure, it is possible for e.g. Twitter to track a user's network for each individual hashtag that they post, in real time. The Twitter user's profile would then include their rating on each topic they tweet about, metrics which Twitter could use for its own purposes or which the user could use as above for e.g. paid product promotion.

Internet retailers such as Amazon and Netflix commonly list “related items” after each displayed product; such product relationships are stored as a network, e.g. the Amazon co-purchase network. The invention accurately quantifies the importance of each node on the network. Such networks tend to be dynamic, as people's tastes change over time and as new products become available. The dynamic nature of the invention is therefore an advantage: it allows the score of each node in the network (where nodes are products) to be updated with every purchase, giving extremely fine-grained resolution of actual product relationships. Taking it in another direction, the local nature of the inventive method means that it can be computed independently for each user, allowing for more user-centric recommendations.

Implementation

As noted above, example embodiments may include computer program products. The computer program products may be stored on computer-readable media for carrying or having computer-executable instructions or data structures. Such computer-readable media may be any available media that can be accessed by a general purpose or special purpose computer. By way of example, such computer-readable media may include RAM, ROM, EPROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that may be used to carry or store desired program code in the form of computer-executable instructions or data structures and which can be accessed by a general purpose or special purpose computer. When information is transferred or provided over a network or another communications connection (either hardwired, wireless, or a combination of hardwired or wireless) to a computer, the computer properly views the connection as a computer-readable medium. Thus, any such connection is an example of a computer-readable medium. Combinations of the above are also to be included within the scope of computer readable media. Computer-executable instructions include, for example, instructions and data which cause a general purpose computer, a special purpose computer, or a special purpose processing device to perform a certain function or group of functions. Furthermore, computer-executable instructions include, for example, instructions that have to be processed by a computer to transform the instructions into a format that is executable by a computer. The computer-executable instructions may be in a source format that is compiled or interpreted to obtain the instructions in the executable format. When the computer-executable instructions are transformed, a first computer may for example transform the computer executable instructions into the executable format and a second computer may execute the transformed instructions.

The computer-executable instructions may be organized in a modular way so that a part of the instructions may belong to one module and a further part of the instructions may belong to a further module. However, the differences between different modules may not be obvious and instructions of different modules may be intertwined.

Example embodiments have been described in the general context of method operations, which may be implemented in one embodiment by a computer program product including computer-executable instructions, such as program code, executed by computers in networked environments. Generally, program modules include for example routines, programs, objects, components, or data structures that perform particular tasks or implement particular abstract data types. Computer-executable instructions, associated data structures, and program modules represent examples of program code for executing steps of the methods disclosed herein. The particular sequence of such executable instructions or associated data structures represents examples of corresponding acts for implementing the functions described in such operations.

Some embodiments may be operated in a networked environment using logical connections to one or more remote computers having processors. Logical connections may include for example a local area network (LAN) and a wide area network (WAN). These are presented here by way of example and not limitation.

Such networking environments are commonplace in office-wide or enterprise-wide computer networks, intranets and the Internet. Those skilled in the art will appreciate that such network computing environments will typically encompass many types of computer system configurations, including personal computers, hand-held devices, multiprocessor systems, microprocessor-based or programmable consumer electronics, network PCs, minicomputers, mainframe computers, and the like. Embodiments may also be practiced in distributed computing environments where tasks are performed by local and remote processing devices that are linked (either by hardwired links, wireless links, or by a combination of hardwired or wireless links) through a communications network. In a distributed computing environment, program modules may be located in both local and remote memory storage devices.

An example system for implementing the overall system or portions might include a general purpose computing device in the form of a conventional computer, including a processing unit, a system memory, and a system bus that couples various system components including the system memory to the processing unit. The system memory may include read only memory (ROM) and random access memory (RAM). The computer may also include a magnetic hard disk drive for reading from and writing to a magnetic hard disk, a magnetic disk drive for reading from or writing to a removable magnetic disk, and an optical disk drive for reading from or writing to a removable optical disk such as a CD-ROM or other optical media. The drives and their associated computer readable media provide nonvolatile storage of computer executable instructions, data structures, program modules and other data for the computer.

Software and web implementations could be accomplished with standard programming techniques with rule based logic and other logic to accomplish the various database searching steps, correlation steps, comparison steps and decision steps. It should also be noted that the word “component” as used herein and in the claims is intended to encompass implementations using one or more lines of software code, hardware implementations, or equipment for receiving manual inputs.

Lawyer, Glenn

Assignment (executed Jan 30 2015): Max-Planck-Gesellschaft zur Förderung der Wissenschaften e.V. (assignment on the face of the patent)