4.3. Jaccard Similarity¶
The single-source version of this problem is:
Given a specific vertex A, find other vertices B that are most similar to A.
This is done by computing a similarity metric for each vertex B and then rank-ordering the vertices by that score.
The Jaccard similarity metric is between two different vertices A and B. It is computed as the ratio of the intersection of their neighboring U vertex sets to the union of their neighbor sets. The diagram shows, for a given A and B, a neighborhood union of size 8. The intersection of the two neighborhoods include: {1, 3, 7}, so the intersection size is 3. If the intersection is the same size as the union, that means the two vertices have identical neighborhoods and are therefore considered to be most similar.
This metric is based on the graph topology and is not influenced by the vertex and edge property values. It may be more interesting to include data values as part of a Jaccard analysis.
The overall flow of this analysis using xGT
involves these phases:
Build a connectivity graph to drastically reduce the search space
Compute the Jaccard Scores
4.3.1. Connect to the xGT Server¶
Before embarking on phase 1, some prelimiaries are required to interact
with the xGT
server.
import xgt
import getpass
server=xgt.Connection(auth = xgt.BasicAuth(getpass.getuser(),
getpass.getpass()))
server.server_version
········
'1.8.0'
4.3.2. Retrieve Metadata about the Graph¶
Select specific parts of the graph to analyze:
A_VERTEX_KEY
selects the key of the unique vertex to play the role of A. Note that this vertex is the single source and the Jaccard computation will find other vertices most similar to A.EDGE_FRAME_NAME
is the name of the edge frame connecting A to all of the B vertices.
From the selected information, the components of the graph in the
xGT
server are extracted.
A_VERTEX_KEY='VPN'
EDGE_FRAME_NAME='Netflow'
server.set_default_namespace('lanl')
edge = server.get_frame(EDGE_FRAME_NAME)
source_vertex = server.get_frame(edge.source_name)
target_vertex = server.get_frame(edge.target_name)
VERTEX_KEY = source_vertex.key
(source_vertex, edge, target_vertex, VERTEX_KEY)
(<xgt.graph.VertexFrame at 0x111686580>,
<xgt.graph.EdgeFrame at 0x111686c40>,
<xgt.graph.VertexFrame at 0x1116868b0>,
'device')
4.3.3. Define Utility Functions¶
These utility functions are used during the phases to interact with the
xGT
server.
# Utility to print the data sizes currently in xGT
def print_data_summary(server):
frames = (server.get_frames())
if len(frames)==0:
print("No data has been defined")
for frame in frames:
frame_name = frame.name
if isinstance(frame, xgt.EdgeFrame):
frame_type = 'edge'
elif isinstance(frame, xgt.VertexFrame):
frame_type = 'vertex'
elif isinstance(frame, xgt.TableFrame):
frame_type = 'table'
else:
frame_type = 'unknown'
print("{} ({}): {:,}".format(frame_name, frame_type, frame.num_rows))
print_data_summary(server)
lanl__Devices (vertex): 157,875
lanl__Netflow (edge): 222,323,503
4.3.4. Phase 1: Build a Connectivity Graph¶
Between any two vertices, a -> b
, there may be many edges, perhaps
millions.
To compute Jaccard similarity, all we need to know is if there is at least one or if there are zero edges.
We superimpose a connectivity graph on the same vertex set and leave either zero or one edges between every pair of vertices.
This phase consists of finding distinct vertex pairs with at least one edge. Note that this results in a table with the source and target vertex keys. From there we populate a new edge frame.
# create connectivity edge
connected_to_edge = 'ConnectedTo'
server.drop_frame(connected_to_edge)
edge_schema = dict(edge.schema)
connected_to = server.create_edge_frame(
name=connected_to_edge,
schema=[[edge.source_key, edge_schema[edge.source_key]],
[edge.target_key, edge_schema[edge.target_key]]],
source=edge.source_name,
target=edge.target_name,
source_key=edge.source_key,
target_key=edge.target_key)
connected_to
<xgt.graph.EdgeFrame at 0x111188af0>
%%time
# Find distinct pairs of vertices over edge frame, create one edge for each distinct pair
q = """
MATCH (a)-[e1:{edge_frame}]->(b)
WHERE unique_vertices(a, b)
WITH DISTINCT a, b
CREATE (a)-[e: {connected_to_edge} {{ }}]->(b)
RETURN count(*) AS count
""".format(edge_frame=EDGE_FRAME_NAME,
connected_to_edge=connected_to_edge
)
print("query:\n\n{}".format(q))
job = server.run_job(q, parameters={})
print("Number of 'connected_to' edges: {:,}".format(job.get_data()[0][0]))
query:
MATCH (a)-[e1:Netflow]->(b)
WHERE unique_vertices(a, b)
WITH DISTINCT a, b
CREATE (a)-[e: ConnectedTo { }]->(b)
RETURN count(*) AS count
Number of 'connected_to' edges: 559,680
CPU times: user 5.85 ms, sys: 6.31 ms, total: 12.2 ms
Wall time: 9.13 s
4.3.5. Phase 2: Compute the Jaccard scores¶
This phase finds two-paths from the A vertex to all other B vertices through an intermediate vertex U.
From this information, for each vertex B, the intersection size and union size are computed followed by the ratio (which is the Jaccard score).
We rely on the observation that the outdegree sum is the same as the size of the union of A and B neighborhoods.
We also rely on the observation that the number of two-paths from A to B is the same as the size of the intersection of the two neighborhoods.
%%time
import pandas
q = """
MATCH (a)-[e1:{connected_to_edge}]->(u)<-[e2:{connected_to_edge}]-(b)
WHERE unique_vertices(a, b, u)
AND a.{vertex_key} = $a_vertex_key
WITH DISTINCT b, u,
outdegree(a, {connected_to_edge}) + outdegree(b, {connected_to_edge}) AS out_degree_sum
WITH b, out_degree_sum, count(*) AS intersection_size
WITH b, intersection_size, out_degree_sum - intersection_size AS union_size
RETURN
b.{vertex_key} AS b,
intersection_size,
union_size,
toFloat(intersection_size) / union_size AS jaccard_score
ORDER BY jaccard_score DESC
LIMIT 50
""".format(connected_to_edge=connected_to_edge, vertex_key=source_vertex.key)
print("query:\n\n{}".format(q))
job = server.run_job(q, parameters={"a_vertex_key":A_VERTEX_KEY})
df=job.get_data(format='pandas')
df
query:
MATCH (a)-[e1:ConnectedTo]->(u)<-[e2:ConnectedTo]-(b)
WHERE unique_vertices(a, b, u)
AND a.device = $a_vertex_key
WITH DISTINCT b, u,
outdegree(a, ConnectedTo) + outdegree(b, ConnectedTo) AS out_degree_sum
WITH b, out_degree_sum, count(*) AS intersection_size
WITH b, intersection_size, out_degree_sum - intersection_size AS union_size
RETURN
b.device AS b,
intersection_size,
union_size,
toFloat(intersection_size) / union_size AS jaccard_score
ORDER BY jaccard_score DESC
LIMIT 50
CPU times: user 538 ms, sys: 260 ms, total: 798 ms
Wall time: 2.8 s
b | intersection_size | union_size | jaccard_score | |
---|---|---|---|---|
0 | Comp678443 | 87 | 665 | 0.130827 |
1 | Comp030334 | 74 | 781 | 0.094750 |
2 | Comp965575 | 72 | 780 | 0.092308 |
3 | Comp257274 | 72 | 782 | 0.092072 |
4 | Comp866402 | 72 | 783 | 0.091954 |
5 | Comp211931 | 50 | 580 | 0.086207 |
6 | Comp769797 | 50 | 584 | 0.085616 |
7 | Comp148004 | 54 | 637 | 0.084772 |
8 | Comp148684 | 49 | 580 | 0.084483 |
9 | Comp464369 | 49 | 583 | 0.084048 |
10 | Comp889106 | 50 | 598 | 0.083612 |
11 | Comp415532 | 48 | 586 | 0.081911 |
12 | Comp501848 | 48 | 586 | 0.081911 |
13 | Comp725432 | 48 | 592 | 0.081081 |
14 | Comp304006 | 47 | 583 | 0.080617 |
15 | Comp468329 | 47 | 583 | 0.080617 |
16 | Comp055001 | 47 | 585 | 0.080342 |
17 | Comp211432 | 46 | 582 | 0.079038 |
18 | Comp377866 | 46 | 584 | 0.078767 |
19 | Comp507194 | 46 | 584 | 0.078767 |
20 | Comp613192 | 46 | 585 | 0.078632 |
21 | Comp771132 | 46 | 586 | 0.078498 |
22 | Comp599337 | 46 | 587 | 0.078365 |
23 | Comp719990 | 45 | 582 | 0.077320 |
24 | Comp643534 | 46 | 595 | 0.077311 |
25 | Comp881722 | 45 | 583 | 0.077187 |
26 | Comp493395 | 45 | 583 | 0.077187 |
27 | Comp614801 | 45 | 585 | 0.076923 |
28 | Comp764436 | 45 | 586 | 0.076792 |
29 | Comp302339 | 45 | 592 | 0.076014 |
30 | Comp991220 | 46 | 608 | 0.075658 |
31 | Comp130208 | 44 | 582 | 0.075601 |
32 | Comp924691 | 45 | 598 | 0.075251 |
33 | Comp018032 | 44 | 585 | 0.075214 |
34 | Comp943564 | 44 | 585 | 0.075214 |
35 | Comp498377 | 44 | 590 | 0.074576 |
36 | Comp210768 | 44 | 592 | 0.074324 |
37 | Comp688526 | 43 | 581 | 0.074010 |
38 | Comp958822 | 43 | 581 | 0.074010 |
39 | Comp591316 | 43 | 582 | 0.073883 |
40 | Comp856137 | 43 | 582 | 0.073883 |
41 | Comp354119 | 43 | 582 | 0.073883 |
42 | Comp617264 | 43 | 582 | 0.073883 |
43 | Comp498743 | 43 | 583 | 0.073756 |
44 | Comp559728 | 43 | 584 | 0.073630 |
45 | Comp084630 | 44 | 598 | 0.073579 |
46 | Comp356840 | 43 | 585 | 0.073504 |
47 | Comp077821 | 43 | 585 | 0.073504 |
48 | Comp452053 | 43 | 585 | 0.073504 |
49 | Comp959644 | 43 | 587 | 0.073254 |
server.drop_frame(connected_to_edge)
True