6.3. Jaccard Similarity¶

The single-source version of this problem is:

Given a specific vertex A, find other vertices B that are most similar to A.

This is done by computing a similarity metric for each vertex B and then rank-ordering the vertices by that score.

The Jaccard similarity metric is between two different vertices A and B. It is computed as the ratio of the intersection of their neighboring U vertex sets to the union of their neighbor sets. The diagram shows, for a given A and B, a neighborhood union of size 8. The intersection of the two neighborhoods include: {1, 3, 7}, so the intersection size is 3. If the intersection is the same size as the union, that means the two vertices have identical neighborhoods and are therefore considered to be most similar.

Two Vertices with Similar Neighborhoods¶

This metric is based on the graph topology and is not influenced by the vertex and edge property values. It may be more interesting to include data values as part of a Jaccard analysis.

The overall flow of this analysis using xGT involves these phases:

Build a connectivity graph to drastically reduce the search space
Find all two-paths from A to B through intermediate vertices Ux over the connectivity graph
Group all of the two-paths by the B vertex to get a count of the number of two paths from A to B (which is the size of the intersection)
Use the outdegree of B within the connectivity graph to compute the size of the union, which leads to the similarity metric

6.3.1. Connect to the xGT Server¶

Before embarking on phase 1, some prelimiaries are required to interact with the xGT server.

import xgt
import getpass
server=xgt.Connection(userid=getpass.getuser(),
                      credentials=getpass.getpass())
server.server_version

  Password: ········

  '1.4.0'

6.3.2. Retrieve Metadata about the Graph¶

Select specific parts of the graph to analyze:

A_VERTEX_KEY selects the key of the unique vertex to play the role of A. Note that this vertex is the single source and the Jaccard computation will find other vertices most similar to A.
EDGE_FRAME_NAME is the name of the edge frame connecting A to all of the B vertices.
NAMESPACE is the name of the namespace holding the desired edge frame.

From the selected information, the components of the graph in the xGT server are extracted.

A_VERTEX_KEY='VPN'
EDGE_FRAME_NAME='Netflow'
NAMESPACE='graph'
FULL_EDGE_FRAME_NAME=NAMESPACE + '__' + EDGE_FRAME_NAME

edge = server.get_edge_frame(FULL_EDGE_FRAME_NAME)
source_vertex = server.get_vertex_frame(edge.source_name)
target_vertex = server.get_vertex_frame(edge.target_name)
VERTEX_KEY = source_vertex.key

(source_vertex, edge, target_vertex, VERTEX_KEY)

  (<xgt.graph.VertexFrame at 0x1093b3af0>,
   <xgt.graph.EdgeFrame at 0x1093b3ee0>,
   <xgt.graph.VertexFrame at 0x1093d84c0>,
   'device')

6.3.3. Define Utility Functions¶

These utility functions are used during the phases to interact with the xGT server.

#---- Utility functions for launching queries
def run_query(server, query, table_name = NAMESPACE + "__answers",
               drop_answer_table=True, show_query=False):
    if drop_answer_table and table_name is not None:
        server.drop_frame(table_name)
    if query[-1] != '\n':
        query += '\n'
    if table_name is not None:
        query += 'INTO {}'.format(table_name)
    if show_query:
        print("Query:\n" + query)
    job = server.schedule_job(query)
    if show_query:
        print("Launched job {} at time: {}".format(job.id, time.asctime()))
    server.wait_for_job(job)
    if table_name is None:
        return None
    table = server.get_table_frame(table_name)
    return table

import time
def run_query_and_count(server, query, table_name = NAMESPACE + "__answers",
                        drop_answer_table=True, show_query=False):
    server.drop_frame(table_name)
    start_time = time.time()
    server.wait_for_metrics()
    wait_time = time.time() - start_time
    if wait_time > 30:
      print('Time to wait for metrics: {:3,.2f}'.format(wait_time))
    table = run_query(server, query, table_name, drop_answer_table, show_query)
    # Retrieve count
    count = table.get_data()[0][0]
    return count

# Utility to print the data sizes currently in xGT
def print_data_summary(server, frames=None):
    if frames is None:
        frames = server.get_table_frames() + server.get_vertex_frames() + server.get_edge_frames()
    if len(frames)==0:
        print("No data has been defined")
    for frame in frames:
        frame_name = frame.name
        if isinstance(frame, xgt.EdgeFrame):
            frame_type = 'edge'
        elif isinstance(frame, xgt.VertexFrame):
            frame_type = 'vertex'
        elif isinstance(frame, xgt.TableFrame):
            frame_type = 'table'
        else:
            frame_type = 'unknown'
        print("{} ({}): {:,}".format(frame_name, frame_type, frame.num_rows))

print_data_summary(server)

  graph__Devices (vertex): 157,875
  graph__Netflow (edge): 222,323,503

6.3.4. Phase 1: Build a Connectivity Graph¶

Between any two vertices, a -> b, there may be many edges, perhaps millions.

To compute Jaccard similarity, all we need to know is if there is at least one or if there are zero edges.

We superimpose a connectivity graph on the same vertex set and leave either zero or one edges between every pair of vertices.

This phase consists of finding distinct vertex pairs with at least one edge. Note that this results in a table with the source and target vertex keys. From there we populate a new edge frame.

# create connectivity edge
server.drop_frame(NAMESPACE + '__ConnectedTo')
edge_schema = dict(edge.schema)
connected_to = server.create_edge_frame(
    name=NAMESPACE + '__ConnectedTo',
    schema=[[edge.source_key, edge_schema[edge.source_key]],
            [edge.target_key, edge_schema[edge.target_key]]],
    source=edge.source_name,
    target=edge.target_name,
    source_key=edge.source_key,
    target_key=edge.target_key)
connected_to

  <xgt.graph.EdgeFrame at 0x1093b3a90>

%%time
# Find distinct pairs of vertices over edge frame
q = """
MATCH (a)-[e1:{edge_frame}]->(b)
WHERE unique_vertices(a, b)
RETURN DISTINCT a.{vertex_key} AS a, b.{vertex_key} AS b
""".format(edge_frame=FULL_EDGE_FRAME_NAME, vertex_key=VERTEX_KEY)

table = run_query(server, q, table_name=NAMESPACE + '__DistinctConnections', show_query=True)
print("Number of intermediate rows: {:,}".format(table.num_rows))

  Query:

  MATCH (a)-[e1:graph__Netflow]->(b)
  WHERE unique_vertices(a, b)
  RETURN DISTINCT a.device AS a, b.device AS b
  INTO graph__DistinctConnections
  Launched job 126 at time: Wed Jul 22 08:57:19 2020
  Number of intermediate rows: 559,680
  CPU times: user 7.15 ms, sys: 4.68 ms, total: 11.8 ms
  Wall time: 7.7 s

%%time
# From table of distinct vertex pairs, populate connectivity edge
q = """
MATCH (ab:{distinct_connections})
MERGE (a : {source_vertex_frame_name} {{ {source_vertex_key} : ab.a }})
MERGE (b : {target_vertex_frame_name} {{ {target_vertex_key} : ab.b }})
CREATE (a)-[e: {connected_to} {{ }}]->(b)
""".format(distinct_connections=NAMESPACE + '__DistinctConnections',
           source_vertex_frame_name=edge.source_name, target_vertex_frame_name=edge.target_name,
           source_vertex_key=source_vertex.key, target_vertex_key=target_vertex.key,
           connected_to=NAMESPACE + '__ConnectedTo')

run_query(server, q, table_name=None, show_query=True)
server.drop_frame(NAMESPACE + '__DistinctConnections')

  Query:

  MATCH (ab:graph__DistinctConnections)
  MERGE (a : graph__Devices { device : ab.a })
  MERGE (b : graph__Devices { device : ab.b })
  CREATE (a)-[e: graph__ConnectedTo { }]->(b)

  Launched job 176 at time: Wed Jul 22 08:57:27 2020
  CPU times: user 4.76 ms, sys: 2.45 ms, total: 7.2 ms
  Wall time: 218 ms

  True

6.3.5. Phase 2: Count Two Paths from A to B¶

This phase finds two-paths from the A vertex to all other B vertices through an intermediate vertex U.

Since A is fixed, we retain distinct combinations of U and B, along with the sum of the outdegrees of A and B, which we use in a later phase. This outdegree sum turns out to be the same as the sum of the size of the union of A and B neighborhoods and the size of their intersection.

%%time
import pandas
q = """
MATCH (a)-[e1:{connected_to}]->(u)<-[e2:{connected_to}]-(b)
WHERE unique_vertices(a, b, u)
  AND a.{vertex_key} = "{a_vertex_key}"
RETURN DISTINCT b.{vertex_key} AS b, u.{vertex_key} as u,
       outdegree(a, {connected_to}) + outdegree(b, {connected_to}) AS outDegreeSum
""".format(connected_to=NAMESPACE + '__ConnectedTo',
           vertex_key=source_vertex.key, a_vertex_key=A_VERTEX_KEY)

table = run_query(server, q, table_name=NAMESPACE + '__phase2', show_query=True)
print("Number of intermediate rows: {:,}".format(table.num_rows))
if table.num_rows > 100:
    print("Showing the first 100 rows:")
    df=table.get_data_pandas(0, 100)
else:
    df=table.get_data_pandas()
df

  Query:

  MATCH (a)-[e1:graph__ConnectedTo]->(u)<-[e2:graph__ConnectedTo]-(b)
  WHERE unique_vertices(a, b, u)
    AND a.device = "VPN"
  RETURN DISTINCT b.device AS b, u.device as u,
         outdegree(a, graph__ConnectedTo) + outdegree(b, graph__ConnectedTo) AS outDegreeSum
  INTO graph__phase2
  Launched job 190 at time: Wed Jul 22 08:57:29 2020
  Number of intermediate rows: 256,417
  Showing the first 100 rows:
  CPU times: user 420 ms, sys: 139 ms, total: 559 ms
  Wall time: 2.07 s

	b	u	outDegreeSum
0	Comp853122	Comp072974	614
1	Comp217088	Comp008822	620
2	Comp077180	Comp857132	605
3	Comp873655	Comp908480	613
4	Comp064818	Comp184712	605
...	...	...	...
95	Comp356840	Comp604291	628
96	Comp281329	Comp275646	605
97	Comp639272	Comp037853	595
98	Comp970485	Comp398468	612
99	Comp817584	Comp275646	624

100 rows × 3 columns

6.3.6. Phase 3: Group By B and Count Number of U Vertices¶

The count of distinct U vertices for A -> U <- B represents the size of the intersection of the neighborhoods of A and B.

We also carry forward the sum of the outdegree of A and B to the next phase.

%%time
q = """
MATCH (row:{phase2})
RETURN row.b AS b, row.outDegreeSum AS outDegreeSum,
       COUNT(*) AS intersectionSize
""".format(phase2=NAMESPACE + '__phase2')
table = run_query(server, q, table_name=NAMESPACE + '__phase3')
server.drop_frame(NAMESPACE + '__phase2')
df = table.get_data_pandas()
df

  CPU times: user 80.9 ms, sys: 9.06 ms, total: 90 ms
  Wall time: 474 ms

	b	outDegreeSum	intersectionSize
0	Comp576990	591	12
1	Comp250033	597	13
2	Comp125251	589	10
3	Comp753523	607	28
4	Comp528543	580	1
...	...	...	...
24615	Comp704389	595	15
24616	Comp505509	603	23
24617	Comp181012	585	4
24618	Comp747115	619	30
24619	IP479316	585	2

24620 rows × 3 columns

6.3.7. Phase 4: Compute Jaccard Score¶

We now scan the latest intermediate table and compute the Jaccard score from the counts computed in previous phases.

%%time
q = """
MATCH (row:{phase3})
RETURN row.b AS b, row.intersectionSize AS intersectionSize,
       row.outDegreeSum - row.intersectionSize AS union_size,
       (row.intersectionSize + 0.0) / (row.outDegreeSum - row.intersectionSize) AS jaccard_score
ORDER BY jaccard_score DESC
LIMIT 50
""".format(phase3=NAMESPACE + '__phase3')
table = run_query(server, q, table_name=NAMESPACE + '__jaccard')
server.drop_frame(NAMESPACE + '__phase3')
df = table.get_data_pandas()
df

  CPU times: user 7.46 ms, sys: 2.78 ms, total: 10.2 ms
  Wall time: 147 ms

	b	intersectionSize	union_size	jaccard_score
0	Comp678443	87	665	0.130827
1	Comp030334	74	781	0.094750
2	Comp965575	72	780	0.092308
3	Comp257274	72	782	0.092072
4	Comp866402	72	783	0.091954
5	Comp211931	50	580	0.086207
6	Comp769797	50	584	0.085616
7	Comp148004	54	637	0.084772
8	Comp148684	49	580	0.084483
9	Comp464369	49	583	0.084048
10	Comp889106	50	598	0.083612
11	Comp501848	48	586	0.081911
12	Comp415532	48	586	0.081911
13	Comp725432	48	592	0.081081
14	Comp468329	47	583	0.080617
15	Comp304006	47	583	0.080617
16	Comp055001	47	585	0.080342
17	Comp211432	46	582	0.079038
18	Comp377866	46	584	0.078767
19	Comp507194	46	584	0.078767
20	Comp613192	46	585	0.078632
21	Comp771132	46	586	0.078498
22	Comp599337	46	587	0.078365
23	Comp719990	45	582	0.077320
24	Comp643534	46	595	0.077311
25	Comp881722	45	583	0.077187
26	Comp493395	45	583	0.077187
27	Comp614801	45	585	0.076923
28	Comp764436	45	586	0.076792
29	Comp302339	45	592	0.076014
30	Comp991220	46	608	0.075658
31	Comp130208	44	582	0.075601
32	Comp924691	45	598	0.075251
33	Comp943564	44	585	0.075214
34	Comp018032	44	585	0.075214
35	Comp498377	44	590	0.074576
36	Comp210768	44	592	0.074324
37	Comp688526	43	581	0.074010
38	Comp958822	43	581	0.074010
39	Comp591316	43	582	0.073883
40	Comp856137	43	582	0.073883
41	Comp354119	43	582	0.073883
42	Comp617264	43	582	0.073883
43	Comp498743	43	583	0.073756
44	Comp559728	43	584	0.073630
45	Comp084630	44	598	0.073579
46	Comp356840	43	585	0.073504
47	Comp077821	43	585	0.073504
48	Comp452053	43	585	0.073504
49	Comp959644	43	587	0.073254