4.4. Using Apache Arrow Flight with Trovares xGT¶
The outline of this notebook:
Generate a table of edges with random data.
Bring edge table into an Arrow format.
Send Arrow Flight to Trovares xGT.
Request to run a graph query on xGT.
Bring results back as an Arrow Flight.
Visualize results.
4.4.1. Generate edges with random data.¶
First, set up parameters for edge table.
The table will have four columns. The first two columns representing source and target of each edge. The other two columns representing edge attributes.
The values of the first two columns are drawn randomly in the range of
1
to NUM_NODES
.
# Metadata of random data table
NUM_ROWS=100
NUM_NODES=10
# Generate table
import pandas as pd
import random
random.seed()
alpha='abcdefghijklmnopqrstuvwxyz'
data = {
'source': [random.randint(1, NUM_NODES) for _ in range(NUM_ROWS)],
'target': [random.randint(1, NUM_NODES) for _ in range(NUM_ROWS)],
'attr1' : [random.choice(alpha) for _ in range(NUM_ROWS)],
'attr2' : [random.choice(alpha) for _ in range(NUM_ROWS)],
}
df = pd.DataFrame(data)
4.4.2. Bring edge table into Arrow format.¶
import pyarrow as pa
import pyarrow.flight
arrow_edges = pa.Table.from_pandas(df)
arrow_edges
pyarrow.Table
source: int64
target: int64
attr1: string
attr2: string
----
source: [[6,6,9,2,4,...,7,3,8,9,10]]
target: [[7,1,10,6,9,...,7,7,6,4,3]]
attr1: [["x","u","u","r","k",...,"p","x","m","q","g"]]
attr2: [["a","u","j","f","d",...,"m","g","p","g","x"]]
arrow_edges.column_names
['source', 'target', 'attr1', 'attr2']
4.4.3. Send Arrow Flight to Trovares xGT.¶
First create frames on xGT to hold the graph data.
Then establish a PyArrow Flight connection to the xGT server and send the data.
import xgt
server = xgt.Connection()
server.set_default_namespace('arrow')
server.create_edge_frame_from_data(arrow_edges, 'Edge',
source='Node', target='Node',
source_key='source', target_key='target')
<xgt.graph.EdgeFrame at 0x1109d4370>
4.4.4. Request to run a graph query on xGT.¶
query = """
MATCH (a)-[e1:Edge]->(b)-[e2:Edge]->(c)
WHERE a <> b and b <> c and a <> c
AND e1.attr1 = e2.attr1
RETURN DISTINCT a.id AS a, c.id AS c, e1.attr1 AS attr1
INTO AnswerSet
"""
job = server.run_job(query)
data = job.get_data(format='pandas')
data
4.4.5. Bring results back¶
Note that this process uses the Arrow Flight mechanism to transfer the
AnswerSet
data from the xGT server to the client.
answers = server.get_frame('AnswerSet').get_data(format='arrow')
answers
pyarrow.Table
a: int64
c: int64
attr1: string
----
a: [[10,6,5,1,2,...,4,6,1,9,4]]
c: [[7,3,3,10,3,...,3,5,6,4,6]]
attr1: [["t","q","p","u","l",...,"q","o","r","p","q"]]
4.4.6. Visualize results.¶
#answers_table = pa.Table.from_pandas(answers)
try:
import graphviz
from IPython.display import Image
gr = graphviz.Digraph()
# Draw nodes for the union of all source and target nodes in the two paths
for node in set(answers.column('a')) | set(answers.column('c')):
gr.node(str(node))
# Draw edges for each returned edge
for batch in answers.to_batches():
d = batch.to_pydict()
for a, c, attr1 in zip(d['a'], d['c'], d['attr1']):
# Do something with the row
gr.edge(str(a), str(c), label=f'"{str(attr1)}"')
except ImportError:
print("You must install graphviz in order to vizualize the results as a graph")
print("\n!python3 -m pip install graphviz")
gr = None
gr