Query Order Optimization in xGT

As described in Introduction to TQL, a query in xGT's TQL language can have a fairly complex description of the shape of the pattern to match (or "structure") in the large data graph stored in the system's memory.

The structure could include a number of topology elements coming from multiple vertex and edge frames in xGT. In addition, the direction in which edges connect vertex topology elements can be specified as part of the structure.

A TQL query will also usually include constraints on the values of the properties of the topology elements specified in the structure.

An example of a complex query with multiple topology elements and property constraints is the following:

MATCH (a)-[boot:HostEvents]->(a)-[program:HostEvents]->(a)-[nf1:Netflow]->(b)-[nf2:Netflow]->(c)
WHERE unique_vertices(a, b, c) AND boot.event_id = 4608 AND program.event_id = 4688
AND nf1.src_port = 3128 AND nf2.duration >= 3600
AND boot.epoch_time <= program.epoch_time
AND program.epoch_time <= nf1.epoch_time
AND nf1.epoch_time - boot.epoch_time < 4
AND nf2.epoch_time < nf1.epoch_time
AND nf2.epoch_time + nf2.duration >= nf1.epoch_time
RETURN count(*)
INTO Results

This example, taken from the LANL Netflow data set and queries, illustrates the case of a complex query with multiple vertex and edge topology elements as well as a significant number of property constraints on them.

The TQL representation of a query like this is a declarative form. There are multiple valid orders in which the topology elements may be visited and their properties evaluated against the constraints. For example, in addition to the order provided by the user, some other valid orders include:

(c)<-[nf2:Netflow]-(b)<-[nf1:Netflow]-(a)-[boot:HostEvents]->(a)-[program:HostEvents]->(a)

(b)<-[nf1:Netflow]-(a)-[boot:HostEvents]->(a)-[program:HostEvents]->(a),
(b)-[nf2:Netflow]->(c)

(a)-[program:HostEvents]->(a)-[nf1:Netflow]->(b)-[nf2:Netflow]->(c)
(a)-[boot:HostEvents]->(a)

To efficiently execute a query, xGT must decide on a particular order in which to visit the topology elements specified in the query as well as evaluate their property values to see if they satisfy the specified constraints. Query Order Optimization is the process by which xGT decides on a particular sequence of operations to execute that implement the query as described by the user in the TQL declarative form.

xGT uses a statistical approach that samples the large data graph as it is being populated with data. Samples are captured related to the topological connection of vertices and edges from different frames, as well as property value frequencies in the frames.

These statistical samples are used to estimate, for example, whether it will be faster to start a query by iterating over a vertex frame with just a few hundred instances or an edge frame with a large number of instances, but with a property constraint that limits the matching edges to just ten edges. Many possible orders are considered and the plan with the lowest cost estimate is the order that is used.

Query order optimization is turned on by default but can be turned off either for all new queries or on a per query basis by changing the query optimization level. If it is turned off, xGT will use the order provided by the user in the query string. The default query optimization level is 4, and an optimization level of 3 or less will turn off query optimization. See the API documentation for Connection.set_optimization_level(), Connection.run_job(), or Connection.schedule_job() for more information about query optimization levels.

To turn off query optimization for all new queries, call Connection.set_optimization_level(optlevel = 3) with an optimization level of 3 or less. To turn off query optimization for a single query, give an optimization level of 3 or less when starting a job with Connection.run_job(optlevel = 3) or Connection.schedule_job(optlevel = 3).

The query order optimization will be effective only if statistical samples have been collected. This can be turned off and on using the configuration setting metrics.cache described in Configuring xgtd, but it is on by default.

If turned on, statistics collection will occur in the background after data is loaded or modified. A user can run a query immediately after loading data while statistics are still being collected. In that case, if metrics needed for order optimization are still being updated, xGT may either use stale or incomplete values or skip order optimization for that query.

Checking the status of statistics collection

To ensure that statistics are updated before running a query, a user can check the status of metrics collection using wait_for_metrics(), which blocks until statistics are no longer being updated. The return value of wait_for_metrics() is true if metrics collection is turned on and has finished updating and false if either metrics collection is turned off or if the method returned due to a timeout. Below is an example usage.

#-- Load data into an existing frame --
nf = conn.get_edge_frame('Netflow')
nf.load('xgtd://edges.csv')

#-- Wait for statistics collection --
finished = conn.wait_for_metrics()

#-- Run a query with query order optimization on by default --
conn.run_job(query)

Another option is to use the non-blocking method get_metrics_status(), which returns a string describing the state of statistics collection. If metrics collection is turned off the return value will be metrics_off. If it is turned on and completed the return value will be metrics_completed. If statistics are still being computed, the return value will be metrics_running.

Note that statistics may be updated when the data is changed either due to a load(), an insert(), or running a TQL query that modifies the graph.