2.19. Query Optimization

As described in The Trovares Query Language (TQL), a query in xGT’s TQL language can have a fairly complex description of the shape of the pattern to match (or “structure”) in the large data graph stored in the system’s memory.

The structure could include a number of topology elements coming from multiple vertex and edge frames in xGT. In addition, the direction in which edges connect vertex topology elements can be specified as part of the structure.

A TQL query will also usually include constraints on the values of the properties of the topology elements specified in the structure.

An example of a complex query with multiple topology elements and property constraints is the following:

MATCH (a)-[boot:HostEvents]->(a)-[program:HostEvents]->(a)
         -[nf1:Netflow]->(b)-[nf2:Netflow]->(c)
WHERE unique_vertices(a, b, c) AND boot.event_id = 4608 AND program.event_id = 4688
AND nf1.src_port = 3128 AND nf2.duration >= 3600
AND boot.epoch_time <= program.epoch_time
AND program.epoch_time <= nf1.epoch_time
AND nf1.epoch_time - boot.epoch_time < 4
AND nf2.epoch_time < nf1.epoch_time
AND nf2.epoch_time + nf2.duration >= nf1.epoch_time

This example, taken from the LANL Jupyter Notebooks data set and queries, illustrates the case of a complex query with multiple vertex and edge topology elements as well as a significant number of property constraints on them.

The TQL representation of a query like this is a declarative form. There are multiple valid orders in which the topology elements may be visited and their properties evaluated against the constraints. For example, in addition to the order provided by the user, some other valid orders include:

(c)<-[nf2:Netflow]-(b)<-[nf1:Netflow]-(a)
    -[boot:HostEvents]->(a)-[program:HostEvents]->(a)

(b)<-[nf1:Netflow]-(a)-[boot:HostEvents]->(a)
    -[program:HostEvents]->(a),
(b)-[nf2:Netflow]->(c)

(a)-[program:HostEvents]->(a)-[nf1:Netflow]->(b)
   -[nf2:Netflow]->(c),
(a)-[boot:HostEvents]->(a)

To efficiently execute a query, xGT must decide on a particular order in which to visit the topology elements specified in the query as well as evaluate their property values to see if they satisfy the specified constraints. Query order optimization is the process by which xGT decides on a particular sequence of operations to execute that implement the query as described by the user in the TQL declarative form.

xGT uses a statistical approach that samples the large data graph as it is being populated with data. Samples are captured related to the topological connection of vertices and edges from different frames, as well as property value frequencies in the frames.

These statistical samples are used to estimate, for example, whether it will be faster to start a query by iterating over a vertex frame with just a few hundred instances or an edge frame with a large number of instances, but with a property constraint that limits the matching edges to just ten edges. Many possible orders are considered and the plan with the lowest cost estimate is the order that is used.

Query order optimization is turned on by default but can be turned off by changing the optimization level for a query. If it is turned off, xGT will use the order provided by the user in the query string. To turn off query optimization for a single query, give an optimization level of 3 or less when starting a job with run_job() or schedule_job(). Use optimization level 4 to enable order optimization.

The query order optimization will be most effective if statistical samples of the graph have been collected. This statistical sampling can be turned off and on using the configuration setting metrics.cache described in Configuring the Server, but it is on by default. If turned on, statistics collection will occur in the background after data is loaded or modified. If turned off, query order optimization can still occur, but the query plan will be chosen with less information.

A user can run a query immediately after loading data while statistics are still being collected. If metrics used for order optimization are still being updated, a plan order will be chosen using incomplete statistical values.

2.19.1. Checking the Status of Statistics Collection

To ensure that statistics are updated before running a query, a user can check the status of metrics collection using wait_for_metrics(), which blocks until statistics are no longer being updated. The return value is true if metrics collection is turned on and has finished updating and false if either metrics collection is turned off or if the method returned due to a timeout. Below is an example usage.

#-- Load data into an existing frame --
nf = conn.get_edge_frame('Netflow')
nf.load('xgtd://edges.csv')

#-- Wait for statistics collection --
finished = conn.wait_for_metrics()

#-- Run a query with query order optimization on by default --
conn.run_job(query)

Another option is to use the non-blocking method get_metrics_status(), which returns a string describing the state of statistics collection. If metrics collection is turned off the return value will be metrics_off. If it is turned on and completed the return value will be metrics_completed. If statistics are still being computed, the return value will be metrics_running.

Note that statistics may be updated when the data is changed either due to a load(), an insert(), or running a TQL query that modifies the graph.