Introduction to the Trovares Query Language (TQL)

At this point you have loaded data into the system and defined vertex and edge frames with their associated schemas in order to represent your data as a graph. This document discusses how to phrase and ask interesting queries on graph data stored in the xGT system.

The Trovares Query Language (TQL) uses a subset of the Cypher language to express queries. The supported subset of Cypher enables powerful and expressive queries, while taking advantage of xGT's strongly typed graph elements in order to achieve very high performance and scalability.

A TQL query consists of several components (some of which are optional):

TQL's Cypher subset has specific syntax for each component of a query:

MATCH <structure>
WHERE <optional constraints on properties>
SET <optional property modifications>
MERGE <optional additions of vertices>
CREATE <optional additions of vertices and edges>
DETACH DELETE <optional deletions of vertices>
DELETE <optional deletions of edges>
RETURN <description of the answer set>
<optional solution modifiers>
INTO <results table name>

Note that if there are no constraints supplied, the WHERE keyword should not be present in the query.

We describe each of these components in detail.

Structure

The basic concept behind querying graph data in xGT is the representation of what to search for in the graph. A structure is a template of what to match in the large data graph stored in xGT's system memory.

xGT uses the query structure as a template to match elements of a large data graph and extract them into an answer set.

More precisely, the structure describes the type and topology of the graph elements that are part of the answer.

The type of a graph element is its membership in a particular vertex or edge frame. Each instance of a vertex or edge in xGT can belong to only one frame.

A vertex frame is a collection of vertices that have the same property names and types. We can think of these vertices as representing the same type of entity. For example, in an employment graph, all vertices representing a person may belong to a Person frame, while all vertices representing a company may belong to a Company frame. Similarly, an edge frame is a collection of edges that share the same property names and types.

The topology of the graph elements describes how vertices and edges connect to each other. The topology is restricted by the types of the vertices and edges. The declaration of an edge frame includes the frames of vertices it connects (one frame for the source of the edge and one for the target).

TQL's Cypher subset expresses the structure as a syntactic construct in its MATCH statements. The structure is described as a collection of connected vertices and edges. Each vertex is represented by a variable enclosed in parentheses (), while each edge is represented by a variable surrounded by brackets [].

Inside each vertex or edge component, there are two elements that can be provided. The first one is an annotation to indicate to which frame (type) the vertex or edge must belong. This is expressed by the use of a colon followed by a TQL type name: (:People) or [:WorksFor]. In these examples, the vertex must belong to the People frame and the edge must belong to the WorksFor frame. An optional variable name can be added to the vertex or edge to bind that graph element to a particular name. Consider the examples of (a:People) and [b:WorksFor]. In these cases the graph elements that match can be referred to in the rest of the query by the names a and b. If a variable is not given a name, it will force the specified structure but will remain anonymous. It cannot be used in value constraints, involved in graph cycles in the query, and its properties cannot be extracted in answer set.

In xGT, edges are directed, and the structure must indicate the edge direction desired in the resulting answer. Vertex components of the pattern used as sources are indicated by the use of a - character connecting the vertex to the edge: (:People)-[:WorksFor]. Vertex components of the pattern used as targets are indicated by the use of the -> character pair connecting the edge to the vertex: (:People)-[:WorksFor]->(:Companies). This example describes a pattern that matches persons that work for a company. The WorksFor edge goes from a person to the company that person works for. The edge is said to be outgoing from the person vertex and incoming to the company vertex. It is also possible to write the same pattern in the reverse direction: (:Companies)<-[:WorksFor]-(:People). Despite appearing reversed on paper, the query described is identical. Using the two different edge directions is a convenient facility when composing larger patterns with multiple edge and vertex frames.

Putting it all together

A structure in TQL is composed of one or more sequences of vertex-edge-vertex subsequences, where the types of the components must be compatible. An example of a longer pattern is as follows:

MATCH (:Companies)<-[:WorksFor]-(a:People)-[:FriendOf]->[b:People]

In this example, we make use of several of TQL's pattern facilities, including forward and reverse directions in the edges, multiple vertex frames (People and Companies) and multiple edge frames (WorksFor and FriendOf). We also bind the two people in the query to variables a and b.

Non-linear patterns

So far, we have discussed linear patterns in the sense that all vertices and edges in the pattern follow a sequence in the graph's topology. Describing more complex patterns is done using the , (comma) operator. The best way to think about these patterns is that they "branch off" from a linear pattern into another part of the graph.

For example:

MATCH (:People)-[:WorksFor]->(c:Companies)-[:IsLocated]->(:Cities),
      (c)-[:CompetesAgainst]->(:Companies)

In this case, the first linear pattern describes people working for companies of interest and where they are located. The second linear pattern is connected to the first one by the intermediate company vertex c. More complex structure patterns can consist of multiple connected linear patterns in this fashion.

Constraints on properties

We have discussed how to express the topological part of a TQL query using structures. We now discuss how to express constraints on the data stored in the components of the graph.

Recall that each graph component in xGT has an associated schema with named properties. Each property in the schema has an associated data type.

Constraints on the properties of the graph elements are described in the WHERE clause of a TQL query. The WHERE clause consists of expressions involving the properties of the graph elements, combined with constants (of the appropriate data type) and commonly used comparison, arithmetic, and boolean operators.

The combination of the structure and constraints in a query is called the graph pattern in TQL.

Properties are referred to by using the . (dot) operator in between the name of a bound variable and the name of a property: a.name would access the property named name of the graph element bound to the variable a.

Property expressions can be combined with other property expressions and constants using TQL's Cypher operator subset.

Supported operators are as follows:

Constants are supported for the following data types:

Parentheses () can be used to indicate precedence when dealing with nested expressions.

Examples of WHERE constraints are as follows:

WHERE p.name = "John" AND p.age < 40
WHERE c.value > 10.0 OR c.value < 2.5
WHERE (p.name STARTS WITH "D") AND (p.address IS NOT NULL)
WHERE (c.value > 10.0) OR (c.value < d.value)

Functions in constraint expressions

TQL provides a set of functions that can be used in constraint expressions. These functions include degree operations on vertices and a convenience form of expressing that vertices must be unique.

The degree functions can take one or two arguments. The first argument is always a bound variable to a vertex component in the structure. The optional second argument is the name of an edge frame in xGT.

Degree computations without the optional edge frame name are global, in the sense that the degree of the bound vertex is computed across all edge frames in xGT. When using the optional edge frame name parameter, the degree computation becomes relative to that particular edge frame. That is, the degree of the vertex is computed only for edges of the named frame.

TQL supports the following degree functions:

Examples are as follows:

WHERE indegree(a) = 10
WHERE outdegree(b) > 10
WHERE (outdegree(c, FriendOf) + outdegree(c, WorksFor)) < 5

By default xGT and TQL do not impose restrictions on the identity of the vertices and edges in a structure. In particular, cycles are allowed (graph paths from one vertex to the same vertex) and will be reported if present in the data. There are cases where the identity of the vertices in a query is not important, but there are also cases in which at least some of the vertices must be different from each other.

It is easy to express distinct vertices with just two: a <> b, but if we have more—say a, b, c and d—then expressing all the pairwise constraints becomes tedious and error-prone: a <> b AND a <> c AND a <> d AND b <> c AND b <> d AND c <> d. For this reason, TQL provides a shortcut. The function unique_vertices() can be added as part of a WHERE clause with its arguments being the bound variables for vertices that must be distinct from each other. xGT automatically generates all the necessary constraints and adds them to any other user-provided constraints in the query.

TQL also supports the Cypher-standard function toString() that converts any data type into a string representation. toString() can be applied to properties and expressions of any type. Some examples are toString(a.property), toString(10.5 + a.float_property). The toString() function can be used in any context where a string expression is valid, including WHERE, RETURN, ORDER BY, CREATE, MERGE and SET expressions.

Constant expressions in a TQL query

xGT and TQL support expressing numerical and string constants directly in a query. TQL also provides mechanisms for expressing constraints based on date, time, datetime, and IP address types. The functions date(), time(), datetime() and ipaddress() let the TQL query represent constants of the corresponding types built from string expressions in an appropriate format.

The following are the formats that are supported for each constant type:

Examples of the use of these constant expressions are as follows:

WHERE a.date > date("2018-01-01")
WHERE a.time = time("16:00:00.000000")
WHERE b.datetime <> datetime("2017-12-31T00:00:00")
WHERE b.ipaddr = ipaddress("192.168.1.1")

Note that =, <>, >, <, >= and <= operators are supported for date, time, and datetime properties and constants. However, IP addresses only support = and <> comparisons.

Modifications, additions and deletions

These optional parts of a query allow for changes to happen to the graph as part of its execution.

Property modifications

Modifications involve changing the values of properties in specific vertex and/or edge instances that match the graph pattern declared in the query. Note that the properties themselves must have been declared as part of the respective frame creation.

The power of these property modifications is that the values to modify them to can be computed from the query itself. That is, the user can programmatically determine what to change the values to as part of the query. A simple example of this would be computing a duration for each edge, assuming that all of the edges have a start_time and end_time property:

MATCH ()-[e:EdgeFrame]->()
SET e.duration = e.end_time - e.start_time
RETURN count(*)

In this case, all edges belonging to the frame EdgeFrame are visited (there are no WHERE constraints) and duration is computed from two of the other properties on that edge. Note that the schema for EdgeFrame must already include a duration property (even if the values are null).

Topology additions

Additions to the graph can also be done as part of query execution. It is possible to add new vertices and/or edges as part of a query.

Additions of new vertices is simpler since it involves providing the name of the vertex frame to add the vertex to as well as at least enough values for the key properties of the vertex:

MATCH ()-[e:EdgeFrame]->()
CREATE (v:VertexFrame { id: e.source_id + 1000, data: "test" })
RETURN DISTINCT e.source_id

In this simple example a new vertex is created for each unique source endpoint of an edge in the EdgeFrame frame. The value of the vertex key is computed from the key of the source endpoint of the edge. We use the DISTINCT keyword to guarantee that the values of the endpoints are unique and thus no duplicate vertex creation is attempted (it's an error to try and create a vertex with the same key value as an existing vertex in the same frame).

In addition to supporting the CREATE keyword, TQL supports the MERGE keyword which indicates to the system that the vertex should be created if it does not exist or retrieved from the current data store if it does. As with the CREATE keyword the MERGE keyword requires the specification of at least the key properties for the merged vertex.

Adding an edge involves matching existing vertices to use as endpoints for the edge being created:

MATCH (a:VertexFrame)-(:EdgeFrame)->(b:VertexFrame)
WHERE a.id > 0 AND b <> a
CREATE (a)-[new_edge:EdgeFrame { float_count: -0.5, data: 10 }]->(a)
RETURN a, b

In this example, we create a self-edge from a to itself and add it to the edge frame EdgeFrame, for distinct pairs of vertices a and b, where a's identifier is positive and there is an edge already in EdgeFrame from a to b. Note that the specification of the properties of the newly created edge cannot include the values of key properties—those are derived from values of the key properties of the endpoints.

The endpoints of the created vertex can be any pair of vertices that is compatible with the declaration of the edge frame (that is they are of the right type for source and target of the edge frame). The only requirement is that the endpoints have to be matched as part of the query. For convenience, the direction of the edge can be specified in either sense: source-to-target or target-from-source, with the resulting created edge being the same.

While xGT does not currently support the creation of the vertex endpoints of an edge and the edge itself in the same query statement, combining MERGE on a vertex and CREATE on the edge can be used to achieve the same effect:

MATCH (a:VertexFrame)
WHERE a.id >= 0
MERGE (b:VertexFrame { id: -1, data: "negative_vertex" })
CREATE (b)-[new_edge:EdgeFrame { float_count: -0.5, data: 10 }]->(a)
RETURN count(*)

In this example, we match an existing vertex a with non-negative key property id, then create a vertex with the key id (if necessary), and finallly use that vertex to create a new edge from it to the originally-matched a vertex. Note that MERGE is only supported for vertices because multiple edges can exist for the same key property values. It would be ambiguous if more than one edge matched.

Topology deletions

Deletions from the graph can also be done as part of query execution. It is possible to delete vertices and/or edges as part of a query.

Deleting edges is a simpler process than deleting vertices, because it is localized to the affected edge frame:

MATCH ()-[e:EdgeFrame]->()
WHERE e.sid = 0 AND e.tid = -1
DELETE e
RETURN e

In this case, the matched edges are removed from the edge frame as part of the query execution. Note that it is still possible to return values from the deleted edges since deletion occurs after matches have already been recorded (as in this example).

Deleting a vertex from a vertex frame is a more involved process since there could still be edges incident on that vertex across a collection of different frames.

MATCH (a:VertexFrame)
WHERE a.id = 0
DETACH DELETE a
RETURN a

TQL uses DETACH DELETE (as opposed to just DELETE) to delete a vertex and all adjacent edges. In this case, even if just the vertex frame VertexFrame is specified in the query, xGT will have to look into all edge frames where VertexFrame is a source or a target and delete edges from them that are incident on the matching vertex a.

Visibility of changes to the graph

Property modifications, additions and deletions from the graph are not immediately visible in the same query in which they are executing. They are applied to the graph and all its frames as part of the final steps in query execution after the pattern matching and query results have been recorded.

They are fully available in subsequent queries. More information is provided in the document titled Transactions in xGT.

What is an answer to a TQL query?

Now that we have discussed how to describe a structure as well as property constraints on the graph elements, we can define what is an answer to a TQL query:

Describing the answer to a TQL query

The RETURN keyword in a TQL query describes what the answer set should contain for each match. Each answer will be recorded as a row in the results table, and the sequence of expressions in the RETURN clause will each be returned as a column.

The returned expressions can be any combination of properties from bound variables in the query, constants, TQL operators, and function results. Each column can also be renamed using an alias name. A convenient shorthand for returning all properties in a graph element is to list its bound variable as part of the RETURN clause.

Examples of RETURN clauses are:

RETURN a.name, (a.age + 10) AS ageplus, indegree(a) AS indeg
RETURN a, b, c, d

Modifying how the answer set is reported

As we have described in the previous sections, the result of a TQL query is a table with one row per answer in the answer set. By default, xGT inserts resulting rows into this table in the order that it finds them, which can be arbitrary and vary from execution to execution.

In many cases, this is satisfactory since the content of the answer set may be more important than the way it is represented in the results table. In other cases, a different representation of the results table is required.

TQL supports the following solution modifiers to the answer set as represented in the results table:

Solution modifiers are very useful when used in combinations, for example the combination of ORDER BY with LIMIT is very useful when doing initial exploration of a data set, since it reports the top k rows according to some sorting criteria.

Aggregation functions

In addition to sorting or reporting unique rows, sometimes it is very useful to aggregate the results of a query into a smaller set of elements. TQL's aggregation functions provide a mechanism to do that.

As part of the expressions in the RETURN clause, a query can specify that instead of reporting each element in the answer set, they should be aggregated together. For example, the aggregation function sum() can add together all instances of a numerical expression provided as its argument. The aggregation function count(*) reports the size of the answer set, instead of reporting each individual element in it.

The following aggregation functions are supported by TQL:

Examples of their usage are as follows:

RETURN count(*) AS total
RETURN avg(a.age)
RETURN sum(p.sale_price - p.production_cost) AS profit
RETURN max(a.dob) AS youngest, min(a.dob) AS oldest

Grouping results together by keys

While TQL and Cypher do not have an explicit grouping operation like some other query languages, there is a common idiom used to express aggregation over groups of elements with the same values.

The combination of non-aggregated return expressions alongside aggregation functions produces results in applying the aggregation function to groups of answers with identical values for the non-aggregated values. For example:

RETURN a.id, count(*)

In this case, each group in the answer set will be identified by the value of a.id and instead of reporting each element in the group, the size of the group (count(*)) is reported. Note that the scope of the aggregation function is restricted to the elements of each group.

The number of expressions in the key and the number of aggregation functions is not restricted by TQL. Examples are as follows:

RETURN a.firstname, a.lastname, count(*)
RETURN b.year_of_birth, avg(b.salary), min(b.salary), max(b.salary)

Specifying the results table

The answer to a query is returned in the form of a table frame. TQL requires specifying the name of the table frame in which to return the query results using the INTO keyword. Here is a simple example that filters the edges in the edge frame EdgeFrame and returns the results into the table frame QueryResults:

MATCH ()-[e:EdgeFrame]->()
WHERE e.sid = 0 AND e.tid = -1
RETURN e
INTO QueryResults

If the table named in the INTO clause doesn't already exist, it will be created and populated with the results of the query. If the table exists and the table schema types match the types of the implied schema for the RETURN clause, the query results will be appended to the already existing table. If the table exists and the table schema types don't match the types of the implied schema for the RETURN clause, xGT will report an error.

Querying over a table frame

TQL supports the additional functionality of querying against a table. This can be used, for instance, to do some additional exploration of a result table acquired from a graph query.

A table query is performed by using the MATCH keyword with just a single element (no edges) using the syntax for vertices: parentheses (). Any of the constraints or aggregation functions can be performed in such a query. For example we could look at the result of a previous query and count all results of persons over a certain age:

MATCH (row:Result)
WHERE row.age > 50
RETURN count(*)