Introduction to the Trovares Query Language (TQL)
At this point you have loaded data into the system and defined vertex and edge types with their associated schemas in order to represent your data as a graph. This document discusses how to phrase and ask interesting queries on graph data stored in the xGT system.
The Trovares Query Language (TQL) uses a subset of the Cypher language to express queries. The supported subset of Cypher enables powerful and expressive queries, while taking advantage of xGT's strongly typed graph elements (fixed schemas) in order to achieve very high performance and scalability.
A TQL query consists of four components (with some of them optional):
- Required: structure (description of the "shape" of the pattern).
- Optional: constraints on the graph elements' properties.
- Required: description of the answer set.
- Optional: solution modifiers (including ordering & limiting results).
TQL's Cypher subset has specific syntax for each component of a query:
MATCH <structure>
WHERE <optional constraints on properties>
RETURN <description of the answer set>
<optional solution modifiers>
INTO <results table name>
Note that if there are no constraints supplied, the WHERE
keyword should
not be present in the query.
We describe each of these components in detail.
Structure
The basic concept behind querying graph data in xGT is the representation of what to search for in the graph. A structure is a template of what to match in the large data graph stored in xGT's system memory.
Using a structure description, xGT can then match that template to all components of the large data graph and extract those matching graph elements as part of the answer set to the query.
The structure, more precisely, describes the categories and topology of the graph elements that are part of the answer.
A category of a graph element is it membership in a particular vertex or edge type. Each instance of a vertex or edge in xGT belongs to one and only one type.
The topology of the graph elements describes how vertices and edges connect to each other. The topology is restricted by the types of the vertices and edges. The declaration of an edge type includes the types of vertices it connects (one type for the source of the edge and one for the target).
TQL's Cypher subset expresses the structure as a syntactic construct
in the MATCH
statement. The structure is described as a collection
of connected vertices and edges. Each vertex is represented by a ()
parentheses pair, while each edge is represented by a []
bracket
pair.
Inside each vertex or edge component, there are two elements that can
be provided. The first one is an annotation to indicate to which
type (category) the vertex or edge must belong. This is expressed
by the use of a colon followed by a TQL type name: (:Person)
or
[:WorksFor]
. In these examples, the vertex must belong to the
Person type and the edge must be of WorksFor type. An optional
variable name can be added to the vertex or edge to bind that graph
element to a particular name. Consider the examples of (a:Person)
and [b:WorksFor]
. In these cases the graph elements that match can
be referred to in the rest of the query by the names a
and b
. In
particular, the use of variables is required in order to specify
property constraints and results in the answer set.
In xGT, edges are directed and as such the structure must
indicate the edge direction desired in the resulting answer. Vertex
components of the pattern used as sources are indicated by the use
of a -
character connecting the vertex to the edge:
(:Person)-[:WorksFor]
. Vertex components of the pattern used as
targets are indicated by the use of the ->
character pair
connecting the edge to the vertex:
(:Person)-[:WorksFor]->(:Company)
. This example describes a pattern
that matches persons that work for a company. The WorksFor
edge
goes from a person to the company that person works for. The edge is
outgoing from the person vertex and it is incoming to the company
vertex. It is also possible to write the pattern in the reverse
direction: (:Company)<-[:WorksFor]-(:Person)
. In this case too, the
edge is incoming to the company and outgoing from the person. Using
the two different edge directions is a convenient facility when
composing larger patterns with multiple edge and vertex types.
Putting it all together
A structure in TQL is composed of one or more sequences of vertex-edge-vertex subsequences, where the types of the components must be compatible. An example of a longer pattern is as follows:
MATCH (:Company)<-[:WorksFor]-(a:Person)-[:FriendOf]->[b:Person]
In this example, we make use of several of TQL's pattern facilities, including forward and reverse directions in the edges, multiple vertex types (Person and Company) and multiple edge types (WorksFor and FriendOf). We also bind the two people in the query to variables "a" and "b".
Non-linear patterns
So far, we have discussed linear patterns in the sense that all vertices and edges in the pattern follow a sequence in the graph's topology. Describing more complex patterns is done using the "," (comma) operator. The best way to think about these patterns is that they "branch off" from a linear pattern into another part of the graph.
For example:
MATCH (:Person)-[:WorksFor]->(c:Company)-[:IsLocated]->(:City),
(c)-[:CompetesAgainst]->(:Company)
In this case, the first linear pattern describes people working for
companies of interest and where they are located. The second linear
pattern is connected to the first one by the intermediate Company
vertex. The connection of the two patterns is established by the use
of the bound variable c
in the second linear pattern. The use of a
previously bound variable is required as part of the first vertex in
the second and subsequent linear patterns, in order to establish the
connectivity of the patterns.
More complex structure patterns can consist of multiple connected linear patterns in this fashion.
Constraints on properties
We have discussed how to express the topological part of a TQL query using structures. We now discuss how to express constraints on the data stored in the components of the graph.
Recall that each graph component in xGT has an associated schema with named properties. Each property in the schema has an associated data type: e.g. integer numbers, strings, floating-point numbers, dates, etc.
Constraints on the properties of the graph elements are described in
the WHERE
clause of a TQL MATCH
query. The WHERE
clause
consists of expressions involving the properties of the graph
elements, combined with constants (of the appropriate data type) and
commonly used comparison, arithmetic and boolean operators.
The combination of the structure and constraints in a query is called the graph pattern in TQL.
Properties are referred to by using the "." (dot) operator in between
the name of a bound variable and the name of a property: a.name
would access the property named "name" of the graph element bound to
the variable "a".
Property expressions can be combined with other property expressions and constants using TQL's Cypher operator subset.
Supported operators are as follows:
- Arithmetic:
+
,-
,*
,/
,%
(modulus), (unary)-
- Boolean:
AND
,OR
,NOT
- Comparison:
=
(equality),<>
(difference),<
,>
,<=
,>=
,IS NULL
,IS NOT NULL
- String:
STARTS WITH
,ENDS WITH
,CONTAINS
- Constant collection:
IN
Constants are supported for the following data types:
- Integer numbers
- Floating-point numbers (32 bit)
- Boolean true and false
- String constants (surrounded by double quotes)
- Null constant (
NULL
) - Dates
- Times
- Datetimes
- IP addresses
Parentheses ()
can be used to indicate precedence when dealing
with nested expressions.
Examples of WHERE
constraints are as follows:
WHERE p.name = "John" AND p.age < 40
WHERE c.value > 10.0 OR c.value < 2.5
WHERE (p.name STARTS WITH "D") AND (p.address IS NOT NULL)
WHERE (c.value > 10.0) OR (c.value < d.value)
Functions in constraint expressions
TQL provides a set of functions that can be used in constraint expressions. These functions include degree operations on vertices and a convenience form of expressing that vertices must be unique.
The degree functions can take one or two arguments. The first argument is always a bound variable to a vertex component in the structure. The optional second argument is the name of an edge type in xGT.
Degree computations without the optional edge type name are absolute, in the sense that the degree of the bound vertex is computed across all edge types in xGT. When using the optional edge type name parameter, the degree computation becomes relative to that edge type. That is, the degree of the vertex is computed only for edges of the named type.
TQL supports the following degree functions:
indegree()
: returns an integer value with the number of incoming edges to the specified bound vertex variable.outdegree()
: return an integer value with the number of outgoing edges from the specified bound vertex variable.
Examples are as follows:
WHERE indegree(a) = 10
WHERE outdegree(b) > 10
WHERE (outdegree(c, FriendOf) + outdegree(c, WorksFor)) < 5
By default xGT and TQL do not impose restrictions on the identity of the vertices and edges in a structure. In particular, cycles are allowed (graph paths from one vertex to the same vertex) and will be reported if present in the data. There are cases where the identity of the vertices in a query is not important, but there are also cases in which at least some of the vertices must be unique: that is they must be different from each other.
It is simple to express vertex differences with just a couple of
vertices: a <> b
. But if we have more vertices say a
, b
, c
and d
then expressing all the difference constraints becomes tedious
and error-prone: a <> b AND a <> c AND a <> d AND b <> c AND b <> d
...
. For this reason, TQL provides a shortcut to this common case.
The function unique_vertices()
can be added as part of a WHERE
clause with its arguments being the bound variables for vertices that
must be different from each other. The xGT system automatically
generates all difference constraints and appends them to the
user-provided constraints in the query.
Constant expressions in a TQL query
xGT and TQL support expressing constants of numerical and string types directly in a query. Given that xGT supports properties of date, time, datetime and IPADDRESS types, TQL must provide a mechanism to express constraints based on constants of those types.
The functions date()
, time()
, datetime()
and ipaddress()
let
the TQL query represent constants of the corresponding types built
from string expressions in an appropriate format.
The following are the formats that are supported for each constant type:
date()
: The string must be in the format YYYY-MM-DD.time()
: The string must be in the format HH:MM:SS.uuuu where the precision is supported up to microseconds.datetime()
: The string must be in the format YYYY-MM-DDTHH:MM:SS, where the date and the time are separated by a fixed "T character.ipaddress()
: The string must be in the format IP0.IP1.IP2.IP3 where each position of the IP address is limited to a number between 0 and 255.
Examples of the use of these constant expressions are as follows:
WHERE a.date > date("2018-01-01")
WHERE a.time = time("16:00:00.0000")
WHERE b.datetime <> datetime("2017-12-31T00:00:00")
WHERE b.ipaddr = ipaddress("192.168.1.1")
Note that >
, <
, >=
and <=
operators are supported for date,
time and datetime properties and constants but not for IP addresses.
IP addresses only support =
and <>
comparisons.
What is an answer to a TQL query?
Now that we have discussed how to describe a structure as well as property constraints on the graph elements, we can define what is an answer to a TQL query:
-
An match to a TQL query is a collection of graph elements that satisfy both the topological and type constraints in the structure as well as the property constraints in the
WHERE
clause. The graph elements must be connected in the manner specified by the structure. -
An answer is described as a sequence of information gathered from a match. This can be thought of as a row in a table.
-
An answer set to a TQL query is the set of all answers found in the graph data for a given query.
Describing the answer to a TQL query
The RETURN
clause in a TQL MATCH
query describes what the answer
set should contain for each match. The RETURN
clause lists a sequence of
expressions that will be returned in the form of rows in a table.
Each expression in the sequence is mapped consecutively to a column of
the results table.
The returned expressions can be any combination of properties from
bound variables in the query, constants, TQL operators and function
results. Each column can be named using an alias name. A convenient
shorthand for returning all properties in a graph element is to list
its bound variable as part of the RETURN
clause.
Examples of RETURN
clauses are:
RETURN a.name, (a.age + 10) AS ageplus, indegree(a) AS indeg
RETURN a, b, c, d
Modifying how the answer set is reported
As we have described in the previous sections, the result of a TQL query is a table with one row per answer in the answer set. By default, xGT inserts resulting rows into this table in the order that it finds them, which can be arbitrary and vary from execution to execution due to performance considerations.
In many cases, this is satisfactory since the content of the answer set is more important than the way it is represented in the results table. In other cases, a preferred representation of the results table is required.
TQL supports the following solution modifiers to the answer set as represented in the results table:
DISTINCT
: applies to the entire answer set and requests that all rows inserted into the table are unique across the set.ORDER BY
: requests a sorted representation of the answer set, whereas the rows in the table are sorted by a key. The key can be a combination of multiple expressions and need not be a part of the expressions in theRETURN
clause. Each expression in the order by clause can indicate whether to sort in ascending or descending (DESC
) order.LIMIT
: requests that the xGT engine produce only the top k rows of the answer set.SKIP
: requests that the xGT engine skip over the first k rows of the answer set.
Solution modifiers are very useful when used in combinations, for
example the combination of ORDER BY
with LIMIT
is very useful when
doing initial exploration of a data set, since it reports the top k
rows according to some sorting criteria.
Aggregation functions
In addition to sorting or reporting unique rows, sometimes it is very useful to aggregate the results of a query into a smaller set of elements. TQL's aggregation functions provide a mechanism to do that.
As part of the expressions in the RETURN
clause, a query can specify
that instead of reporting each element in the answer set, they should
be aggregated together. For example, the aggregation function SUM()
can add together all instances of a numerical expression provided as
its argument. The aggregation function COUNT(*)
reports the size of
the answer set, instead of reporting each individual element in it.
The following aggregation functions are supported by TQL:
COUNT(*)
: reports the number of elements in the answer set.SUM()
: returns the sum of all values of a numerical expression in the answer set.MIN()
: returns the smallest of all values of an expression in the answer set.MAX()
: returns the largest of all values of an expression in the answer set.AVG()
: returns the numerical average of all values of an expression in the answer set.
Examples of their usage are as follows:
RETURN COUNT(*) AS total
RETURN AVG(a.age)
RETURN SUM(p.saleprice - p.productioncost) AS profit
RETURN MAX(a.dob) AS youngest, MIN(a.dob) AS oldest
Grouping results together by keys
While TQL and Cypher do not have an explicit GROUP BY clause or operation like other query languages, there is a common idiom used to express aggregation over groups of elements with the same key.
The combination of non-aggregated return expressions with aggregation functions results in effecting a grouping operation where the non-aggregated expressions become the keys of the group. For example:
RETURN a.id, COUNT(*)
In this case, each group in the answer set will be identified by the
value of a.id
and instead of reporting each element in the group,
the size of the group (COUNT(*)
) is reported. Note that the
scope of the aggregation function is restricted to the elements of
each group.
The number of expressions in the key and the number of aggregation functions is not restricted by TQL. Examples are as follows:
RETURN a.firstname, a.lastname, COUNT(*)
RETURN b.yearofbirth, AVG(b.salary), MIN(b.salary), MAX(b.salary)