2.8. TQL Structure

The basic concept behind querying graph data in xGT is the representation of what to search for in the graph. A structure is a template of what to match in the large data graph stored in xGT’s system memory. xGT uses this template to match elements of a large data graph and extract them into a results set.

More precisely, the structure describes the type and topology of the graph elements that are part of the answer.

The graph element’s membership in a particular vertex or edge frame defines its type. Each instance of a vertex or edge in xGT can belong to only one frame.

A vertex frame is a collection of vertices that have the same property names and types. We can think of these vertices as representing the same type of entity. For example, in an employment graph, all vertices representing a person may belong to a Person frame, while all vertices representing a company may belong to a Company frame. Similarly, an edge frame is a collection of edges that share the same property names and types.

The topology of the graph elements describes how vertices and edges connect to each other. The topology is restricted by the types of the vertices and edges. The declaration of an edge frame includes the frames of vertices it connects (one frame for the source of the edge and one for the target).

TQL’s Cypher subset expresses the structure as a syntactic construct in its MATCH statements. The structure is described as a collection of connected vertices and edges. Each vertex is represented by a variable enclosed in parentheses (), while each edge is represented by a variable surrounded by brackets [].

Inside each vertex or edge component, there are two elements that can be provided. The first one is an annotation to indicate to which frame (type) the vertex or edge must belong. This is expressed by the use of a colon followed by a qualified frame name: (:career__People) or [:career__WorksFor]. In these examples, the vertex must belong to the People frame and the edge must belong to the WorksFor frame. Note that in both cases the frames are stored in the career namespace. The second is an optional variable name that can be added to the vertex or edge to bind that graph element to a particular name. Consider the examples of (a:People) and [b:WorksFor]. In these cases the graph elements that match can be referred to in the rest of the query by the names a and b. If a variable is not given a name, it will force the specified structure but will remain anonymous. It cannot be used in value constraints, involved in graph cycles in the query, and its properties cannot be extracted in the results set.

In xGT, edges are directed, and the structure can indicate the edge direction desired in the resulting answer. Vertex components of the pattern used as sources are indicated by the use of a - character connecting the vertex to the edge: (:People)-[:WorksFor]. Vertex components of the pattern used as targets are indicated by the use of the -> character pair connecting the edge to the vertex: (:People)-[:WorksFor]->(:Companies). This example describes a pattern that matches persons that work for a company. The WorksFor edge goes from a person to the company that person works for. The edge is said to be outgoing from the person vertex and incoming to the company vertex. It is also possible to write the same pattern in the reverse direction: (:Companies)<-[:WorksFor]-(:People). Despite appearing reversed on paper, the query described is identical. Using the two different edge directions is a convenient facility when composing larger patterns with multiple edge and vertex frames.

Edges can be specified as undirected when the source and target vertex frames for a specific edge frame are the same (this is a current implementation restriction). An example would be (a:People)-[:FriendOf]-(b:People). In this case, the pattern is satisfied by either having and edge from a to b or b to a in the “FriendOf” edge frame.

The structure becomes optional when queries just want to modify a few components of the graph. It is possible to create new edges and vertices without specifying a structure. However, the properties of the newly created edges and vertices would be restricted to expressions over constants. More information about graph element creation is described in Topology Additions.

2.8.1. Putting It All Together

A structure in TQL is composed of one or more sequences of vertex-edge-vertex subsequences, where the types of the components must be compatible. An example of a longer pattern is as follows:

MATCH (:Companies)<-[:WorksFor]-(a:People)-[:FriendOf]->(b:People)

In this example, we make use of several of TQL’s pattern facilities, including forward and reverse directions in the edges, multiple vertex frames (People and Companies) and multiple edge frames (WorksFor and FriendOf). We also bind the two people in the query to variables a and b.

2.8.2. Non-linear Patterns

So far, we have discussed linear patterns in the sense that all vertices and edges in the pattern follow a sequence in the graph’s topology. Describing more complex patterns is done using the , (comma) operator. The best way to think about these patterns is that they “branch off” from a linear pattern into another part of the graph.

For example:

MATCH (:People)-[:WorksFor]->(c:Companies)-[:IsLocated]->(:Cities),
      (c)-[:CompetesAgainst]->(:Companies)

In this case, the first linear pattern describes people working for companies of interest and where they are located. The second linear pattern is connected to the first one by the intermediate company vertex c. More complex structure patterns can consist of multiple connected linear patterns in this fashion.

xGT has a constraint on non-linear patterns that doesn’t exist in Cypher. A subsequent linear path after a comma operator must have a connecting anchor vertex occurring somewhere in the path. The following example is valid in Cypher, but xGT cannot process it currently since the only named vertex b in the second path is not connected to the first path. xGT requires that all paths in a non-linear pattern be connected to each other via at least one anchor vertex. In graph theoretical terms, all paths must form a “connected component”. No path can be outside the connected component.

MATCH (:People)-[:WorksFor]->(c:Companies)-[:IsLocated]->(:Cities),
      (:Companies)<-[:CompetesAgainst]-(c)

TQL also supports multiple MATCH statements in the same query section. The previous example can be written as follows:

MATCH (:People)-[:WorksFor]->(c:Companies)-[:IsLocated]->(:Cities)
MATCH (:Companies)<-[:CompetesAgainst]-(c)

Multiple MATCH statements correspond to the same semantics as non-linear paths separated by the comma operator and must obey the same rules.

2.8.3. Variable-length Edge Traversal

TQL supports specifying that a particular edge in the pattern could be traversed more than once. This is known as a variable-length edge. A variable-length edge is specified by indicating the minimum and maximum number of edges that can match this element in the path before the query moves on to the rest of the path.

[:CompetesAgainst *1..5]

In this example, [:CompetesAgainst] can involve from one up to five edge traversals before the query moves on to the rest of the path. The lower bound of the traversal must be strictly less than the upper bound. The lower bound must also be greater or equal to one.

Of particular importance is that the edge frame CompetesAgainst maps vertices in frame Companies to vertices in the same frame. That is the source and target vertex frames are the same. This is a requirement of xGT’s support for variable-length edge traversal.

Variables cannot be bound to a variable-length edge. The following is invalid in TQL: [c:CompetesAgainst *1..2]. The reason for not allowing bound variables is that it would be ambiguous to identify which edge does the variable c in the example refer to. It could be the first edge, the second edge or potentially either in different contexts.

During execution of the variable-length edge, the target vertex instance of the traversal is mapped back to the source of the next traversal, thus requiring the vertices to be part of the same vertex frame. This is a current limitation of xGT’s support for variable-length edges and not part of the Cypher language.

If multiple traversals of a variable-length edge result in continuing the rest of the query’s execution, then a variable-length edge could result in several matches to the query. In the example above it could happen that a single traversal (one edge) results in a match to the query, in addition to a traversal of four edges results in another match to the query. Answers can be produced by traversals of any of the lengths indicated by the variable edge’s bounds.

The minimum lower bound for a variable-length traversal is one and the upper bound must be specified. Unbounded traversals are not supported by xGT.

Undirected variable-length edge traversal is supported when the source and target vertex frames are the same. An example of this is the following: (a:People)-[:FriendOf *1..6]-(b:People). In this case, a and b are part of the answer if they have friends in common using from one to six indirect friendships (degrees of separation).

2.8.4. Multiple Edge Frame Traversal

TQL supports specifying multiple edge frames that can be traversed in a pattern. This is specified by chaining together edge frame names with the pipe symbol: | .

MATCH (:Companies)-[:CompetesAgainst | :PartneredWith]->(:Companies)

In the above example the MATCH statement will find patterns with relationships of either the [:CompetesAgainst] or [:PartneredWith] frames. The query may return matches from traversals along either or both of the given edge frames. This can also be used in combination with variable-length edge traversal.

MATCH (:People)-[:WorksFor | :CompetesAgainst | :PartneredWith *1..2]->(:Companies)

In the above example the MATCH statement will find one or two length edge traversals where the source vertex frame is (People) and the final target vertex frame is (:Companies). The intermediary edge frames can belong to any frame specified in the MATCH statement. These intermediary edge frames are dynamically selected at runtime. For the above example this amounts to only three possible orderings of edge frames:

(:People)-[:WorksFor]->(:Companies)
(:People)-[:WorksFor]->(:Companies)-[:CompetesAgainst]->(:Companies)
(:People)-[:WorksFor]->(:Companies)-[:PartneredWith]->(:Companies)

Multiple edge frame traversal can be done with any number of edge frames and any upper or lower bound. It should be noted that this could result in a very large number of valid traversals. If the MATCH statement is written such that there are no valid traversals, there will be no matches returned.

Variables can be bound to multiple edge traversal steps when there are NO variable-length bounds associated with the step. Consider this example:

MATCH (:Companies)-[b:CompetesAgainst | :PartneredWith]->(:Companies)

In this case, the variable b is bound against the matching edges in the CompeteAgainst or PartneredWith frames.

Properties can be extracted from the bound variable b as described in section Property Extraction from Multiple Candidate Frames.

2.8.5. Path Variable Capture

TQL supports capturing the components of a traversed path in a MATCH query. Both traversed edges and vertices are captured and stored in the named path variable as shown below:

MATCH p = (:People)-[:WorksFor]->(c:Companies)-[:IsLocated]->(:Cities)

In this example, the path variable p will store the specific People vertex, WorksFor edge, Companies vertex, IsLocated edge and final Cities vertex.

Path variables are restricted to capturing only individual linear patterns in a MATCH query. If there are separate linear patterns, they can be captured into different path variables.

MATCH p = (:People)-[:WorksFor]->(c:Companies),
      q = (c)-[:CompetesAgainst]->(:Companies)

In this example, p will capture the first linear pattern and q captures the second linear pattern going from the same Companies vertex, traversing a CompetesAgainst edge and arriving at possibly another Companies vertex.

A more interesting use case of path variables is the traversal of variable-length edges and of multiple edge frames.

MATCH p = (:Companies)-[:CompetesAgainst *1..5]->(:Companies)

In this case p stores the path from a starting company to an ending company which can contain from one up to five edges of competitors. This would match companies that compete directly or indirectly with one another with up to five “connections” of separation between them. For variable-length edges, the path variable will also store the internal vertices that are used to fully traverse the path, not only the starting and ending vertices.

The most general case for path variable capture is variable-length edge traversal with multiple edge frames involved.

MATCH p = (:People)-[:WorksFor | :CompetesAgainst | :PartneredWith *1..2]->(:Companies)

In this example, p stores the full matched path containing the starting People vertex, any of the enumerated edge frames, internal vertices traversed of appropriate types and the final Companies vertex. Note that for different matching patterns, the number of internal vertices and edges used in the variable-length traversal can vary according to the bounds of the traversal.

Path variable can capture undirected edges specified in the pattern, with the restriction that undirected edge matching can only be used for edge frames in which the source and target vertex frame are the same.

MATCH p = (a:People)-[:FriendOf *1..6]-(b:People)

In this case, the path p stores paths that go from a to b or b to a using up to six edges.