Thinking about Graph Data
Introduction
xGT is a tool for reading in massive amounts of data into RAM for performing fast pattern search operations. The best data for this analytic approach is where there are relationships between data objects described in the data (i.e. linked data). The classic example of this is a social network graph where people have relationships with each other represented in the data such as friend-of, knows, and family.
We start with the assumption you have data and want to analyze it with xGT.
Let's walk through how to approach this activity, how to build a mental model
of your data, how to describe that model using the Trovares xgt
Python library,
and how to get xGT to help you understand what it is in your data.
We will begin with the data and follow a simple example to illustrate this process. Consider that you have data in two separate comma-separated-values (CSV) files:
PersonID | Name |
---|---|
123456789 | Manny |
123454321 | Bob |
987654321 | Frank |
987656789 | Alice |
PersonID | PersonID | StartDate | EndDate |
---|---|---|---|
123456789 | 987654321 | 20150103 | 20170414 |
123454321 | 987654321 | 20160402 | 20170414 |
987656789 | 987654321 | 20160707 | 20170414 |
123456789 | 987656789 | 20170415 | - |
123454321 | 987656789 | 20170415 | - |
987654321 | 987656789 | 20170415 | - |
Graph Model
These two data sources correspond nicely to components of a graph. The first is a collection of information about an object (in this case, a person), which can be represented as a vertex in the graph. A vertex is a mathematical name for the "bubble" in a graph drawing.
The second is a collection of information about relationships between objects, which can be represented as an edge in the graph. An edge is a mathematical name for the line that connects two vertices. We usually consider the edge to have a direction, meaning it goes from one specific vertex to another specific vertex; and we indicate that with an arrow-head point to the to vertex. We can call the two vertices in this relations source and target of the edge.
Graph Image
Setting up xGT with Graph Model
We can use the xgt
Python library to build a graph model of the CSV data above.
First, we construct a Graph
object, which acts as a container for other components.
Next, we describe a Vertex
object which will hold the employee data and an Edge
object which will hold the data describing who reports to whom.
By describing the PersonID
property as a key of a vertex and specifying that
the source and target of an edge are also PersonID
properties, we are giving
xgt
the information required to link these properties into a single, connected
graph. This lets us hop along edges from vertex to vertex quickly and efficiently.
Details for the commands below are described in our
xgt
reference manual.
graph_definition = xgt.Graph('Company')
vtx_def = xgt.Vertex(name = 'Employee',
schema = [('PersonID', xgt.INT),
('Name', xgt.TEXT)],
key = ['PersonID'])
graph_definition.add(vtx_def)
edge_def = xgt.Edge(name = 'ReportsTo',
schema = [('EmpID', xgt.INT),
('BossID', xgt.INT),
('StartDate', xgt.DATE),
('EndDate', xgt.DATE)],
source = [('EmpID', vtx_def.key.PersonID)],
target = [('BossID', vtx_def.key.PersonID)])
graph_definition.add(edge_def)
Understanding the Vertices
As mentioned earlier, the Employee data is represented in our graph model as a
vertex.
A Vertex component of the graph is uniquely identified by some set of columns
from the vertex schema.
In our case, the PersonID
by itself is enough to uniquely identify a Person
object, so that is our vertex key.
All of the columns from the vertex schema that are not used as key columns are
properties of the object.
In our case, the Person vertex has a Name
attribute for each Person vertex.
Understanding the Edges
To connect two vertices (employees) with a reports to
relationship we
have an edge.
The data associated with each edge comes from the edge's schema.
In our case, it comes from the ReportsTo
schema.
Note that there are no columns of the ReportsTo
schema that are directly
construed as vertices.
We can guess that the EmpID
and BossID
look like identifiers for vertices,
but that is not provided by the structure of the schema.
We must establish this explicitly using the source
and target
parameters of
the Edge() function.
Note that there are no PersonID
columns in the edge schema, and even if there
were, it is possible that the names are coincidental. As the curators of the data,
we happen to know that EmpID
and BossID
are intended to refer to the same
concept. We must inform xGT of this relationship explicitly by using the source
and target
parameters of the Edge
class.
The direction of our ReportsTo
relationship is from employee to boss, so the
EmpID
column is used as the source
parameter of the edge, and the
BossID
column is used as the target
parameter of the edge.
Data loading
Normally, having described the schema of the graph components, our next step would be to actually fill those components with data from our tables. For brevity, we'll skip over this step, but you can learn about the various mechanisms xGT provides in our data management documentation Let's pretend we've accomplished this and skip straight to searching for patterns in our graph.
Looking for interesting patterns
If you have looked over our sample data you may have guessed that one
interesting patterns is finding a pattern of employee X
that reports to
boss Y
at some point in time where the roles are later reversed.
That is, employee Y
reports to Boss X
at a date that is later.
You can imagine that spotting such a pattern is easy in a few instances, but if you had 100,000 employees to look through, it would be very challenging for a person to notice these kinds of patterns. If your data consisted of employee data from many companies, it is easy to imagine getting to hundreds of millions of graph edges.
So let's see how to convert our image of a pattern into TQL to have xGT perform an automated search for all patterns.
Describing one relationship
To describe the first X --> Y
relationship---where -->
is used to indicate
"reports to", which can also be understood as an edge of the graph---we
begin to formulate a MATCH
statement as follows:
MATCH (emp:Employee)-[edge1:ReportsTo]->(boss:Employee)
Note that in TQL the vertices must be given a vertex type (in our case
this is Employee
for both the employee and for the boss) because
the xGT data model supports multiple vertex
types and multiple edge types in a graph.
For a similar rationale we must supply an edge type for the connecting
edge (in our case it is ReportsTo
).
But this MATCH
statement is incomplete. For example, xGT is not
told what to do whenever it finds a match.
To formulate the simplest query that xGT can run we could do this:
MATCH (emp:Employee)-[edge1:ReportsTo]->(boss:Employee)
RETURN emp.PersonID, edge1.StartDate, edge1.EndDate, boss.PersonID AS boss
Be careful with this! It produces an exact copy of the ReportsToTable
,
which may be much larger than you want to deal with.
Describing the second relationship
The later employee reporting structure that we want to find can be thought
of as a two-path (two contiguous edges) through the graph.
At a high level of abstraction, it comes down to: X --> Y --> X
.
We also need to add the constraint that the end date of the first edge
comes on or before the start date of the second edge.
We begin by showing how to describe a two-path:
MATCH (emp:Employee)-[edge1:ReportsTo]->(boss:Employee)-[edge2:ReportsTo]->(emp)
RETURN emp.PersonID, edge1.StartDate AS Start1, edge1.EndDate AS End1,
boss.PersonID AS boss,
edge2.StartDate AS Start2, edge2.EndData AS End2
To add the constraint about the second edge coming on or after the first edge,
we add a WHERE
clause:
MATCH (emp:Employee)-[edge1:ReportsTo]->(boss:Employee)-[edge2:ReportsTo]->(emp)
WHERE edge1.EndDate <= edge2.StartDate
RETURN emp.PersonID AS Employee1ID, boss.PersonID AS Employee2ID,
edge1.StartDate AS Start1, edge1.EndDate AS End1,
edge2.StartDate AS Start2, edge2.EndData AS End2
It is common that queries include constraints in the form of the WHERE
clause.
We use the term query to refer to a MATCH
statement such as the one above.
Understanding the query result
There are really two graphs involved in a query: the large data graph and
the smaller query graph.
The query graph is the graph structure (vertices and edges) described in the
MATCH
statement without the constraints of the WHERE
clause.
When xGT finds a matching pattern in the large data graph---where
"matching" means that the graph structure is aligned and that the attributes
attached to the subgraph of the large data graph being matched satisfies
the constraints of the WHERE
clause---a row is added to a result table.
The result table will have columns that correspond to the values/names on
the RETURN
clause.
If the return clause field has an AS name
component, that will be the name
of the column in the result table.
Otherwise, xGT will select a unique name for the result table column.
The data type of each column is determined from the data type of the return
value.
Note that we should first drop the result table in case it already exists.
DROP TABLE Result;
MATCH (emp:Employee)-[edge1:ReportsTo]->(boss:Employee)-[edge2:ReportsTo]->(emp)
WHERE edge1.EndDate <= edge2.StartDate
RETURN emp.PersonID AS Employee1ID, boss.PersonID AS Employee2ID,
edge1.StartDate AS Start1, edge1.EndDate AS End1,
edge2.StartDate AS Start2, edge2.EndData AS End2
INTO TABLE Result;
Exploring the query result
Once a query finishes, xGT will contain a table that is populated with data that you described in the query.
Concluding remarks
You have now been given a mental model of working with graphs and graph data
and how some snippets of TQL relate to the mental model.
To begin working with real data inside xGT, it is essential to use python
scripting and Trovares' xgt
Python library.