Transactions in xGT

In order to achieve performance and scalability on very large data, xGT's core engine is a highly parallel piece of software. For this reason, it is important to understand how xGT keeps data safe and consistent no matter how many complex analytics are modifying or augmenting the underlying data store.

The natural outcome of a naive system allowing simultaneous, unchecked data modification and retrieval in a concurrent system is non-determinism: the results of a query would depend on the particular order in which it was executed and may change or even become self-contradictory. Clearly, this outcome is not what a data scientist needs or expects.

xGT prevents this scenario by building around a formal set of rules which govern how data reads and writes interact, called a consistency model. To avoid excessive complexity, xGT wraps these rules up into a familiar concurrent interface: transactions. This document explains how transactions are used and provides several example scenarios of how xGT behaves in the presence of concurrent data modification and query execution.

Transaction basics

Transactions have a lifecycle that includes three states: execution and either commit or rollback. When a transaction begins, the system creates a private space for it to make changes. Like a parallel universe, modifications performed are not directly applied to globally visible data, instead being accumulated into this private space for later. When execution is complete, xGT attempts to apply the collected changes to the global state. If the modifications cannot be applied safely (for instance, if another transaction has made conflicting changes to the same data), then the transaction is "rolled back" and all the changes in the private space are discarded. The data will be left as it was originally. Alternatively, if the modifications are successfully applied, then the transaction is said to have "committed" and these changes will be visible to every other transaction which commits from that point forwards.

In many respects, transactions in xGT are similar to those found in databases in that they provide certain guarantees:

It is worth reiterating that xGT is not a database, and a shrewd observer will note that xGT's transactions do not guarantee durability (the fourth promise in the ACID model used by most databases). xGT is an in-memory analytics engine and delivers dramatically higher performance on complex queries by avoiding disk access. Some users choose to complement xGT with a slower, traditional database system or other persistent storage solution to archive their data.

Using transactions in Python

Most of the time, users won't need to worry about transactions at all. The xgt library automatically wraps every method call in a transaction and hands it off to the xGT server. If no data conflict occurs, the method returns and life goes on. By using transactions throughout, xGT guarantees that all of a user's interactions are protected from concurrent modification.

If something does go wrong and a transaction needs to be rolled back, xGT will signal this behavior to the user by raising an XgtTransactionError and setting the status of a job to rollback.

Two cases are worth discussing in more detail: getting data to and from disk, and modifying data with TQL queries.

Data loading and saving

xGT provides the load() and save() method calls in our Python API to enable the user's data to be moved into or out of xGT. Calls to either of these functions can take a significant amount of time, particularly for very large files or collections of files with billions of graph entities stored in them.

Both load() and save() calls are frame-oriented, which means that only one particular frame is being affected by the call. Other frames which are not being loaded or saved are unaffected and remain available to the user for queries (or for loading and saving as well). This makes it more likely for a load() or save() call to avoid having to roll back.

The save() call is simpler to understand: the frame being saved still remains available for queries reading from it at all times during the execution of the save(). Thus, it is also possible to have more than one simultaneous save() call going on at the same time for the same frame. Only queries that try to modify or augment the data of that frame (including load() calls to the same frame), will not commit while the save() call is being processed. This prevents inconsistent data from being written out to disk. Queries that would conflict with a save() call are prevented from starting until the call finishes. Once it completes, they can start as normal.

The execution of a load() call is a bit more involved since it involves both adding new data to an existing frame as well as updating other connected data. For instance, a load() call on a vertex frame will also involve updates to any edge frames that are using that vertex frame as an endpoint. A load() on an edge frame might also trigger the insertion of new vertices into a vertex frame. For these reasons, a load() call might conflict with several frames at the same time, blocking queries and other load or save operations which use those frames.

For both of these cases, remember that xGT will always guarantee that its data is left in a consistent, usable state. It is simply possible that a particular operation might need to be re-executed (if it is still valid) or re-thought (if some other operation has changed data on which the operation relied).

Any errors that occur during a load call (such as having a malformed line) will cause the job status to show as failed and cause the load command to throw an exception. Any lines in the file that are correct will be reflected in the frame, and any lines that produced an error will be placed in the error table reflected in the exception message.

Data modification in TQL

In addition to the data modification enabled by the load() and save() methods, xGT also supports direct modification of properties and graph elements as part of executing TQL queries.

We explain the syntax and semantics of those TQL constructs in the greater detail in the Cypher support in TQL and Introduction to the Trovares Query Language documents. In this section, we discuss how those constructs impact the execution of queries in xGT.

Property modifications and graph element additions and deletions are also stored in temporary buffers until the transaction commits (or rollbacks). In this way, those changes to the graph behave like the accumulated changes that are triggered by a load() operation. An important difference is that the accumulated changes are not immediately visible to the executing query, the query only sees the existing and consistent global state of the frames involved in it. Changes are write-only in this manner, until the transaction commits. Then a subsequent transaction will see those changes applied in the respective frames' new consistent global states.

Having said this, it is important to recognize that there is a possibility of a query writing to the same logical element concurrently. For example, a query could attempt to modify the same property of the same vertex instance twice, under different paths in the query's graph pattern. That is allowed by xGT and TQL, with the expectation that either of the concurrent writes could end up being committed to the global state of the vertex frame.

Background statistics collection

As discussed in Query Order Optimization in xGT, xGT can automatically optimize query execution by using statistics about the data. The collection of these statistics is performed in the background and controlled by the configuration variable metrics.cache described in Configuring xgtd. Because this is a background process, statistic collection transactions defer to user transactions and should typically not cause any user actions to fail. However, on rare occasion it is possible that a transactional conflict will occur, causing a user's transaction to rollback. If this occurs, the user transaction can be restarted.