2.18. Transactions

In order to achieve performance and scalability on very large data, xGT’s core engine is a highly parallel piece of software. For this reason, it is important to understand how xGT keeps data safe and consistent no matter how many complex analytics are modifying or augmenting the underlying data store.

The natural outcome of a naive concurrent system that allows simultaneous, unchecked data modification and retrieval is non-determinism: the results of a query would depend on the particular order in which its operations happened to interleave with others, and could change from run to run or even become self-contradictory. Clearly, this outcome is not what a data scientist needs or expects.

xGT prevents this scenario by building around a formal set of rules which govern how data reads and writes interact, called a consistency model. To avoid excessive complexity, xGT wraps these rules up into a familiar concurrent interface: transactions. This document explains how transactions are used and provides several example scenarios of how xGT behaves in the presence of concurrent data modification and query execution.

2.18.1. Transaction Basics

Transactions have a lifecycle with three possible states: execution, followed by either commit or rollback. When a transaction begins, the system creates a private space for it to make changes. Like a parallel universe, modifications are not applied directly to globally visible data; instead, they accumulate in this private space. When execution is complete, xGT attempts to apply the collected changes to the global state. If the modifications cannot be applied safely (for instance, if another transaction has made conflicting changes to the same data), then the transaction is “rolled back” and all the changes in the private space are discarded. The data is left as it was originally. Alternatively, if the modifications are successfully applied, then the transaction is said to have “committed” and these changes will be visible to every other transaction which commits from that point forward.
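The lifecycle above can be sketched with a small toy model. This is not xGT's implementation; the Store, Transaction, and Conflict names and the per-key version counter are assumptions made purely for illustration. Writes accumulate in a private buffer, and commit either applies all of them or discards them if another transaction changed the same data in the meantime.

```python
class Conflict(Exception):
    """Raised when buffered changes cannot be applied safely."""


class Store:
    """Globally visible data, with a version counter per key (toy model)."""

    def __init__(self):
        self.data = {}
        self.versions = {}

    def begin(self):
        return Transaction(self)


class Transaction:
    def __init__(self, store):
        self.store = store
        self.private = {}        # buffered writes, invisible to other transactions
        self.seen = {}           # version of each key when first touched
        self.state = "executing"

    def write(self, key, value):
        # Remember the version we started from, then buffer the change.
        self.seen.setdefault(key, self.store.versions.get(key, 0))
        self.private[key] = value

    def commit(self):
        # Roll back if any touched key changed since we first saw it.
        for key, version in self.seen.items():
            if self.store.versions.get(key, 0) != version:
                self.private.clear()
                self.state = "rollback"
                raise Conflict(key)
        # Otherwise apply every buffered change to the global state.
        for key, value in self.private.items():
            self.store.data[key] = value
            self.store.versions[key] = self.store.versions.get(key, 0) + 1
        self.state = "committed"


store = Store()
t1, t2 = store.begin(), store.begin()
t1.write("balance", 10)
t2.write("balance", 20)
t1.commit()                      # applied to the global state
try:
    t2.commit()                  # conflicts with t1's committed write
except Conflict:
    pass                         # t2's private changes were discarded
```

In this sketch the second transaction ends in the rollback state and the global data keeps only the first transaction's value.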

In many respects, transactions in xGT are similar to those found in databases in that they provide certain guarantees:

  • Atomicity: A transaction is a sequence of operations that either executes successfully and applies all its effects on the system, or it fails and none of its operations have any effect on the system. There is no in-between.

  • Consistency: A transaction will always leave the data in a usable state in which all components agree about the state of things. For instance, if a query adds an edge, that edge will always be visible, no matter whether it is searched by looking at all edges or traversed from a connected vertex.

  • Isolation: xGT guarantees that if a transaction succeeds, its results will be the same no matter what else was running on the system. Whether the system is idle or running at capacity, you see the same thing.

It is worth reiterating that xGT is not a database, and a shrewd observer will note that xGT’s transactions do not guarantee durability (the fourth promise in the ACID model used by most databases). xGT is an in-memory analytics engine and delivers dramatically higher performance on complex queries by avoiding disk access. Some users choose to complement xGT with a slower, traditional database system or other persistent storage solution to archive their data.

2.18.2. Transactions in Python

Most of the time, users won’t need to worry about transactions at all. The xgt library automatically wraps every method call in a transaction and hands it off to the xGT server. If no data conflict occurs, the method returns and life goes on. By using transactions throughout, xGT guarantees that all of a user’s interactions are protected from concurrent modification.

If something does go wrong and a transaction needs to be rolled back, xGT will signal this behavior to the user by raising an XgtTransactionError and setting the status of a job to rollback.
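One common way to react to a rollback is to re-run the operation a bounded number of times. The helper below is a minimal sketch of that pattern, not part of the xgt library: run_with_retry and its parameters are hypothetical names, and TransactionConflict is a stand-in exception so the example runs without a server.

```python
import time


class TransactionConflict(Exception):
    """Stand-in for XgtTransactionError so the sketch runs without a server."""


def run_with_retry(operation, retry_on=TransactionConflict, attempts=3, delay=0.0):
    """Call operation(); re-run it if it is rolled back, up to `attempts` times."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == attempts:
                raise                     # give up and surface the last rollback
            time.sleep(delay * attempt)   # simple linear back-off


# Demonstration with a fake operation that is rolled back twice.
calls = []


def flaky_operation():
    calls.append(1)
    if len(calls) < 3:
        raise TransactionConflict("rolled back")
    return "committed"


result = run_with_retry(flaky_operation)
```

Against a real server, one would presumably pass the library's XgtTransactionError as retry_on and wrap the xgt method call in operation. Retrying blindly is only appropriate when the operation is still valid after the conflicting change.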

Two cases are worth discussing in more detail: getting data to and from disk, and modifying data with TQL queries.

2.18.3. Loading and Saving Data

xGT provides the load() and save() method calls in our Python API to enable the user’s data to be moved into or out of xGT. Calls to either of these functions can take a significant amount of time, particularly for very large files or collections of files with billions of graph entities stored in them.

Both load() and save() calls are frame-oriented, which means that only one particular frame is being affected by the call. Other frames which are not being loaded or saved are unaffected and remain available to the user for queries (or for loading and saving as well). This makes it more likely for a load() or save() call to avoid having to roll back.

The save() call is simpler to understand: the frame being saved remains available to queries that read from it throughout the execution of the save(). It is therefore also possible to have more than one save() call running at the same time for the same frame. Only queries that try to modify or augment the data of that frame (including load() calls to the same frame) will not commit while the save() call is being processed. This prevents inconsistent data from being written out to disk. Queries that would conflict with a save() call are prevented from starting until the call finishes. Once it completes, they can start as normal.

The execution of a load() call is more involved because it both adds new data to an existing frame and updates other connected data. For instance, a load() call on a vertex frame will also involve updates to any edge frames that are using that vertex frame as an endpoint. A load() on an edge frame might also trigger the insertion of new vertices into a vertex frame. For these reasons, a load() call might conflict with several frames at the same time, blocking queries and other load or save operations which use those frames.

For both of these cases, remember that xGT will always guarantee that its data is left in a consistent, usable state. It is simply possible that a particular operation might need to be re-executed (if it is still valid) or re-thought (if some other operation has changed data on which the operation relied).

Any errors that occur during a load call (such as having a malformed line) will cause the job status to show as failed and cause the load command to raise an XgtIOError exception. Any lines in the file that are correct will be reflected in the frame, and any lines that produced an error will be stored in the job property of the exception. See Handling Ingest Errors for more detail.

2.18.4. Modifying Data in Queries

In addition to the data modification enabled by the load() and save() methods, xGT also supports direct modification of properties and graph elements as part of executing TQL queries.

We explain the syntax and semantics of those TQL constructs in greater detail in TQL for Cypher Users and The Trovares Query Language (TQL). In this section, we discuss how those constructs impact the execution of queries in xGT.

Property modifications and graph element additions and deletions are also stored in temporary buffers until the transaction commits (or rolls back). In this way, those changes to the graph behave like the accumulated changes that are triggered by a load() operation. An important difference is that the accumulated changes are not immediately visible to the executing query part (more details in Multiple Query Parts): the current query part sees only the existing, consistent global state of the frames involved in it. In this sense, changes are write-only until the query part finalizes. The changes then become visible to subsequent query parts in the same TQL script. Changes are not globally visible to other transactions until the current transaction successfully commits. A subsequent transaction will see those changes applied in the respective frames’ new consistent global states.

Having said this, it is important to recognize that a single query may write to the same logical element more than once, concurrently. For example, a query could attempt to modify the same property of the same vertex instance twice, along different paths in the query’s graph pattern. That is allowed by xGT and TQL, with the expectation that either of the concurrent writes could end up being committed to the global state of the vertex frame.
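The visibility rules in this section can be illustrated with a toy model. This is a sketch under stated assumptions, not xGT's engine: the run_query_part function and the dictionary standing in for a vertex frame are hypothetical. Each query part reads the state as it was before the part began, buffers its writes, and applies them only when the part finalizes.

```python
def run_query_part(frame, matches, update):
    """Apply update(current_value, amount) per match, buffering writes until the part ends."""
    buffered = []                                  # write-only during the part
    for key, amount in matches:
        snapshot = frame[key]                      # reads see the pre-part state
        buffered.append((key, update(snapshot, amount)))
    for key, value in buffered:                    # part finalizes: apply writes
        frame[key] = value
    return frame


# Hypothetical vertex frame: vertex id -> property value.
frame = {"a": 1, "b": 1}

# Vertex "a" is matched twice, as if reached along two different paths.
run_query_part(frame, [("a", 5), ("a", 7), ("b", 1)], lambda value, amount: value + amount)
# Both matches of "a" read 1, so they write 6 and 8; the increments do not stack.
# This toy applies buffered writes in order, so 8 wins here. xGT makes no such
# ordering promise: either write could be the one that is kept.
after_part_one = dict(frame)

# A subsequent query part sees the values written by the first part.
run_query_part(frame, [("a", 10)], lambda value, amount: value * amount)
```

The point of the sketch is that a query which depends on which of two writes to the same property survives should be restructured so that each element is written once.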

2.18.5. Background Statistics Collection

As discussed in Query Optimization, xGT can automatically optimize query execution by using statistics about the data. The collection of these statistics is performed in the background and controlled by the configuration variable metrics.cache described in Configuring the Server. Because this is a background process, statistics collection transactions defer to user transactions and should typically not cause any user actions to fail. On rare occasions, however, a transactional conflict may occur and cause a user’s transaction to roll back. If this occurs, the user transaction can be restarted.

2.18.6. Implicit Frame Creation

Currently, the methods that implicitly create frames from data, create_vertex_frame_from_data(), create_edge_frame_from_data(), and create_table_frame_from_data(), are not fully transactional. In each, creating a frame and loading data into it are performed as two separate transactions. This means that an IO failure can leave the xGT server in a state in which an empty frame exists.

When creating an edge frame with create_edge_frame_from_data(), source and target vertex frames are created if they do not already exist. Creating a source frame, creating a target frame, creating an edge frame, and loading data are all performed as separate transactions. It is possible that one or more vertex frames will be created, but an error occurs while creating the requested edge frame. In that case, the user can manually delete any vertex frames that were created with drop_frame().
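The manual cleanup described above can be wrapped in a small helper. The sketch below is hypothetical: create_with_cleanup, the frame_exists predicate, and the FakeConnection stand-in are not part of the xgt library, and the only server call assumed is drop_frame() accepting a frame name. The helper records which vertex frames existed beforehand and, on failure, drops only those that appeared during the call.

```python
def create_with_cleanup(conn, create, vertex_frame_names, frame_exists):
    """Run create(); on failure, drop vertex frames that it created.

    conn               -- object exposing drop_frame(name)
    create             -- zero-argument callable wrapping the implicit-creation call
    vertex_frame_names -- names of the source and target vertex frames
    frame_exists       -- hypothetical predicate: name -> bool
    """
    existed_before = {name for name in vertex_frame_names if frame_exists(name)}
    try:
        return create()
    except Exception:
        for name in vertex_frame_names:
            if name not in existed_before and frame_exists(name):
                conn.drop_frame(name)
        raise


class FakeConnection:
    """In-memory stand-in for a server connection, used only for this demo."""

    def __init__(self):
        self.frames = set()

    def drop_frame(self, name):
        self.frames.discard(name)


conn = FakeConnection()
conn.frames.add("People")            # the source vertex frame already existed


def failing_create():
    conn.frames.add("Companies")     # the target vertex frame gets created...
    raise RuntimeError("edge frame creation failed")   # ...then an error occurs


try:
    create_with_cleanup(conn, failing_create, ["People", "Companies"],
                        lambda name: name in conn.frames)
except RuntimeError:
    pass                             # "Companies" was dropped, "People" was kept
```

Because the existence check and the creation are themselves separate steps, another user could create a frame with the same name in between; on a shared server the list of frames to drop should be reviewed rather than removed blindly.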