4.4. Data Movement

The xGT tool implements a strongly typed property graph data model. The fundamental building blocks of this model are frames: tabular representations of data with prescribed data types. To create a frame, a user specifies the name and data type of each property (together called a schema), as well as any other information specific to that structure (for instance, the key property for a vertex frame). For more details see: create_table_frame(), create_vertex_frame(), and create_edge_frame().

When a user wants to load data, they do so via an existing frame. xGT provides support for a variety of different data sources and loading options on these frames. Users can get frame objects either from the output of the create_*_frame() methods or by explicitly calling the get_*_frame() method with the name or names of the frames the user wants. More detailed information about getting frames can be found in the API documentation: get_table_frame(), get_vertex_frame(), and get_edge_frame().

Here is an example of getting an edge frame and assigning it to a frame object named a_frame:

# *conn* is a connection to the server.
a_frame = conn.get_edge_frame('example__Name')

4.4.1. Getting Data into xGT

There are two ways to get data into xGT: across a network and from a filesystem. File-based ingest of either kind is expressed by calling load() on the relevant frame object.

The signature of the load method is: load(paths, headerMode=xgt.HeaderMode.NONE). It is called on a frame object returned from the create_*_frame() or get_*_frame() methods. Here is an example:

a_frame = conn.get_edge_frame('example__Name')
a_frame.load('/data/path/my_edges.csv')

The load method provides some flexibility for ingesting a CSV file by location and via header rules. The paths parameter describes where xGT can find the CSV file or files and uses an optional protocol as a prefix to indicate what system the file is located on. This can be local to the script, local to the server, or an external server. The headerMode parameter is a flag indicating whether or not the CSV data sources contain a first line that is a header (i.e., column names) and how it should be handled. There are four modes:

  • xgt.HeaderMode.NONE means there is no header line in the CSV file.

  • xgt.HeaderMode.IGNORE means there is a header line, but the server should ignore it entirely.

  • xgt.HeaderMode.NORMAL means map the header to the schema in a relaxed way. It will ignore columns it can’t map to property names and fill in columns that aren’t mapped with null values.

  • xgt.HeaderMode.STRICT means xGT will raise an exception if the schema isn’t fully mapped or if additional columns exist in the file and not in the schema. The header line can be modified to work with “strict” mode by using the special string IGNORE in the header to ignore a column instead of producing an error.
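The difference between the NORMAL and STRICT rules can be illustrated with a small stand-alone sketch. It is plain Python that imitates the mapping behavior described above; it is not xGT's implementation, and the function name map_header and the sample schema are hypothetical.

```python
def map_header(header, schema_names, strict=False):
    """Return {property_name: column_index} for one CSV header line.

    Illustrative only: imitates the NORMAL and STRICT rules described
    in the text, not the server's actual code.
    """
    mapping = {}
    for index, column in enumerate(header):
        if column in schema_names:
            mapping[column] = index
        elif strict and column != "IGNORE":
            # STRICT: a file column with no matching property is an error,
            # unless the header marks it with the special string IGNORE.
            raise ValueError(f"column {column!r} is not in the schema")
        # NORMAL: file columns that cannot be mapped are skipped.
    missing = [name for name in schema_names if name not in mapping]
    if strict and missing:
        # STRICT: every schema property must be mapped.
        raise ValueError(f"schema properties not in header: {missing}")
    # NORMAL: properties listed in `missing` would be filled with nulls.
    return mapping


schema = ["src", "dst", "weight"]  # hypothetical property names

# NORMAL: 'comment' is skipped and 'weight' would be null.
print(map_header(["dst", "src", "comment"], schema))

# STRICT: the extra column is tolerated only because it is named IGNORE.
print(map_header(["src", "IGNORE", "dst", "weight"], schema, strict=True))
```

In this sketch a STRICT header that names an unknown column, or omits a schema property, raises ValueError, which mirrors the exception behavior described for xgt.HeaderMode.STRICT.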

4.4.1.1. Reading from the Client Filesystem

This method of loading data reads data from the filesystem local to the client. The paths parameter is an absolute or relative path (or a list of such paths) on the client’s local filesystem. This is indicated by using either no path prefix or the xgt:// protocol prefix. If no protocol is supplied, the default behavior is to use the xgt:// protocol.

a_frame.load('/data/path/my_vertices.csv')
another_frame.load(['xgt:///data/path/my_edges.csv', '/data/path/my_other_edges.csv'])

These load calls make the xgt client read the files from the local filesystem and then send them to the xGT server. If the server is remote, the data travels across the network.

Data can also be ingested directly from a Python list by using the insert() method of a frame object.

a_frame.insert([[0, 0, "val1"], [1, 0, "val2"], [1, 5, "val3"]])
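Because each inner list is one row, it can help to check that every row has one value per schema property before handing the list to insert(). The following is a minimal sketch in plain Python; the helper name check_rows and the schema shown are hypothetical and are not part of the xGT API.

```python
def check_rows(rows, schema):
    """Raise ValueError if any row's length differs from the schema's."""
    width = len(schema)
    for number, row in enumerate(rows):
        if len(row) != width:
            raise ValueError(
                f"row {number} has {len(row)} values, expected {width}"
            )
    return rows


# Hypothetical schema: property names paired with type labels.
schema = [["source", "int"], ["target", "int"], ["label", "text"]]

rows = check_rows([[0, 0, "val1"], [1, 0, "val2"], [1, 5, "val3"]], schema)
# `rows` could then be passed to a_frame.insert(rows).
```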

4.4.1.2. Reading from the Server Filesystem

This method of loading data is a request for the xGT server to read files from the filesystem of the machine where the server is running. The path is preceded by the xgtd:// protocol, which tells the xgt client to pass the request to the xGT server. This is much faster than loading through the client because it avoids transmitting the data across the network.

a_frame.load('xgtd://data/path/example.csv')
another_frame.load(['xgtd://data/path/myfirst.csv', 'xgtd://data/path/mysecond.csv'])

4.4.1.3. Reading from a URL

This method of loading data asks the xGT server to retrieve CSV-formatted data from a URL. The protocol can be either http:// or https://.

a_frame.load('http://www.example.com/data/path/example.csv')
another_frame.load(['http://www.example.com/data/myfirst.csv', 'https://www.example.com/data/mysecond.csv'])

4.4.1.4. Reading from an AWS S3 Bucket

This method of loading data asks the xGT server to pull CSV-formatted data directly from an AWS S3 bucket and uses the s3:// protocol.

a_frame.load('s3://my-s3-bucket/data/path/example.csv')
another_frame.load(['s3://my-s3-bucket/data/myfirst.csv', 's3://my-s3-bucket/data/mysecond.csv'])
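The protocol prefixes described in the preceding subsections can be summarized with a small stand-alone helper. This is only a sketch of the rules as stated in the text; the function describe_source is hypothetical and does not reflect how xGT parses paths internally.

```python
def describe_source(path):
    """Return which side reads a load() path, following the prose rules."""
    if path.startswith("xgtd://"):
        return "server filesystem"
    if path.startswith(("http://", "https://")):
        return "URL fetched by the server"
    if path.startswith("s3://"):
        return "S3 bucket read by the server"
    if path.startswith("xgt://") or "://" not in path:
        # 'xgt://' and a bare path both mean the client filesystem.
        return "client filesystem"
    raise ValueError(f"unrecognized protocol in {path!r}")


for example in [
    "/data/path/my_vertices.csv",
    "xgt:///data/path/my_edges.csv",
    "xgtd://data/path/example.csv",
    "https://www.example.com/data/mysecond.csv",
    "s3://my-s3-bucket/data/myfirst.csv",
]:
    print(example, "->", describe_source(example))
```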

4.4.1.5. Handling Ingest Errors

If an error occurs while ingesting a valid file, the load call will raise a Python exception and the server will create a table containing the errors. The name of the error table is stored in the error_frame_name property of the frame object on which the exception occurred.

The error table will contain a list of lines that failed with corresponding error messages. Each row of the table corresponds to a line that had an ingest error. The first column contains the error message. The second column contains the line that failed to ingest, if available. The next columns contain the data that was encoded before the failure occurred. This will usually be empty since most errors occur before encoding.

The lines that were successfully ingested will be rows in the relevant frame. This frame is ready to be used. It is just missing rows for the lines that had ingest errors. This allows a user to resolve the bad lines and insert just those without having to reingest the entire file.

try:
  a_frame.load('test.csv')
except xgt.XgtError:
  # The server recorded the failed lines in an error table.
  error_table = conn.get_table_frame(a_frame.error_frame_name)
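Once the rows of the error table have been retrieved as Python lists, the layout described above (message first, failed line second, partially encoded values after that) makes it straightforward to collect the lines worth fixing and reinserting. The sketch below is plain Python operating on made-up sample rows. The helper name split_error_rows, the sample messages, and the use of None for an unavailable line are all assumptions.

```python
def split_error_rows(error_rows):
    """Separate failed lines that can be retried from those that cannot.

    Assumes each row is [message, line, *partially_encoded_values] and
    that `line` is None when the failed line is unavailable.
    """
    retry, unrecoverable = [], []
    for message, line, *_ in error_rows:
        if line is None:
            unrecoverable.append(message)
        else:
            retry.append((line, message))
    return retry, unrecoverable


# Made-up rows in the documented column order.
sample = [
    ["could not parse integer", "a,1,x", None],
    ["line unavailable", None, None],
]

retry, unrecoverable = split_error_rows(sample)
print(retry)          # lines to correct and insert again
print(unrecoverable)  # messages with no recoverable line
```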

4.4.2. Getting Data out of xGT

The save() method of a frame object requests that xGT write data into a CSV file on either the client or server filesystem. The signature of the save method is: entity.save(path, offset=0, length=None, headers=False). Like load, it is called on a frame object.

The path argument of save() takes one of two forms.

4.4.2.1. Saving to the Client Filesystem

This method of saving data writes frame data to a file on the filesystem local to the client. The user provides an absolute or relative path to the output file on the client’s local filesystem. This is indicated by using either no path prefix or the xgt:// protocol prefix. If there is no protocol supplied, the default behavior is to use the xgt:// protocol.

frame.save('xgt:///data/path/example.csv')
frame.save('../data/path/example.csv')
frame.save('../data/interesting.data.csv', offset=10, length=100, headers=True)

Note that this sends the data from the server to the client before writing the file, which is much slower than having the server write the file to disk, even if the client is running on the same machine as the server. It is recommended that only smaller datasets be saved directly to the client filesystem. The third frame.save() example shows how to limit the number of rows saved: it pulls at most 100 rows starting from row 10. It also demonstrates how to include the column names as a header line.
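The effect of offset, length, and headers can be pictured with a stand-alone sketch that applies the same windowing to an in-memory list of rows. This is plain Python, not xGT code. The function write_window and the sample data are hypothetical, and the sketch assumes that offset counts rows from zero.

```python
import csv
import io


def write_window(rows, column_names, offset=0, length=None, headers=False):
    """Return CSV text for at most `length` rows starting at `offset`."""
    stop = None if length is None else offset + length
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    if headers:
        # Column names become the first line of the output.
        writer.writerow(column_names)
    writer.writerows(rows[offset:stop])
    return out.getvalue()


rows = [[i, i * i] for i in range(200)]  # made-up data
text = write_window(rows, ["n", "square"], offset=10, length=100, headers=True)
lines = text.splitlines()
print(len(lines))  # one header line plus the selected rows
print(lines[1])    # first data row in the window
```

If fewer than length rows remain after offset, the slice simply returns what is left, which matches the "at most" wording above.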

A saved file can be read directly into some other analytic tool such as MS Excel.

4.4.2.2. Saving to the Server Filesystem

This method of saving data is a request for the xGT server to write frame data to the remote filesystem where xGT is running. The user provides an absolute or relative path on the server’s filesystem using the xgtd:// protocol.

frame.save('xgtd://../data/path/example.csv')
frame.save('xgtd://../data/interesting.data.csv', offset=10, length=1000000, headers=True)

This method should be used when the data is too large to transfer to the client in a reasonable time. The data can still be copied elsewhere after it has been saved to the server filesystem. For large datasets it is usually faster to save the file on the server and copy it to the client system than to have xGT save the file directly to the client system.