4.5. Data Movement

The xGT tool implements a strongly typed, property graph data model. The fundamental building blocks of this model are frames: tabular representations of data with prescribed data types. In order to create a frame, a user specifies the names and data types of each property (called a schema) as well as any other information specific to that structure (for instance, the key property for a vertex frame). For more details see: create_table_frame(), create_vertex_frame(), and create_edge_frame().
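For example, frame creation might look like this (a sketch assuming an open connection conn to a running xGT server; the frame names, property names, and schemas shown are illustrative):

```python
# *conn* is a connection to the server.
people = conn.create_vertex_frame(
    name = 'Person',
    schema = [['id', xgt.INT], ['name', xgt.TEXT]],
    key = 'id')

# An edge frame also names its endpoint vertex frames and which
# properties hold the source and target vertex keys.
friendships = conn.create_edge_frame(
    name = 'Friendship',
    schema = [['src', xgt.INT], ['trg', xgt.INT]],
    source = 'Person',
    target = 'Person',
    source_key = 'src',
    target_key = 'trg')
```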

When a user wants to load data, they do so via an existing frame. xGT provides support for a variety of different data sources and loading options on these frames. Users can get frame objects either from the output of the create_*_frame() methods or by explicitly calling the get_*_frame() method with the name or names of the frames the user wants. More detailed information about getting frames can be found in the API documentation: get_table_frame(), get_vertex_frame(), and get_edge_frame().

Here is an example of getting an edge frame and assigning it to a frame object named a_frame:

# *conn* is a connection to the server.
a_frame = conn.get_edge_frame('Name')

In addition to supporting direct data transfers to frames via the xGT Python client API, the xgtd server also offers an Apache Arrow Flight endpoint. Frames are mapped to individual Arrow Flight objects. More information is available in Section Arrow Flight Server Endpoint.

4.5.1. Getting Data into xGT

There are multiple ways to get data into xGT: from a filesystem, from a network location, or from a Python object. Getting data into xGT from a filesystem or network location is done by calling load() on the appropriate frame object. Getting data from a Python object is done by calling insert() on the frame object.

The signature of the load method is: load(paths, header_mode = xgt.HeaderMode.NONE, ...) and is called on a frame object returned from the create_*_frame() or get_*_frame() methods. Here is an example of this:

a_frame = conn.get_edge_frame('Name')
job = a_frame.load('/data/path/my_edges.csv')

The load method can load CSV files, Parquet files, or compressed CSV files. However, there are some limitations for Parquet and compressed CSV files (see Reading Parquet Files or Compressed CSV Files). The load method also provides some flexibility for ingesting a CSV file by location and via header rules. The paths parameter describes where xGT can find the CSV file or files and uses an optional protocol prefix to indicate what system the file is located on. This can be local to the script, local to the server, or an external server.

The load method will return a completed job that indicates the status of the load.
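For example (a sketch; the status attribute on the returned job object is an assumption here):

```python
job = a_frame.load('/data/path/my_edges.csv')
print(job.status)  # assumed to report the job's final state, e.g. 'completed'
```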

4.5.1.1. Reading from the Client Filesystem

This method of loading data reads data from the filesystem local to the client. The path parameter is simply an absolute or relative path on the client’s local filesystem. A relative path is relative to the working directory of the Python script. This is indicated by using either no path prefix or the xgt:// protocol prefix. If there is no protocol supplied, the default behavior is to use the xgt:// protocol.

a_frame.load('/data/path/my_vertices.csv')
another_frame.load('xgt:///data/path/my_edges.csv', '/data/path/my_other_edges.csv')

These load calls request that the client search its local filesystem for the files and then send them to the xGT server, so the data crosses the network when the server is remote. Because reading from the server filesystem avoids transmitting the data across the network, it is much faster. It is strongly recommended to load large files from the server filesystem.

4.5.1.2. Reading from the Server Filesystem

This method of loading data is a request for the xGT server to read files from the remote filesystem where xGT is running. The path is preceded with the xgtd:// protocol telling xgt to pass off the request to the xGT server. Note that the actual location on the xGT server’s filesystem must be within the configured sandbox (see Enterprise Data Movement). All paths will be relative to the root sandbox directory, and any path that escapes the sandbox is invalid. The default sandbox location is /srv/xgtd/data/ but can be configured to be different by your system administrator. Consult your system administrator if you do not know the location of or have access to the sandbox directory.

a_frame.load('xgtd://data/path/example.csv')
another_frame.load(['xgtd://data/path/myfirst.csv', 'xgtd://data/path/mysecond.csv'])

4.5.1.3. Reading from a URL

This method of loading data asks the xGT server to retrieve CSV-formatted data from a URL. The protocol can be either http:// or https://.

a_frame.load('http://www.example.com/data/path/example.csv')
another_frame.load(['http://www.example.com/data/myfirst.csv', 'https://www.example.com/data/mysecond.csv'])

4.5.1.4. Reading from an AWS S3 Bucket

This method of loading data asks the xGT server to pull CSV-formatted data directly from an AWS S3 bucket and uses the s3:// protocol.

a_frame.load('s3://my-s3-bucket/data/path/example.csv')
another_frame.load(['s3://my-s3-bucket/data/myfirst.csv', 's3://my-s3-bucket/data/mysecond.csv'])

You will need to have S3 credentials set for this (see S3 Credentials).

4.5.1.5. Inserting from Python

Data can also be ingested directly from a Python list by using the insert() method of a frame object.

a_frame.insert([[0, 0, "val1"], [1, 0, "val2"], [1, 5, "val3"]])

Cypher lists can be inserted by using a Python list as a row element. In the following example the second element of the row is a list of strings:

a_frame.insert([[0, ["val1", "val2"]], [1, ["val2", "val3"]], [2, ["val3", "val4"]]])

4.5.1.6. Reading Into Edge Frames

Edges created from an edge frame load() or insert() will contain references to their respective source and target vertex endpoints. If an edge references a non-existent endpoint, that source or target vertex will be created automatically.

If xGT is used with access control, this can impact loading into edge frames. If frame-level access control is used, the user requires read and create access on the source and target vertex frames. This is required regardless of whether new endpoints are created during the ingest. If row-level access control is used, the user also needs row-level access to the endpoint vertices referenced by an ingested edge. For more information on access control, see Access Control.

4.5.1.7. Reading from a CSV File

When loading from a CSV file, the header_mode and frame_to_file_column_mapping parameters control how the data in the CSV file is ingested into the frame. The header_mode parameter is a flag indicating whether the CSV data sources contain a first line that is a header (i.e., column names) and how it should be handled. There are four modes:

  • xgt.HeaderMode.NONE means there is no header line in the CSV file.

  • xgt.HeaderMode.IGNORE means there is a header line, but the server should ignore it entirely.

  • xgt.HeaderMode.NORMAL means map the header to the schema in a relaxed way. If the column names in the header match the property names of the frame’s schema, then data will be automatically ingested into these properties. It will ignore columns it can’t map to property names and fill in columns that aren’t mapped with null values.

  • xgt.HeaderMode.STRICT (deprecated) means xGT will raise an exception if the schema isn’t fully mapped or if additional columns exist in the file and not in the schema. The header line can be modified to work with “strict” mode by using the special string IGNORE in the header to ignore a column instead of producing an error.

Consider this example:

id,firstName,lastName,gender,birthday
01,Alice,Roberts,unspecified,1965-04-10

This example is shown using the comma as the CSV delimiter.

If the schema has id,lastName,firstName,birthday, then the NORMAL mode will run without errors, mapping the third file column to the second schema property, the second file column to the third property, and ignoring the “gender” column. If, however, the header mode is STRICT, an exception would be raised due to the inability to fully map the CSV columns to the frame’s schema properties. For the STRICT header mode, this schema would work fine: id,lastName,firstName,birthday,gender.
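The relaxed mapping performed by NORMAL mode can be sketched in plain Python (an illustration of the semantics only, not xGT's implementation; the function name is ours):

```python
def normal_mode_map(header, schema_properties):
    """Map each schema property to the same-named file column.

    File columns matching no property are ignored; properties with no
    matching column map to None and would be filled with nulls.
    """
    column_of = {name: i for i, name in enumerate(header)}
    return {prop: column_of.get(prop) for prop in schema_properties}

header = ['id', 'firstName', 'lastName', 'gender', 'birthday']
schema = ['id', 'lastName', 'firstName', 'birthday']
mapping = normal_mode_map(header, schema)
# mapping == {'id': 0, 'lastName': 2, 'firstName': 1, 'birthday': 4};
# the 'gender' column (index 3) is simply ignored.
```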

The frame_to_file_column_mapping parameter allows the user to specify which file column should be read into which property of the frame. This is useful if the file has no header or if the header column names do not match the names of the frame’s properties.

For the following examples using frame_to_file_column_mapping, assume a frame with the schema id,lastName,firstName,birthday and a file with the following header:

DOB,firstName,surname,location,ID

The names in the header are different than the properties of the schema, but the file can still be loaded correctly into the frame as shown below. frame_to_file_column_mapping should be a dictionary whose keys are names of the frame property and whose values indicate which file column should be read into that property. In this example, the file column names from the file header are used:

column_map = { 'id' : 'ID', 'lastName' : 'surname', 'firstName' : 'firstName', 'birthday' : 'DOB' }
a_frame.load('example.csv', header_mode = xgt.HeaderMode.NORMAL, frame_to_file_column_mapping = column_map)

Instead of using the name of the file column, the position of the column in the file (0-indexed) can be used:

column_map = { 'id' : 4, 'lastName' : 2, 'firstName' : 1, 'birthday' : 0 }
a_frame.load('example.csv', frame_to_file_column_mapping = column_map)

The dictionary can also contain a mix of names and positions:

column_map = { 'id' : 4, 'lastName' : 'surname', 'firstName' : 'firstName', 'birthday' : 0 }
a_frame.load('example.csv', header_mode = xgt.HeaderMode.NORMAL, frame_to_file_column_mapping = column_map)

The same data column can be inserted into more than one frame property. Below, the second file column is inserted into both the “lastName” and “firstName” properties.

column_map = { 'id' : 4, 'lastName' : 1, 'firstName' : 1, 'birthday' : 0 }
a_frame.load('example.csv', frame_to_file_column_mapping = column_map)

If the frame_to_file_column_mapping parameter is used, it should contain all frame properties into which data should be inserted. If any frame column is missing from this dictionary, null values will be inserted into it for all ingested rows. This is the case even if the file header contains a column name that matches a property name. In the example below, the “firstName” property will be null for all ingested rows:

column_map = { 'id' : 'ID', 'lastName' : 'surname', 'birthday' : 'DOB' }
a_frame.load('example.csv', header_mode = xgt.HeaderMode.NORMAL, frame_to_file_column_mapping = column_map)

Note that the use of frame_to_file_column_mapping with header mode STRICT is not supported. If frame_to_file_column_mapping contains any file column names, these are obtained from the header. Therefore, in this case the header mode NORMAL must be used. If only file column positions are used, the header mode can be NORMAL, IGNORE, or NONE.

If frame_to_file_column_mapping is not passed in and header mode IGNORE or NONE is used, then xGT will use the order of columns in the file to ingest data. The first file column will be inserted into the first property of the frame’s schema, the second into the second, and so on. The order of the schema is determined when the frame is created and can be seen in schema. Note that this automatic mapping may also be affected if the row_label_columns parameter is used to designate certain file columns as security label columns, because a file column cannot be used as both a frame data column and a security label column (see Section Setting Row Labels for adding security labels).

The expected CSV delimiter is by default a comma, but can be set with the delimiter parameter.
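For example, a pipe-delimited file could be loaded as follows (a sketch; the path is illustrative):

```python
a_frame.load('/data/path/my_edges.psv', delimiter = '|')
```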

The last thing to note is that load will ignore any empty lines it finds in the file.

4.5.1.8. Reading Parquet Files

Parquet is a columnar storage format from the Hadoop ecosystem. Data in this format compresses more easily due to its columnar nature, so it is used by many big-data processing frameworks.

Currently, xGT supports loading Parquet files from the server or client filesystem. Loading from network protocols such as s3 or http is not currently supported. In general, Parquet’s internal compression codecs are supported, including bz2, zlib, lz4, snappy, zstd, and Brotli. These files require the .parquet extension in order to be identified by xGT.

xGT allows loading Parquet files where the Parquet schema is longer than the xGT frame schema; any additional columns will be ignored. In cases where the schema is shorter or needs to be mapped, xGT supports the frame_to_file_column_mapping parameter when loading Parquet files to map columns in a Parquet file to columns in a frame. A detailed explanation of how mapping works can be found in Reading from a CSV File. When ingesting Parquet, xGT maps using the Parquet schema column names instead of a CSV header. Since the header modes are ignored, they do not affect the mapping behavior.

xGT supports loading into all xGT types. The table below maps Parquet types to xGT types. All the types and mappings in the table are also supported within the nested List type. The following Parquet types are not supported: int96 and byte_array. The following Parquet logical types are not supported: enumeration, UUID, interval, JSON, BSON, and maps.

The following table indicates the supported mappings from Parquet extended types (types and logical types) to xGT types:

xGT types (table columns): Boolean, Integer, Float, Date, Time, Date Time, IP Address, Text, List

Parquet extended types (table rows): Boolean, Signed/Unsigned Integers, Float/Double, Decimal (✓ 128 bit only), String, Date, Time, Date Time, Null, List

Parquet files are divided into metadata and data. Data is partitioned into chunks of rows called row groups. We recommend using row groups of at least 1,000 rows to allow for parallel reading of the files. There is a known issue where files over ~5 gigabytes can fail with Apache Arrow if the metadata becomes too large, so we recommend splitting up larger files or increasing the row group size. xGT will validate the schema conversion before ingesting the data. However, in some cases missing data may produce individual errors for specific rows (e.g., empty key columns for an edge). In this case, the reported line number is relative to the row group.

Some parameters such as header mode or delimiter are ignored when reading Parquet files.

4.5.1.9. Compressed CSV Files

When loading data with CSV files, two types of compression are supported: bzip2 and gzip. These files need .bz2 or .gz extensions to identify the file type. Loading compressed files remotely with the xgtd protocol or from other network locations such as the web or s3 is supported; however, loading from the client filesystem is currently not supported. In general, there is limited parallelism when decompressing these files, so we recommend splitting them into multiple files.

4.5.1.10. Setting Row Labels

If a frame supports row security labels, they can be attached to the data during ingest. Only a user that has those security labels in their label set will be able to access that vertex, edge, or table frame row when reading data from the frame, including during a TQL query or when egesting from the frame. The label set of a user will be configured by the administrator as described in Configuring Groups and Labels.

For an ingest operation performed with either load() or insert(), it is possible to attach row labels in one of two ways. The first option is to attach the same set of labels to each new row. This is done by passing in the row_labels parameter, which should be a list or set of string labels. These labels must exist in the xgt__Label frame, as described in Configuring Groups and Labels. The following example shows adding security labels this way.

a_frame.load(
    paths = '/data/path/my_vertices.csv',
    row_labels = ["label2", "label3", "label9"])

The second option is to specify different labels for each row ingested during a single load() or insert() call. This is done by adding security labels as additional columns to the ingested data and using the row_label_columns parameter to indicate which column is a label.

When calling insert(), any non-label column must contain schema data in the appropriate order. In the example below, the vertex with key 0 is inserted with one security label: “label1”. The vertex with key 1 is inserted with two security labels: “label2” and “label4”. The vertex with key 2 is inserted with no security labels.

a_frame = conn.create_vertex_frame(
    name = 'Name',
    schema = [['id', xgt.INT], ['name', xgt.TEXT]],
    key = 'id',
    row_label_universe = ["label1", "label2", "label3", "label4", "label5"])
a_frame.insert(
    data = [[0, "val1", "label1", ""], [1, "val2", "label2", "label4"], [2, "val1", "", ""]],
    row_label_columns = [2, 3])

Note that the number of elements in each list of the data parameter must be the same for each row. To attach up to N security labels to a row for the ingest operation, the length of each list must be equal to the length of the frame schema plus N. The order of the labels in each list does not matter.

When calling load() with the row_label_columns parameter, the CSV file can contain additional non-schema columns with security labels. The number of columns in each row must be the same and if any row has fewer labels, that column element should be empty. The order of the labels in each row of the file does not matter. For example, to ingest into a vertex frame with schema [['id', xgt.INT], ['name', xgt.TEXT]], the CSV file with header might look like this:

id, name, label_col1, label_col2, label_col3
0, val1, label1,,
1, val2, label2, label4,
2, val1,,,
3, val3, label9, label8, label1

If the header mode is xgt.HeaderMode.NONE or xgt.HeaderMode.IGNORE, the row_label_columns parameter, if passed in, must be a list of integer column indices, indicating those columns that contain security labels. For example, the data in the sample file above would be ingested as follows:

a_frame.load(
    paths = '/data/path/my_vertices.csv',
    header_mode = xgt.HeaderMode.IGNORE,
    row_label_columns = [2, 3, 4])

If the header mode is xgt.HeaderMode.NORMAL or xgt.HeaderMode.STRICT, then the header column names help determine the label columns. In this case, the row_label_columns parameter must be a list of strings, indicating the names of columns that contain security labels:

a_frame.load(
    paths = '/data/path/my_vertices.csv',
    header_mode = xgt.HeaderMode.NORMAL,
    row_label_columns = ["label_col1", "label_col2", "label_col3"])

Note that the security label columns in the ingested CSV file can be any columns. They do not need to be placed at the end after schema columns. Only one of row_label_columns and row_labels can be passed into a load() or insert() call.

When ingesting into an edge frame, if a source or target vertex does not exist, it will be automatically created. Row labels can be attached to these automatically created endpoints via the parameters source_vertex_row_labels and target_vertex_row_labels on load() or insert(). The following example shows adding security labels this way.

an_edge_frame.load(
    paths = '/data/path/my_edges.csv',
    source_vertex_row_labels = ["label2", "label3", "label9"],
    target_vertex_row_labels = ["label1", "label3", "label5"])

If the source and target vertex frames are the same for an edge frame, the source and target vertex security labels need to be the same on insert or load.

4.5.1.11. Handling Ingest Errors

There are a wide range of errors that may result in an ingest operation not completing as intended. One type of error is the inability to find the file. This could occur if an incorrect path is provided, or if the userid of the running xGT server (xgtd) does not have read permission on the file when the xgtd:// protocol is specified.

If a correct path is given and the file permissions are satisfied, many possible errors remain, for example:

  • The frame’s schema could be wrong by having too many or too few columns.

  • The frame’s schema could be wrong by having the wrong data type given for one or more of the columns.

  • The CSV file may have a non-standard value delimiter (e.g., '|' rather than ',').

  • The process used to build the CSV file may have incorrectly formatted certain data types (e.g., not surrounding a string containing the delimiter character with double quotes).

With these kinds of ingestion errors based on the CSV format and frame’s schema, xGT may be able to ingest only some rows in the CSV file. If that occurs, there will be a Python exception raised with a message describing the outcome of the ingest operation (e.g., the number of lines that led to ingest errors) and a detailed message with up to 10 errors explaining what xGT perceived went wrong. These error messages are produced based on limited understanding of what is really wrong. For example, it is not possible to know whether a row in the CSV is ill-formed or if the frame’s schema was created incorrectly. So, these error messages are intended to provide the user as much guidance as possible on how to quickly determine the cause of the problem.

If an ingest operation encounters no errors, all the ingested data is added to the graph. If an ingest operation does encounter an error, by default xGT will raise an exception to inform the user of the error. However, if suppress_errors is set to true on a load and the ingest operation encounters one or more errors, xGT will insert all rows that were ingested without errors and then raise an exception describing the errors. In this case of some data being correct and some data having errors, the updated frame is ready to be used; it is just missing rows for the lines that led to ingest errors. If all the ingested data leads to errors, then the frame will remain unchanged. This could happen, for example, if a frame is created with an incorrect schema.

4.5.1.12. Digging Deeper into Ingest Errors

If suppress_errors is on and the user wishes to see more than 10 error messages, xGT provides a way to access up to 1,000 messages. Note that for situations such as a bad schema, when trying to ingest a 1,000,000-row CSV file, there is no way to retrieve all 1,000,000 error messages.

To see the first 1,000 errors, the user’s script will need to catch the raised XgtIOError exception. The exception object has a property that represents the job object for the ingest operation, and the first 1,000 errors are available from that job object. These errors can be retrieved via get_data() or get_data_pandas().

The error rows will contain a list of lines that failed with corresponding error messages. Each row corresponds to a line that had an ingest error. The first column contains the error message. The second column contains the file that contained the error, if available. Otherwise, it contains an empty string. For CSV files, the third column contains the line number that contained the error, if available. In insert operations, the third column will be the position in the list passed into insert. With CSV files, this number will begin at 1, and for insert operations, this will start at 0. If the position or line number is not available, this will be -1. The fourth column contains the text of the line that failed to ingest, if available. Otherwise, it contains an empty string. The remaining columns contain the data that was encoded before the failure occurred. This will usually be empty since most errors occur before encoding.

try:
  job = a_frame.load('test.csv', suppress_errors = True)
except xgt.XgtIOError as e:
  error_rows = e.job.get_data()
  # Either print the exception object and the error_rows data structure,
  # or perform some operations over the error_rows data structure.

4.5.2. Getting Data out of xGT

The save() method of a frame object is used to request xGT to write data into a file on either the client or server filesystem. Saving currently supports both CSV and Parquet; saving compressed CSVs is not supported. To save as a Parquet file, use the .parquet extension when specifying the path. There are some limitations for Parquet files (see Writing Parquet Files). The signature of the save method is: save(path, offset=0, length=None, headers=False, ...), and it is called on a frame object just like load. There are only two variants of the path for the save method: the client filesystem and the server filesystem. When saving, the ordering of data on the server may not be preserved, for performance reasons. To guarantee the ordering, set the parameter preserve_order to true.

Note that if a frame has row security protection, not all data in a frame may be visible. As described in Security Labels, an authenticated user has security labels that change the visibility of data. A Python user can only see or egest frame rows whose security labels are a subset of the labels in the user’s label set (user labels).

4.5.2.1. Saving to the Client Filesystem

This method of saving data writes frame data to a file on the filesystem local to the client. The user provides an absolute or relative path to the output file on the client’s local filesystem. This is indicated by using either no path prefix or the xgt:// protocol prefix. If there is no protocol supplied, the default behavior is to use the xgt:// protocol.

frame.save('xgt:///data/path/example.csv')
frame.save('../data/path/example.csv')
frame.save('data/interesting.data.csv', offset = 10, length = 100, headers = True)

Note that this will send data from the server to the client before writing the file, which is much slower than having the server write the file to disk, even if the client is running on the same machine as the server. It is recommended that only smaller datasets be saved directly to the client filesystem. The third frame.save() example shows how to limit the number of rows saved: it pulls at most 100 rows starting from row 10. It also demonstrates how to include the column names as a header.

A saved file can be read directly into some other analytic tool such as MS Excel.
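For example, a CSV saved with headers = True loads directly into pandas; here the file is stood in for by an in-memory buffer (a sketch; a real script would pass the saved file's path to read_csv):

```python
import io

import pandas as pd

# Stand-in for a file produced by save(..., headers = True).
saved = io.StringIO('id,name\n0,val1\n1,val2\n')
df = pd.read_csv(saved)
# df has columns ['id', 'name'] and two rows
```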

4.5.2.2. Saving to the Server Filesystem

This method of saving data is a request for the xGT server to write frame data to the remote filesystem where xGT is running. The user provides a relative path on the server’s filesystem using the xgtd:// protocol. As with loads, the actual location on the xGT server’s filesystem must be within the configured sandbox. All paths will be relative to the root sandbox directory, and any path that escapes the sandbox is invalid. When saving to the server, you can also split the save into multiple files by using the number_of_files parameter. This inserts a part number between the file name and its extension.

frame.save('xgtd://data/path/example.csv')
frame.save('xgtd://data/interesting.data.csv', offset = 10, length = 1000000, headers = True)
frame.save('xgtd://data/path/example.parquet', number_of_files = 2)

This method should be used when you know the size of the data is prohibitively large. It is certainly possible to copy the data elsewhere after saving to the server filesystem. For large datasets it is usually faster to save the file on the server and copy it to the client system than to have xGT directly save the file to the client system.

4.5.2.3. Saving to Python

To easily get a small amount of frame data into Python, call the get_data() or get_data_pandas() method of a frame object. These return frame data as a list of lists or as a pandas DataFrame, respectively. Data from these methods will always be in the same order as it is stored on the server.
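For example (a sketch on a frame object obtained as shown earlier):

```python
rows = a_frame.get_data()        # a list of rows, each a list of values
df = a_frame.get_data_pandas()   # the same data as a pandas DataFrame
```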

4.5.2.4. Writing Parquet Files

xGT supports writing Parquet files on both the server and client filesystems. The parameters headers, include_row_labels, and row_label_column_header are not supported for Parquet. When writing Parquet files, xGT uses Snappy compression and a row group size of 10,000, and the column names will correspond to the names of the columns in the frame.

xGT maps its native data types to Parquet types and logical types in the following manner (See Parquet Logical Types for more information about how typing works in Parquet files.):

Parquet Type and Logical Type Mapping

xGT Type   | Parquet Type                    | Observations
-----------|---------------------------------|-----------------------------------------------------------
BOOLEAN    | BOOLEAN                         | 1-bit boolean.
INTEGER    | Logical INT(64, true)           | Signed 64-bit integer.
FLOAT      | FLOAT                           | IEEE 32-bit floating point number.
DATE       | Logical DATE                    | 32-bit int representation.
TIME       | Logical TIME(true, MICROS)      | 64-bit time representation with microsecond precision and UTC adjustment.
DATETIME   | Logical TIMESTAMP(true, MICROS) | 64-bit timestamp representation with microseconds since Unix epoch and UTC adjustment.
IPADDRESS  | Logical STRING                  | UTF-8 encoded byte array.
TEXT       | Logical STRING                  | UTF-8 encoded byte array.
LIST       | List                            | The element type uses the above mappings; nesting levels are equivalent.

4.5.2.5. Saving Row Labels

If a frame supports row labels, those labels can be egested along with the frame data by setting the parameter include_row_labels to true. By default, it is set to false and no labels are egested.

When retrieving data with get_data() passing include_row_labels = True, the labels for each row will be appended as columns following the data columns. Each row will have one column per security label found in row_label_universe. In other words, there will be one column per security label that is both in the frame’s row label universe and in the user’s label set. If the corresponding label is not attached to that row, the column will be the empty string. The example below shows retrieving data from a frame with a row label universe of “label1”, “label2”, “label4”.

a_frame = conn.create_vertex_frame(
    name = 'Name',
    schema = [['id', xgt.INT], ['name', xgt.TEXT]],
    key = 'id',
    row_label_universe = ["label1", "label2", "label4"])
a_frame.insert(
    data = [[0, "val1", "label1", ""], [1, "val2", "label2", "label4"], [2, "val1", "", ""]],
    row_label_columns = [2, 3])

data = a_frame.get_data(include_row_labels = True)
# data is:
# [[0, 'val1', 'label1', '', ''], [1, 'val2', '', 'label2', 'label4'], [2, 'val1', '', '', '']]

When egesting with save(), security labels are written as additional columns to the CSV file if the parameter include_row_labels is set to true. Each row in the file will have one column per security label in the frame’s row label universe visible to the user. The parameter row_label_column_header can be passed in to indicate the header column name for any label column in the file. Otherwise, the default label column name is “ROWLABEL”. This example egests a frame with row labels to a file.

a_frame = conn.create_vertex_frame(
    name = 'Name',
    schema = [['id', xgt.INT], ['name', xgt.TEXT]],
    key = 'id',
    row_label_universe = ["label1", "label2", "label4"])
a_frame.insert(
    data = [[0, "val1", "label1", ""], [1, "val2", "label2", "label4"], [2, "val1", "", ""]],
    row_label_columns = [2, 3])

a_frame.save("FrameOutput.csv", include_row_labels = True,
             row_label_column_header = "security_label")

The resulting content of “FrameOutput.csv” is shown below:

id, name, security_label, security_label, security_label
0, val1, label1, ,
1, val2, , label2, label4
2, val1, , ,

The include_row_labels parameter can also be passed to get_data_pandas() so that the labels are returned as additional columns in the Pandas DataFrame. The optional parameter row_label_column_header is used to set the name of the label columns, which is by default “ROWLABEL”.

pandas_data_frame = a_frame.get_data_pandas(include_row_labels = True)

4.5.3. Performance Considerations

For filesystem loads and saves, reading or writing from the server is orders of magnitude faster than from the client. This is true even if the client is running on the same machine as the server. The server is able to employ parallelism that isn’t available in the Python client when loading and saving files. Additionally, data must be serialized, sent through a gRPC channel, and deserialized when transferring between client and server.

4.5.4. S3 Credentials

When loading a file using the load() method and an s3: protocol prefix, xGT will check the xgtd user’s home for the .aws/credentials file with the variables aws_access_key_id and aws_secret_access_key. If the credentials file doesn’t exist, xGT will check for the configuration variables aws.access_key_id and aws.secret_access_key. The online AWS Access Keys document explains what these keys are and contains references to learn how to create and manage them. (These values may also be specified at runtime in user code by passing them to a Connection object.)

When reading the .aws/credentials file, xGT also supports profile selection via the AWS_PROFILE environment variable. If no environment variable is found, xGT will use the default profile.
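As a sketch, a .aws/credentials file in the xgtd user’s home directory follows the standard AWS INI format; the profile names and placeholder values below are illustrative only:

```
[default]
aws_access_key_id = <default-access-key-id>
aws_secret_access_key = <default-secret-access-key>

[analytics]
aws_access_key_id = <analytics-access-key-id>
aws_secret_access_key = <analytics-secret-access-key>
```

With this file in place, setting the AWS_PROFILE environment variable to analytics in the xgtd server’s environment selects the second profile; if the variable is unset, the default profile is used.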

4.5.5. Arrow Flight Server Endpoint

In addition to the native data movement capabilities provided by xGT’s client and server, xGT also provides an Arrow Flight server endpoint. The Flight server endpoint is serviced from the same gRPC port as xGT’s native data movement (4367). Arrow clients that use this port on the xGT server can be written in different languages, including Python (pyarrow), C++, Java, and many other languages used for data analytics.

Each frame that the user has access to is mapped to an individual flight object, using a two-level path flight descriptor corresponding to the namespace and the name of each frame (e.g., graph__VertexFrame is mapped to a path descriptor for (graph, VertexFrame)). Note that frames must be specified using their fully qualified name, that is, including their namespace, when being accessed through the Arrow endpoint.

The frame’s xGT schema is mapped to a compatible schema using Arrow data types.

Currently, xGT’s Arrow Flight server implements get and put capabilities for each individual frame/flight according to the user’s access permissions. It does not currently support the exchange or action protocols. The list flights call on xGT’s Arrow server is a convenient way to list all frames the user has access to, along with their mappings to flight paths, their Arrow schemas, and other relevant properties. More information on Arrow’s client and server interfaces is available in the Arrow Python bindings (for Python) and the Arrow documentation (for more general documentation and other languages).

Data I/O operations through xGT’s Arrow server utilize the same authentication, access control and transactional backend components as xGT’s native I/O operations. This allows for safe, authenticated, concurrent access to frames via Arrow flight operations, native I/O operations, as well as read-only and read-write TQL queries.

It is possible to have simultaneous connections to a single xGT server through the Arrow client and through the native xGT client. The following example connects to xGT’s Arrow flight server using the Arrow Python client pyarrow and the native xGT client xgt.

import pyarrow.flight
import xgt

# Connect to the xgtd server through Arrow Flight.  The default xGT server
# port is 4367.
arrow_conn = pyarrow.flight.FlightClient(("localhost", 4367))
# Supply valid authentication credentials.
# basic_client_auth_handler must be a client authentication handler that
# implements the pyarrow.flight.ClientAuthHandler interface.
arrow_conn.authenticate(basic_client_auth_handler)

# Create a regular connection to the xgtd server, also supplying valid
# credentials.
conn = xgt.Connection()

# Create a frame on the xGT server.
frame = conn.create_table_frame(
  name   = 'graph__Table',
  schema = [['col0', xgt.INT]])

# Insert some data into the frame using xGT's native I/O protocol.
row_data = [ [x] for x in range(0, 100) ]
frame.insert(row_data)

# Now get the data back from xGT in Arrow format from the Arrow Flight server
# endpoint.
xgt_table = arrow_conn.do_get(pyarrow.flight.Ticket(b"graph__Table")).read_all()

# Convert the retrieved Arrow-formatted data into a Pandas frame.
pandas_table = xgt_table.to_pandas()

The example above illustrates using the native xGT client to create a frame and write data, and the Arrow client to read the data back. Note that the example maintains two separate connections to the server: an Arrow connection and a native xGT client connection.

xGT maps its native data types to Arrow Flight data types in the following manner during a do_get:

Data Type Mapping

xGT Type    Arrow Type               Observations
---------   -----------------------  ------------------------------------------
BOOLEAN     arrow::boolean
INTEGER     arrow::int64             Signed 64-bit integer.
FLOAT       arrow::float32           32-bit floating point number.
DATE        arrow::date32            32-bit date representation.
TIME        arrow::time64(MICRO)     64-bit time representation with
                                     microsecond precision.
DATETIME    arrow::timestamp(MICRO)  64-bit timestamp representation with
                                     microsecond precision.
IPADDRESS   arrow::utf8              String representation of an IPv4 address.
TEXT        arrow::utf8              utf8 string.
LIST        arrow::list              Element type of the list is mapped
                                     accordingly.

xGT can map the following Arrow Flight data types to xGT types during a do_put:

Arrow Type                 xGT Type
-------------------------  ---------------------------
boolean                    Boolean
uint/int(64, 32, 16, 8)    Integer
float(64, 32)              Float
decimal(256, 128)          Float (128 bit only)
string/utf8                Text, IP Address
date(64, 32)               Date
time(64, 32)               Time
timestamp(s, ms, us, ns)   Date Time
null                       Any type (null values)
list                       List

The example below illustrates the use of Arrow’s put capability to send some local data (stored in Arrow format) to xGT’s Arrow endpoint for the frame graph__Table. It also illustrates how xGT types are mapped to Arrow types.

# Create a frame on the xGT server.
frame = conn.create_table_frame(
  name   = 'graph__Table',
  schema = [['col0', xgt.INT], ['col1', xgt.TEXT]])

# Set up a local Arrow table to be sent to the server.

# Note that Arrow data is stored in columnar format ("column-major") instead
# of row format ("row-major").
num_rows = 100
data = [ [ i for i in range(num_rows) ],
         [ "val" + str(i) for i in range(num_rows) ] ]

fields = [ pyarrow.field('col0', pyarrow.int64()),
           pyarrow.field('col1', pyarrow.utf8()) ]

table = pyarrow.Table.from_arrays(data, schema = pyarrow.schema(fields))

writer, _ = arrow_conn.do_put(
  pyarrow.flight.FlightDescriptor.for_path('graph', 'Table'), table.schema)
writer.write_table(table)
writer.close()

Additional options that configure how data transfers through xGT’s Arrow endpoint can be passed as part of the flight descriptor. When building a flight descriptor for a put, these options are specified as extra path components at the end of the path. An example would be disabling implicit vertex creation when reading in data for an edge frame/flight:

writer, _ = arrow_conn.do_put(
    pyarrow.flight.FlightDescriptor.for_path('ns', 'e1',
                                             '.implicit_vertices=false'),
    e_table.schema)

# Multiple parameters are passed like so:
writer, _ = arrow_conn.do_put(
    pyarrow.flight.FlightDescriptor.for_path('ns', 'e1',
                                             '.implicit_vertices=false',
                                             '.label_column_indices=1,4'),
    e_table.schema)

The following I/O options are supported by xGT’s Arrow endpoint when writing frames:

Arrow I/O put options for xGT

.suppress_errors
    If set to true, ingests all valid rows; if any errors are encountered,
    raises an exception after all rows have been ingested, reporting up to
    1000 errors. Otherwise, raises on the first error encountered and does
    not ingest the data. By default, this is false.

.implicit_vertices
    If set to true, vertex instances are created if they don't exist when
    reading in data for an edge frame/flight. By default, this is true.

.labels
    List of row labels to add for row-level data when reading in data for a
    frame/flight. This is optional.

.label_column_names
    List of column names that store labels for row-level data when reading in
    data for a frame/flight. Cannot be set together with .labels. This is
    optional.

.implicit_source_vertex_labels
    List of the default row-level labels used for implicit source vertex
    instances. This is optional.

.implicit_target_vertex_labels
    List of the default row-level labels used for implicit target vertex
    instances. This is optional.

.label_column_indices
    List of the column positions that store labels for row-level data when
    reading in data for a frame/flight. This is optional.

.map_column_names
    List of mappings in the format key:value, where key is the Arrow column
    name to map to the frame column name. Example:
    [arrow_col1:xgt_col2, arrow_col2:xgt_col1]. This is optional.

.map_column_ids
    List of mappings in the format key:value, where key is the Arrow column
    name to map to the frame column position. Example:
    [arrow_col1:2, arrow_col2:1]. This is optional.

When creating a ticket for a flight read, specify the fully qualified name of the frame followed by optional parameters in the format NAME[.OPTION=VALUE]. An example of reading a table with a specified offset and maximum number of rows to return:

xgt_table = arrow_conn.do_get(pyarrow.flight.Ticket(b"graph__Table.offset=10.length=50")).read_all()

The following I/O options are supported by xGT’s Arrow endpoint when retrieving frames:

Arrow I/O get options for xGT

.offset
    Index of the first row to retrieve. Default is 0.

.length
    Maximum number of rows to retrieve. If not set, all rows are returned.

.order
    If set to true, the data is returned in the same order as it is stored on
    the server. Otherwise, the order is not guaranteed. Default is false.

.dates_as_strings
    If set to true, Date, Time, and DateTime types are returned as strings.
    Default is false.

.egest_row_labels
    If set to true, the security labels for each row are egested along with
    the row. Default is false.