4.4. Data Movement
The xGT tool implements a strongly typed, property graph data model.
The fundamental building blocks of this model are frames: tabular representations of data with prescribed data types.
In order to create a frame, a user specifies the names and data types of each property (called a schema) as well as any other information specific to that structure (for instance, the key property for a vertex frame).
For more details, see create_table_frame(), create_vertex_frame(), and create_edge_frame().
When a user wants to load data, they do so via an existing frame. xGT provides support for a variety of different data sources and loading options on these frames.
Users can get frame objects either from the output of the create_*_frame() methods or by explicitly calling the matching get_*_frame() method with the name or names of the frames they want.
More detailed information about getting frames can be found in the API documentation: get_table_frame(), get_vertex_frame(), and get_edge_frame().
Here is an example of getting an edge frame and assigning it to a frame object named a_frame:
# *conn* is a connection to the server.
a_frame = conn.get_edge_frame('example__Name')
4.4.1. Getting Data into xGT
There are multiple ways to get data into xGT: from a filesystem, from a network location, and from a Python object.
Data from a filesystem or network location is loaded by calling load() on the respective frame object.
Data from a Python object is loaded by calling insert() on the respective frame object.
The signature of the load method is load(paths, headerMode = xgt.HeaderMode.NONE, ...), and it is called on a frame object returned from the create_*_frame() or get_*_frame() methods. Here is an example of this:
a_frame = conn.get_edge_frame('example__Name')
job = a_frame.load('/data/path/my_edges.csv')
The load method provides some flexibility for ingesting a CSV file by location and via header rules.
The paths parameter describes where xGT can find the CSV file or files and uses an optional protocol prefix to indicate which system the file is located on.
This can be local to the script, local to the server, or an external server.
The headerMode parameter is a flag indicating whether the CSV data sources contain a first line that is a header (i.e., column names) and how it should be handled.
There are four modes:
xgt.HeaderMode.NONE means there is no header line in the CSV file.
xgt.HeaderMode.IGNORE means there is a header line, but the server should ignore it entirely.
xgt.HeaderMode.NORMAL means map the header to the schema in a relaxed way. It will ignore columns it can’t map to property names and fill in columns that aren’t mapped with null values.
xgt.HeaderMode.STRICT means xGT will raise an exception if the schema isn’t fully mapped or if additional columns exist in the file and not in the schema. The header line can be modified to work with “strict” mode by using the special string IGNORE in the header to ignore a column instead of producing an error.
The load method will return a completed job that indicates the status of the load. The last thing to note is that load will ignore any empty lines it finds in the file.
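The difference between the relaxed and strict header modes can be illustrated with a small, self-contained sketch. It does not call xGT and is not xGT's implementation; map_header is a hypothetical helper that only mimics the rules described above for xgt.HeaderMode.NORMAL (strict=False) and xgt.HeaderMode.STRICT (strict=True).

```python
def map_header(header, schema_names, strict=False):
    """Map schema property names to CSV column indices (illustration only)."""
    mapping = {}
    extra = []
    for index, column in enumerate(header):
        name = column.strip()
        if name in schema_names:
            mapping[name] = index
        elif name != "IGNORE":
            # A column that is neither a schema property nor marked IGNORE.
            extra.append(name)
    missing = [name for name in schema_names if name not in mapping]
    if strict and (missing or extra):
        raise ValueError(
            f"unmapped properties {missing}, unknown columns {extra}")
    # Relaxed mode: unknown columns are skipped, unmapped properties are null.
    for name in missing:
        mapping[name] = None
    return mapping


relaxed = map_header(["id", "extra", "name"], ["id", "name", "age"])
strict = map_header(["id", "IGNORE", "name"], ["id", "name"], strict=True)
```

In the relaxed call the unknown column is skipped and the unmapped property is recorded as None, while the strict call succeeds only because the extra column is named IGNORE.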
4.4.1.1. Reading from the Client Filesystem
This method of loading data reads data from the filesystem local to the client.
The paths parameter is simply an absolute or relative path on the client’s local filesystem.
A relative path is relative to the working directory of the Python script.
This is indicated by using either no path prefix or the xgt:// protocol prefix.
If no protocol is supplied, the default behavior is to use the xgt:// protocol.
a_frame.load('/data/path/my_vertices.csv')
# Multiple files are passed as a single list of paths.
another_frame.load(['xgt:///data/path/my_edges.csv', '/data/path/my_other_edges.csv'])
These load calls request xgt to search the local filesystem for the files and then send them to the xGT server.
When the server is remote, the data travels across the network.
Reading from the server filesystem is much faster than loading from the client filesystem as it avoids having to transmit the data across the network.
It is strongly recommended to load large files from the server filesystem.
4.4.1.2. Reading from the Server Filesystem
This method of loading data is a request for the xGT server to read files from the remote filesystem where xGT is running.
The path is preceded by the xgtd:// protocol, which tells xgt to pass the request to the xGT server.
Note that the actual location on the xGT server’s filesystem must be within the configured sandbox (see Enterprise Data Movement).
All paths will be relative to the root sandbox directory, and any path that escapes the sandbox is invalid.
The default sandbox location is /srv/xgtd/data/ but can be configured to be different by your system administrator.
Consult your system administrator if you do not know the location of the sandbox directory or do not have access to it.
a_frame.load('xgtd://data/path/example.csv')
another_frame.load(['xgtd://data/path/myfirst.csv', 'xgtd://data/path/mysecond.csv'])
4.4.1.3. Reading from a URL
This method of loading data asks the xGT server to retrieve CSV-formatted data from a URL.
The protocol can be either http:// or https://.
a_frame.load('http://www.example.com/data/path/example.csv')
another_frame.load(['http://www.example.com/data/myfirst.csv', 'https://www.example.com/data/mysecond.csv'])
4.4.1.4. Reading from an AWS S3 Bucket
This method of loading data asks the xGT server to pull CSV-formatted data directly from an AWS S3 bucket and uses the s3:// protocol.
a_frame.load('s3://my-s3-bucket/data/path/example.csv')
another_frame.load(['s3://my-s3-bucket/data/myfirst.csv', 's3://my-s3-bucket/data/mysecond.csv'])
You will need to have S3 credentials set for this (see S3 Credentials).
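The protocol prefixes described in the sections above can be summarized with a short sketch. describe_source is a hypothetical helper and not part of the xgt package; it only mirrors the prefix rules in this section, including the default of treating a path with no prefix as a client path.

```python
# Prefixes taken from the sections above; the descriptions are informal.
PREFIXES = [
    ("xgtd://", "server filesystem"),
    ("xgt://", "client filesystem"),
    ("http://", "URL"),
    ("https://", "URL"),
    ("s3://", "AWS S3 bucket"),
]


def describe_source(path):
    """Return (source kind, path without its protocol prefix)."""
    for prefix, kind in PREFIXES:
        if path.startswith(prefix):
            return kind, path[len(prefix):]
    # No protocol supplied: behaves like the xgt:// (client) protocol.
    return "client filesystem", path


for example in ["/data/path/my_vertices.csv",
                "xgtd://data/path/example.csv",
                "s3://my-s3-bucket/data/path/example.csv"]:
    print(describe_source(example))
```

Note that the remainder of an xgtd:// path is relative to the sandbox root, whereas the remainder of an xgt:// path is an ordinary client path.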
4.4.1.5. Inserting from Python
Data can also be ingested directly from a Python list by using the insert() method of a frame object.
a_frame.insert([[0, 0, "val1"], [1, 0, "val2"], [1, 5, "val3"]])
4.4.1.6. Setting Row Labels
If a frame supports row security labels, they can be attached to the data during ingest. Only a user that has those security labels in their label set will be able to access that vertex, edge, or table frame row when reading data from the frame, including during a TQL query or when egesting from the frame. The label set of a user will be configured by the administrator as described in Configuring Groups and Labels.
For an ingest performed with either load() or insert(), it is possible to attach row labels in one of two ways.
The first option is to attach the same set of labels to each new row.
This is done by passing in the row_labels parameter, which should be a list or set of string labels.
These labels must exist in the xgt__Label frame, as described in Configuring Groups and Labels.
The following example shows adding security labels this way.
a_frame.load(
    paths = '/data/path/my_vertices.csv',
    row_labels = ["label2", "label3", "label9"])
The second option is to specify different labels for each row ingested during a single load() or insert() call.
This is done by adding security labels as additional columns to the ingested data and using the row_label_columns parameter to indicate which columns contain labels.
When calling insert(), the non-label columns must contain the schema data in the appropriate order.
In the example below, the vertex with key 0 is inserted with one security label: “label1”.
The vertex with key 1 is inserted with two security labels: “label2” and “label4”.
The vertex with key 2 is inserted with no security labels.
# *conn* is a connection to the server.
a_frame = conn.create_vertex_frame(
    name = 'example__Name',
    schema = [['id', xgt.INT], ['name', xgt.TEXT]],
    key = 'id',
    row_label_universe = ["label1", "label2", "label3", "label4", "label5"])
a_frame.insert(
    data = [[0, "val1", "label1", ""], [1, "val2", "label2", "label4"], [2, "val1", "", ""]],
    row_label_columns = [2, 3])
Note that the number of elements in each list of the data parameter must be the same for each row.
To attach up to N security labels to a row for the ingest, the length of each list must be equal to the length of the frame schema plus N.
The order of the labels in each list does not matter.
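The padding rule above (schema length plus N label slots per row) can be sketched in plain Python. build_labeled_rows is a hypothetical helper, not an xgt function; it pads every row with empty strings so that all rows have equal length and returns the column indices that would be passed as row_label_columns.

```python
def build_labeled_rows(rows, labels_per_row):
    """Append label columns to schema rows, padding with empty strings."""
    if not rows:
        return [], []
    schema_len = len(rows[0])
    # N is the largest number of labels attached to any single row.
    width = max(len(labels) for labels in labels_per_row)
    data = [
        list(row) + list(labels) + [""] * (width - len(labels))
        for row, labels in zip(rows, labels_per_row)
    ]
    row_label_columns = list(range(schema_len, schema_len + width))
    return data, row_label_columns


data, row_label_columns = build_labeled_rows(
    [[0, "val1"], [1, "val2"], [2, "val1"]],
    [["label1"], ["label2", "label4"], []])
# The result could then be passed on, for example:
# a_frame.insert(data = data, row_label_columns = row_label_columns)
```

The rows and labels here are the same as in the insert() example above, so the helper reproduces that call's data and row_label_columns arguments.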
When calling load() with the row_label_columns parameter, the CSV file can contain additional non-schema columns with security labels.
Every row must have the same number of columns; if a row has fewer labels than there are label columns, the unused label fields should be left empty.
The order of the labels in each row of the file does not matter.
For example, to ingest into a vertex frame with schema [['id', xgt.INT], ['name', xgt.TEXT]], the CSV file with header might look like this:
id, name, label_col1, label_col2, label_col3
0, val1, label1,,
1, val2, label2, label4,
2, val1,,,
3, val3, label9, label8, label1
If the header mode is xgt.HeaderMode.NONE or xgt.HeaderMode.IGNORE, the row_label_columns parameter, if passed in, must be a list of integer column indices indicating the columns that contain security labels.
For example, the data in the sample file above would be ingested as follows:
a_frame.load(
    paths = '/data/path/my_vertices.csv',
    headerMode = xgt.HeaderMode.IGNORE,
    row_label_columns = [2, 3, 4])
If the header mode is xgt.HeaderMode.NORMAL or xgt.HeaderMode.STRICT, then the header column names help determine the label columns.
In this case, the row_label_columns parameter must be a list of strings, indicating the names of columns that contain security labels:
a_frame.load(
    paths = '/data/path/my_vertices.csv',
    headerMode = xgt.HeaderMode.NORMAL,
    row_label_columns = ["label_col1", "label_col2", "label_col3"])
Note that the security label columns in the ingested CSV file can be any columns.
They do not need to be placed at the end after schema columns.
Only one of row_label_columns and row_labels can be passed into a load() or insert() call.
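The rules for the two label parameters can be checked on the client before an ingest is attempted. The sketch below is a hypothetical pre-flight check, not part of xgt; header_has_names is an assumed flag that is True for the NORMAL and STRICT header modes and False for NONE and IGNORE.

```python
def label_arguments(header_has_names, row_labels=None, row_label_columns=None):
    """Validate label parameters and return them as keyword arguments."""
    if row_labels is not None and row_label_columns is not None:
        raise ValueError("pass only one of row_labels and row_label_columns")
    if row_labels is not None:
        return {"row_labels": list(row_labels)}
    if row_label_columns is None:
        return {}
    # Column names are needed when the header is mapped, indices otherwise.
    expected = str if header_has_names else int
    if not all(isinstance(column, expected) for column in row_label_columns):
        raise TypeError(
            f"row_label_columns must contain only {expected.__name__} values")
    return {"row_label_columns": list(row_label_columns)}


by_index = label_arguments(False, row_label_columns=[2, 3, 4])
by_name = label_arguments(True, row_label_columns=["label_col1", "label_col2"])
```

The returned dictionary can be unpacked into a load() or insert() call with the ** operator.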
4.4.1.7. Reading Into Edge Frames
Edges created from an edge frame load() or insert() will contain references to their respective source and target vertex endpoints.
For row access control, the user will need access to read any endpoints referenced by an edge.
If any edge references a non-existent endpoint, the source or target endpoint will be created automatically. If access to the vertex frames is restricted, the user will need permission to create these endpoints regardless of whether any endpoints are actually created.
For row access control, row labels can be attached to these automatically created endpoints via the parameters source_vertex_row_labels and target_vertex_row_labels on load or insert.
The following example shows adding security labels this way.
a_frame.load(
    paths = '/data/path/my_edges.csv',
    source_vertex_row_labels = ["label2", "label3", "label9"],
    target_vertex_row_labels = ["label1", "label3", "label5"])
If the source and target vertex frames are the same for an edge frame, the source and target vertex security labels need to be the same on insert or load.
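The constraint on matching endpoint labels can be checked before the ingest call. This is a hypothetical client-side check, not an xgt function; the frame names are assumed to be known to the caller.

```python
def endpoint_label_arguments(source_frame, target_frame,
                             source_labels, target_labels):
    """Build endpoint label arguments, enforcing the same-frame rule."""
    if source_frame == target_frame and set(source_labels) != set(target_labels):
        raise ValueError(
            "source and target labels must match when both endpoints "
            "live in the same vertex frame")
    return {
        "source_vertex_row_labels": sorted(set(source_labels)),
        "target_vertex_row_labels": sorted(set(target_labels)),
    }


kwargs = endpoint_label_arguments(
    "example__People", "example__Companies",
    ["label2", "label3"], ["label1", "label3"])
```

The frame names example__People and example__Companies are placeholders chosen for this illustration.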
4.4.1.8. Handling Ingest Errors
A wide range of errors can prevent an ingest operation from completing as intended.
One type of error is the inability to find the file.
This can occur if an incorrect path to the file is provided, or if the user ID running the xGT server (xgtd) does not have read permission on the file when the xgtd:// protocol is specified.
If a correct path is given and the file permissions are satisfied, many possible errors remain, for example:
The frame’s schema could be wrong by having too many or too few columns
The frame’s schema could be wrong by having the wrong data type given for one or more of the columns
The CSV file may have a non-standard value delimiter (e.g., '|' rather than ',')
The process used to build the CSV file may have incorrectly formatted certain data types (e.g., not surrounding a string containing the delimiter character with double quotes)
With these kinds of ingestion errors based on the CSV format and frame’s schema, xGT may ingest only some of the rows in the CSV file. If that occurs, there will be a Python exception raised with a message describing the outcome of the ingest operation (e.g., the number of lines that led to ingest errors) and a detailed message with up to 10 errors explaining what xGT perceived went wrong. These error messages are produced based on limited understanding of what is really wrong. For example, it is not possible to know whether a row in the CSV is ill-formed or if the frame’s schema was created incorrectly. So, these error messages are intended to provide the user as much guidance as possible on how to quickly determine the cause of the problem.
If an ingest operation encounters no errors, all of the ingested data is added to the graph. If it does encounter one or more errors, xGT inserts every row that ingested without error and raises an exception to inform the user of the errors. In this case, the updated frame is ready to be used; it is just missing the rows for the lines that led to ingest errors. If every line leads to an error, the frame remains unchanged. This could happen, for example, if a frame is created with an incorrect schema.
4.4.1.9. Digging Deeper into Ingest Errors
If a user wishes to see more than 10 error messages, xGT does provide a way to access up to 1,000 messages. Note that for situations such as a bad schema, when trying to ingest a 1,000,000-row CSV file, there is no way to retrieve all 1,000,000 error messages.
To see the first 1,000 errors, the user’s script needs to catch the raised XgtIOError exception.
The exception object has a property that represents the job object for the ingest operation, and the first 1,000 errors are available from that job object.
These errors can be retrieved via get_data() or get_data_pandas().
The error rows will contain a list of lines that failed with corresponding error messages. Each row corresponds to a line that had an ingest error. The first column contains the error message. The second column contains the text of the line that failed to ingest, if available. The remaining columns contain the data that was encoded before the failure occurred. This will usually be empty since most errors occur before encoding.
try:
    job = a_frame.load('test.csv')
except xgt.XgtIOError as e:
    error_rows = e.job.get_data()
    # Either print the exception object and the error_rows data structure,
    # or perform some operations over the error_rows data structure.
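One possible operation over the error rows is to count how often each message occurs. The sketch below assumes the layout described above (error message in the first column, failing line text in the second) and uses made-up sample rows; summarize_errors is a hypothetical helper. If the messages embed line-specific details, they would need to be normalized before grouping.

```python
from collections import Counter


def summarize_errors(error_rows, limit=5):
    """Count error messages, most frequent first."""
    counts = Counter(row[0] for row in error_rows)
    return counts.most_common(limit)


# Made-up rows in the documented layout: [message, line text, ...].
sample_rows = [
    ["bad integer value", "x,val1"],
    ["bad integer value", "y,val2"],
    ["too few columns", "7"],
]
summary = summarize_errors(sample_rows)
for message, count in summary:
    print(f"{count:5d}  {message}")
```

In a real script, sample_rows would be replaced by the error_rows value obtained in the except block.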
4.4.2. Getting Data out of xGT
The save() method of a frame object requests that xGT write data to a CSV file on either the client or server filesystem.
The signature of the save method is save(path, offset=0, length=None, headers=False, ...), and it is called on a frame object in the same way as load.
There are only two variants of the path for a save method: the client filesystem and the server filesystem.
By default, the ordering of data on the server may not be preserved when saving, for performance reasons.
To guarantee the ordering, set the parameter preserve_order to True.
Note that if a frame has row security protection, not all data in a frame may be visible. As described in Security Labels, an authenticated user has security labels that change the visibility of data. A Python user can only see or egest frame rows whose security labels are a subset of the labels in the user’s label set (user labels).
4.4.2.1. Saving to the Client Filesystem
This method of saving data writes frame data to a file on the filesystem local to the client.
The user provides an absolute or relative path to the output file on the client’s local filesystem.
This is indicated by using either no path prefix or the xgt:// protocol prefix.
If no protocol is supplied, the default behavior is to use the xgt:// protocol.
frame.save('xgt:///data/path/example.csv')
frame.save('../data/path/example.csv')
frame.save('data/interesting.data.csv', offset = 10, length = 100, headers = True)
Note that this sends data from the server to the client before writing the file, which is much slower than having the server write the file to disk, even if the client is running on the same machine as the server.
It is recommended that only smaller datasets are saved directly to the client filesystem.
The third frame.save() example shows how to limit the number of rows saved.
It pulls at most 100 rows starting from row 10.
It also demonstrates how to include the column names as a header.
A saved file can be read directly into some other analytic tool such as MS Excel.
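The offset and length parameters can also be used to save a frame in several pieces. The sketch below only computes the (offset, length) windows and does not call xGT; row_windows is a hypothetical helper, and the total row count is assumed to be known to the caller.

```python
def row_windows(total_rows, chunk_size):
    """Yield (offset, length) pairs that cover total_rows rows."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for offset in range(0, total_rows, chunk_size):
        yield offset, min(chunk_size, total_rows - offset)


windows = list(row_windows(250, 100))
for offset, length in windows:
    # A matching save call might look like:
    # frame.save(f'part_{offset}.csv', offset = offset, length = length)
    print(offset, length)
```

Because row ordering is not guaranteed by default, passing preserve_order = True to each save call is probably needed for the pieces to be disjoint and complete.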
4.4.2.2. Saving to the Server Filesystem
This method of saving data is a request for the xGT server to write frame data to the remote filesystem where xGT is running.
The user provides a relative path on the server’s filesystem using the xgtd:// protocol.
As with loads, the actual location on the xGT server’s filesystem must be within the configured sandbox.
All paths will be relative to the root sandbox directory, and any path that escapes the sandbox is invalid.
frame.save('xgtd://data/path/example.csv')
frame.save('xgtd://data/interesting.data.csv', offset = 10, length = 1000000, headers = True)
This method should be used when the data is too large to transfer to the client in a reasonable time. The data can still be copied elsewhere after it is saved to the server filesystem. For large datasets it is usually faster to save the file on the server and copy it to the client system than to have xGT directly save the file to the client system.
4.4.2.3. Saving to Python
To easily get a small amount of frame data into Python, call the get_data() or get_data_pandas() method of a frame object.
This returns frame data as a list of lists or as a pandas DataFrame.
Data from these methods will always be in the same order as it is stored on the server.
4.4.2.4. Saving Row Labels
If a frame supports row labels, those labels can be egested along with the frame data by setting the parameter include_row_labels to True.
By default it is set to False and no labels are egested.
When retrieving data with get_data() and passing include_row_labels = True, the labels for each row are appended as columns following the data columns.
Each row will have one column per security label that is both in the frame’s row label universe (row_label_universe) and in the user’s label set.
If the corresponding label is not attached to that row, the column will be the empty string.
The example below shows retrieving data from a frame with a row label universe of “label1”, “label2”, “label4”.
# *conn* is a connection to the server.
a_frame = conn.create_vertex_frame(
    name = 'example__Name',
    schema = [['id', xgt.INT], ['name', xgt.TEXT]],
    key = 'id',
    row_label_universe = ["label1", "label2", "label4"])
a_frame.insert(
    data = [[0, "val1", "label1", ""], [1, "val2", "label2", "label4"], [2, "val1", "", ""]],
    row_label_columns = [2, 3])
data = a_frame.get_data(include_row_labels = True)
# data is:
# [[0, 'val1', 'label1', '', ''], [1, 'val2', '', 'label2', 'label4'], [2, 'val1', '', '', '']]
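Rows returned with label columns can be separated back into schema values and a set of labels. The sketch below works on a literal copy of the data shown above rather than on a live frame; split_row_labels is a hypothetical helper, and the schema length of 2 matches the example frame.

```python
def split_row_labels(row, schema_len):
    """Split one egested row into (schema values, set of attached labels)."""
    values = row[:schema_len]
    # Empty strings mark labels that are not attached to this row.
    labels = {label for label in row[schema_len:] if label}
    return values, labels


data = [[0, 'val1', 'label1', '', ''],
        [1, 'val2', '', 'label2', 'label4'],
        [2, 'val1', '', '', '']]
split = [split_row_labels(row, 2) for row in data]
```

Using a set reflects the fact that the order of labels on a row carries no meaning.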
When egesting with save(), security labels are written as additional columns to the CSV file if the parameter include_row_labels is set to True.
Each row in the file will have one column per security label in the frame’s row label universe visible to the user.
The parameter row_label_column_header can be passed in to set the header column name used for every label column in the file.
Otherwise, the default label column name is “ROWLABEL”.
This example egests a frame with row labels to a file.
# *conn* is a connection to the server.
a_frame = conn.create_vertex_frame(
    name = 'example__Name',
    schema = [['id', xgt.INT], ['name', xgt.TEXT]],
    key = 'id',
    row_label_universe = ["label1", "label2", "label4"])
a_frame.insert(
    data = [[0, "val1", "label1", ""], [1, "val2", "label2", "label4"], [2, "val1", "", ""]],
    row_label_columns = [2, 3])
a_frame.save("FrameOutput.csv", include_row_labels = True,
             row_label_column_header = "security_label")
The resulting content of “FrameOutput.csv” is shown below:
id, name, security_label, security_label, security_label
0, val1, label1, ,
1, val2, ,label2, label4
2, val1, , ,
The include_row_labels parameter can also be passed to get_data_pandas() so that the labels are returned as additional columns in the pandas DataFrame.
The optional parameter row_label_column_header is used to set the name of the label columns, which is by default “ROWLABEL”.
pandas_data_frame = a_frame.get_data_pandas(include_row_labels = True)
4.4.3. Performance Considerations
For filesystem loads and saves, reading or writing from the server is orders of magnitude faster than from the client. This is true even if the client is running on the same machine as the server. The server is able to employ parallelism that isn’t available in the Python client when loading and saving files. Additionally, data must be serialized, sent through a gRPC channel, and deserialized when transferring between client and server.
4.4.4. S3 Credentials
When loading a file using the load() method and an s3:// protocol prefix, xGT checks the xgtd user’s home directory for the .aws/credentials file with the variables aws_access_key_id and aws_secret_access_key.
If the credentials file doesn’t exist, xGT checks for the configuration variables aws.access_key_id and aws.secret_access_key.
The online AWS Access Keys document explains what these keys are and contains references to learn how to create and manage them.
(These values may also be specified at runtime in user code by passing them to a Connection object.)
When reading the .aws/credentials file, xGT also supports profile selection via the AWS_PROFILE environment variable.
If no environment variable is found, xGT will use the default profile.
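To check which keys a given profile would yield, the credentials file can be read with Python's standard configparser module. This is a sketch of the lookup described above, not xGT's own code; read_aws_credentials is a hypothetical helper and the keys in the sample file are placeholders.

```python
import configparser
import os
import tempfile


def read_aws_credentials(credentials_path, environ=None):
    """Return (access key id, secret key) for the selected profile, or None."""
    environ = os.environ if environ is None else environ
    # AWS_PROFILE selects the profile; otherwise "default" is used.
    profile = environ.get("AWS_PROFILE", "default")
    parser = configparser.ConfigParser()
    if not parser.read(credentials_path) or not parser.has_section(profile):
        return None
    section = parser[profile]
    key_id = section.get("aws_access_key_id")
    secret = section.get("aws_secret_access_key")
    if key_id is None or secret is None:
        return None
    return key_id, secret


# Demonstration against a temporary file with placeholder values.
with tempfile.TemporaryDirectory() as directory:
    sample_path = os.path.join(directory, "credentials")
    with open(sample_path, "w") as handle:
        handle.write("[default]\n"
                     "aws_access_key_id = PLACEHOLDER_ID\n"
                     "aws_secret_access_key = PLACEHOLDER_SECRET\n")
    default_keys = read_aws_credentials(sample_path, environ={})
```

On a real system the path would be the .aws/credentials file in the home directory of the user running xgtd.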