Managing data in xGT
The xGT tool implements a strongly typed, property graph data model. The fundamental building blocks of this model are frames: tabular representations of data with prescribed data types. In order to create a frame, a user specifies the names and data types of each property (called a schema) as well as any other information specific to that structure (for instance, the key property for a vertex frame).
Data is loaded directly into an existing frame. xGT supports a variety of data sources and loading options.
Users can get frame objects either from the output of the
create_*_frame methods or by explicitly calling the
get_*_frame method with the name or names of the frames the user wants.
# *conn* is a connection to the server.
rep_to = conn.get_edge_frame('ReportsTo')
The rest of this document assumes you have set up a graph component in a Python variable in a way that is similar to
Getting data into xGT
There are two ways to get data into xGT: across a network and from a filesystem.
All forms of getting data into xGT can be expressed by calling the
load() method in the
The signature of the load method is:
entity.load(paths, headerMode=xgt.HeaderMode.NONE), where
entity may be a table, vertex, or edge frame.
The paths parameter describes where xGT can find the CSV file or files. Each path may carry an optional protocol prefix that indicates which system the file is located on.
The headerMode parameter is a flag indicating whether the CSV data sources begin with a header line (i.e., column names) and how that line should be handled.
There are four modes:
xgt.HeaderMode.NONE means there is no header line in the CSV file.
xgt.HeaderMode.IGNORE means there is a header line, but the server should ignore it entirely.
xgt.HeaderMode.NORMAL means map the header to the schema in a relaxed way. Columns that cannot be mapped to property names are ignored, and properties with no mapped column are filled with null values.
xgt.HeaderMode.STRICT means xGT will raise an exception if the schema isn't fully mapped or if the file contains columns that are not in the schema. A header line can be made to work with "strict" mode by using the special string IGNORE as a column name, which skips that column instead of producing an error.
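The difference between the relaxed and strict modes can be illustrated with a small stand-alone sketch. It does not use xGT and is not xGT's implementation: the `HeaderMode` enum here is a local stand-in for `xgt.HeaderMode`, `map_columns` is a hypothetical helper, and the positional mapping used for the NONE and IGNORE modes is an assumption.

```python
from enum import Enum


class HeaderMode(Enum):
    """Local stand-in for xgt.HeaderMode so the sketch runs without xGT."""
    NONE = "none"
    IGNORE = "ignore"
    NORMAL = "normal"
    STRICT = "strict"


def map_columns(header, schema, mode):
    """Map each schema property to the CSV column index it is read from.

    A value of None means the property would be filled with null.
    Hypothetical helper for illustration only.
    """
    if mode in (HeaderMode.NONE, HeaderMode.IGNORE):
        # Assumption: without a usable header, columns are taken by position.
        return {name: i for i, name in enumerate(schema)}

    if mode is HeaderMode.STRICT:
        # Columns named IGNORE are skipped rather than reported as errors.
        named = [c for c in header if c != "IGNORE"]
        missing = [p for p in schema if p not in named]
        extra = [c for c in named if c not in schema]
        if missing or extra:
            raise ValueError(
                f"unmapped properties {missing}, unknown columns {extra}"
            )

    # NORMAL, or a STRICT header that passed the check: map by name.
    return {p: header.index(p) if p in header else None for p in schema}
```

With a schema of `["id", "name", "age"]`, the header `["name", "id", "extra"]` maps in NORMAL mode with `age` left null, while the same header raises in STRICT mode. Renaming the unwanted column to `IGNORE` and supplying an `age` column makes the header acceptable in STRICT mode.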
1. Reading from the client filesystem
This method is the most straightforward.
The path parameter is simply an absolute or relative path on the client's local filesystem.
vertex_frame.load('/data/path/my_vertices.csv')
edge_frame.load(['xgt:///data/path/my_edges.csv', '/data/path/my_other_edges.csv'])
These load calls request
xgt to search the local filesystem for the files, then send them to the xGT server.
Note that if the server is running on a separate system, the data will be shipped over the network as part of these method calls.
If there is no protocol supplied, the default behavior is to use the
xgt:// protocol, which indicates the client should look on its own local filesystem for the file specified in the path and send that to the xGT server.
2. Reading from the server filesystem
This method is essentially a request for the xGT server to look for and ingest the CSV files from the remote filesystem where xGT is running.
The path is preceded with the
xgtd:// protocol telling
xgt to pass off the request to the xGT server.
frame.load('xgtd://data/path/example.csv')
frame.load(['xgtd://data/path/myfirst.csv', 'xgtd://data/path/mysecond.csv'])
3. Reading from a URL
This method asks the xGT server to retrieve CSV-formatted data from a URL.
The protocol can be either http:// or https://.
frame.load('http://www.example.com/data/path/example.csv')
frame.load(['http://www.example.com/data/myfirst.csv', 'https://www.example.com/data/mysecond.csv'])
4. Reading from an AWS S3 bucket
This method asks the xGT server to pull CSV-formatted data directly from an AWS S3 bucket and uses the s3:// protocol.
frame.load('s3://my-s3-bucket/data/path/example.csv')
frame.load(['s3://my-s3-bucket/data/myfirst.csv', 's3://my-s3-bucket/data/mysecond.csv'])
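The path forms described in sections 1 through 4 can be summarized with a small sketch that reports which side reads a given path. The function `describe_source` is a hypothetical helper and not part of xGT. It only inspects the prefix of the string, and rejecting unknown protocols is a choice made for this sketch rather than documented xGT behavior.

```python
def describe_source(path):
    """Return (protocol, reader) for a path as it would be passed to load().

    Hypothetical helper for illustration; it does not contact xGT.
    """
    readers = {
        "xgt": "client filesystem",
        "xgtd": "server filesystem",
        "http": "URL retrieved by the server",
        "https": "URL retrieved by the server",
        "s3": "AWS S3 bucket read by the server",
    }
    protocol, separator, _ = path.partition("://")
    if not separator:
        # No protocol supplied: the default is the xgt:// protocol.
        protocol = "xgt"
    if protocol not in readers:
        raise ValueError(f"unrecognized protocol: {protocol}")
    return protocol, readers[protocol]
```

For example, `describe_source('/data/path/my_vertices.csv')` falls back to the `xgt` protocol and the client filesystem, while `describe_source('xgtd://data/path/example.csv')` reports the server filesystem.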
5. Handling ingest errors
If an error occurs while ingesting a valid file, the CSV reader creates an error table and stores its name in the error_frame_name property of the frame being loaded.
The error table will contain a list of lines that failed with corresponding error messages.
Python will raise an exception containing the table name when this happens.
At this point the error table can be read.
Each row of the table corresponds to a line that failed to read. The first column contains the error message. The second column contains the line that failed to ingest if available. The next columns contain the data that was encoded before the failure occurred. This will usually be empty since most errors occur before encoding.
frame = conn.get_table_frame('Employee')
frame.load('test.csv')
... # An error occurs and an exception is raised during the load.
error_table = conn.get_table_frame(frame.error_frame_name)
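One way to package this pattern is a small wrapper that attempts the load and hands back the error table when the load fails. This is a minimal sketch: `load_and_get_errors` is a hypothetical helper, it relies only on the `load()`, `get_table_frame()`, and `error_frame_name` names shown above, and it catches `Exception` broadly because the specific exception class is not named here.

```python
def load_and_get_errors(conn, frame, paths):
    """Load paths into frame and return the error table frame on failure.

    Returns None when the load succeeds. Hypothetical helper for illustration.
    """
    try:
        frame.load(paths)
    except Exception:
        # Assumption: the specific exception class is not documented here,
        # so this sketch catches broadly and checks for an error table name.
        error_name = getattr(frame, "error_frame_name", None)
        if not error_name:
            # No error table was recorded, so this is some other failure.
            raise
        return conn.get_table_frame(error_name)
    return None
```

A caller would then check whether the returned value is None before inspecting the error table's rows.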
Getting data out of xGT
The save() method asks xGT to write data to a CSV file on either the client or server filesystem.
The signature of the save method is:
entity.save(path, offset=0, length=None, headers=False), where
entity may be a table, vertex, or edge frame.
This is the preferred method for extracting the result of a query (saving from the result table of a
There are only two variants of the path for a save method.
1. Saving to the client filesystem
This method simply provides an absolute or relative path to the client's local filesystem.
frame.save('../data/path/example.csv')
frame.save('../data/interesting.data.csv', offset=10, length=100, headers=True)
Note that the second
frame.save() pulls at most 100 rows, starting from row 10 of the frame's table, and creates a CSV file with the column names included.
This file can be read directly into some other analytic tool such as MS Excel.
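The effect of the offset, length, and headers parameters can be sketched on an in-memory list of rows. This is not how xGT writes files: `rows_to_csv` is a hypothetical helper built on Python's standard csv module, and it assumes that offset is zero-based, which the default of offset=0 suggests but which is not stated explicitly here.

```python
import csv
import io


def rows_to_csv(column_names, rows, offset=0, length=None, headers=False):
    """Render a window of rows as CSV text, mimicking the save() parameters.

    Hypothetical helper; assumes offset counts rows from zero.
    """
    # length=None means "through the end of the data".
    end = None if length is None else offset + length
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    if headers:
        writer.writerow(column_names)
    # Slicing returns at most `length` rows, fewer if the data runs out.
    writer.writerows(rows[offset:end])
    return buffer.getvalue()
```

Called with offset=10, length=100, and headers=True on 200 rows, the sketch produces one header line followed by 100 data lines. On a 50-row input it produces the header and the 40 rows that remain after the offset.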
2. Saving to the server filesystem
This method provides an absolute or relative path on the server's filesystem as indicated by the xgtd:// protocol.
frame.save('xgtd://../data/path/example.csv')
frame.save('xgtd://../data/interesting.data.csv', offset=10, length=1000000, headers=True)
Use this method when the data is too large to transfer to the client. The saved file can still be copied elsewhere after it is written to the server filesystem.