2.5. Data Movement

The xGT tool implements a strongly typed property graph data model. The fundamental building blocks of this model are frames: tabular representations of data with prescribed data types. To create a frame, a user specifies the name and data type of each property (together called a schema) as well as any other information specific to that structure (for instance, the key property for a vertex frame). For more details see: create_table_frame(), create_vertex_frame(), and create_edge_frame().

When a user wants to load data, they do so via an existing frame. xGT provides support for a variety of different data sources and loading options on these frames. Users can get frame objects either from the output of the create_*_frame() methods or by explicitly calling the get_frame() method with the name or names of the frames the user wants. More detailed information about getting frames can be found in the API documentation: get_frame() and get_frames().

Here is an example of getting an edge frame and assigning it to a frame object named a_frame:

# *conn* is a connection to the server.
a_frame = conn.get_frame('Name')

In addition to supporting direct data transfers to frames via the xGT Python client API, the xgtd server also offers an Apache Arrow Flight endpoint. Frames are mapped to individual Arrow Flight objects. More information is available in Section Arrow Flight Server Endpoint.

2.5.1. Getting Data into xGT

There are multiple ways to get data into xGT: from a filesystem, from a network location, and from a Python object. Getting data into xGT from a filesystem or network location can be expressed by calling load() on a respective frame object. Getting data from a Python object is accomplished by calling insert() on a respective frame object.

The signature of the load method is: load(paths, header_mode = xgt.HeaderMode.NONE, ...) and is called on a frame object returned from the create_*_frame() or get_frame() methods. Here is an example of this:

a_frame = conn.get_frame('Name')
job = a_frame.load('/data/path/my_edges.csv')

The load method can load CSV files, Parquet files, or compressed CSV files. However, there are some limitations for compressed CSV files (see Compressed CSV Files). The load method also provides some flexibility for ingesting a CSV file by location and via header rules. The paths parameter describes where xGT can find the CSV file or files and uses an optional protocol prefix to indicate which system the files are located on: local to the script, local to the server, or an external server. Mixing and matching protocols in a single load is allowed; however, local paths, server paths, and remote URLs will run as separate transactions. On success, the last run job will be returned. The server protocol supports specifying a directory to read, which will recursively read all files in that directory.

Wildcards are allowed when reading paths on the client or server filesystems. The supported wildcard operations are:

  • * matches 0 or more characters.

  • ? matches a single character.

  • [acde] or [ac-e] matches one character from the given set. The - operator specifies a range of characters to match; for example, a-z matches any lowercase letter.

  • [!0-9] matches any single character except those in the given set.
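These wildcard rules follow standard glob semantics, which can be previewed client-side with Python's stdlib fnmatch module (illustrative only; this is not part of the xGT API):

```python
from fnmatch import fnmatch

# * matches zero or more characters.
assert fnmatch('my_edges.csv', '*edges.csv')
# ? matches exactly one character.
assert fnmatch('edge1.csv', 'edge?.csv')
# [ac-e] matches one character from the set {a, c, d, e}.
assert fnmatch('a.csv', '[ac-e].csv')
# [!0-9] matches any single character except a digit.
assert not fnmatch('5.csv', '[!0-9].csv')
```
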

The load method will return a completed job that indicates the status of the load.

2.5.1.1. Loading from the Client Filesystem

This method of loading data reads files from the filesystem local to the client. The paths parameter is simply an absolute or relative path on the client’s local filesystem; a relative path is relative to the working directory of the Python script. Client paths are indicated by using either no protocol prefix or the xgt:// protocol prefix. If no protocol is supplied, the default behavior is to use the xgt:// protocol. Giving a directory to load is not supported for client paths, but reading all files in a single directory without recursion can be done via wildcards such as myDirectory/*.

a_frame.load('/data/path/my_vertices.csv')
another_frame.load('xgt:///data/path/my_edges.csv', '/data/path/my_other_edges.csv')
wildcard_frame.load('xgt:///data/path/*edges.csv')

These load calls request the xgt client to search the local filesystem for the files and then send them to the xGT server, so the data travels across the network when the server is remote. Reading from the server filesystem is much faster than loading from the client filesystem because it avoids transmitting the data across the network. It is strongly recommended to load large files from the server filesystem.

2.5.1.2. Loading from the Server Filesystem

This method of loading data is a request for the xGT server to read files from the remote filesystem where xGT is running. The path is preceded with the xgtd:// protocol telling xgt to pass off the request to the xGT server. Note that the actual location on the xGT server’s filesystem must be within the configured sandbox (see Enterprise Data Movement). All paths will be relative to the root sandbox directory, and any path that escapes the sandbox is invalid. The default sandbox location is /srv/xgtd/data/ but can be configured to be different by your system administrator. Consult your system administrator if you do not know the location of or have access to the sandbox directory.

a_frame.load('xgtd://data/path/example.csv')
another_frame.load(['xgtd://data/path/myfirst.csv', 'xgtd://data/path/mysecond.csv'])
wildcard_frame.load('xgtd:///data/path/*edges.csv')
directory_frame.load('xgtd:///data/path/')

2.5.1.3. Loading from a URL

This method of loading data asks the xGT server to retrieve CSV-formatted or Parquet data from a URL. The protocol can be http://, https://, ftp://, or ftps://. Wildcards are not supported for URLs.

a_frame.load('http://www.example.com/data/path/example.csv')
another_frame.load(['http://www.example.com/data/myfirst.csv', 'https://www.example.com/data/mysecond.csv'])

2.5.1.4. Loading from an AWS S3 Bucket

This method of loading data asks the xGT server to pull CSV-formatted or Parquet data directly from an AWS S3 bucket and uses the s3:// protocol. Wildcards are not supported for S3.

a_frame.load('s3://my-s3-bucket/data/path/example.csv')
another_frame.load(['s3://my-s3-bucket/data/myfirst.csv', 's3://my-s3-bucket/data/mysecond.csv'])

You will need to have S3 credentials set for this (see S3 Credentials).

2.5.1.5. Inserting from the Client

Data can be inserted from Python lists, pandas DataFrames, or Apache Arrow Tables using the insert() method of a frame object.

a_frame.insert([[0, 0, "val1"], [1, 0, "val2"], [1, 5, "val3"]])

Cypher lists can be inserted by using a Python list, pandas array, or Arrow Array as a row element. In the following example the second element of the row is a list of strings:

a_frame.insert([[0, ["val1", "val2"]], [1, ["val2", "val3"]], [2, ["val3", "val4"]]])

The following table indicates the supported mappings from Python types to xGT types:

Python to xGT Type Mapping

xGT Type          Python Types
----------------  --------------------------------------------
Boolean           bool
Integer           int
Unsigned Integer  int [1]
Float             float, decimal.Decimal
Date              datetime.date
Time              datetime.time
Datetime          datetime.datetime
Duration          datetime.timedelta
Ipaddress         ipaddress.IPv4Address, ipaddress.IPv6Address
Text              str
List              list
CartesianPoint    list [2]
WGS84Point        list [2]

A Python None value inserts a null for any xGT type.

The following table indicates the supported mappings from pandas types to xGT types:

Pandas to xGT Type Mapping

xGT Type          Pandas Types
----------------  --------------------------------------------
Boolean           boolean
Integer           int32, Int32, int64, Int64 [3]
Unsigned Integer  uint32, UInt32, uint64, UInt64 [3]
Float             float32, Float32, float64, Float64
Datetime          datetime64[ns]
Duration          timedelta64[ns]
Text              str, string
List              array
CartesianPoint    array [4]
WGS84Point        array [4]

None, NaN, and NaT values insert a null for any xGT type.

As Python lists can store any type, they can hold pandas types in addition to Python types. Pandas Series can also store any type and can hold Python types in addition to pandas types. Because pandas doesn’t have native date, time, or IP address types, the native Python types must be used when inserting from a pandas DataFrame.

See Arrow to xGT Type Mapping for mappings from Arrow Flight to xGT types.

2.5.1.6. Loading CSV Files

When loading from a CSV file, the header_mode and column_mapping parameters control how the data in the CSV file is ingested into the frame. The header_mode parameter is a flag indicating whether the CSV data sources contain a first line that is a header (i.e., column names) and how it should be handled. There are four modes:

  • xgt.HeaderMode.NONE means there is no header line in the CSV file.

  • xgt.HeaderMode.IGNORE means there is a header line, but the server should ignore it entirely.

  • xgt.HeaderMode.NORMAL means map the header to the schema in a relaxed way. If the column names in the header match the property names of the frame’s schema, then data will be automatically ingested into these properties. It will ignore columns it can’t map to property names and fill in columns that aren’t mapped with null values.

  • xgt.HeaderMode.STRICT (deprecated) means xGT will raise an exception if the schema isn’t fully mapped or if additional columns exist in the file and not in the schema. The header line can be modified to work with “strict” mode by using the special string IGNORE in the header to ignore a column instead of producing an error.

Consider this example:

id,firstName,lastName,gender,birthday
01,Alice,Roberts,unspecified,1965-04-10

This example is shown using the comma as the CSV delimiter.

If the schema has id,lastName,firstName,birthday, then the NORMAL mode will run without errors, aligning the third file column with the second schema property, the second file column with the third schema property, and ignoring the “gender” column. If, however, the header mode is STRICT, an exception would be raised due to the inability to fully map the CSV columns to the frame’s schema properties. For the STRICT header mode, this schema would work fine: id,lastName,firstName,birthday,gender.
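The relaxed mapping of NORMAL mode can be sketched in plain Python (a simplified model of the behavior described above, not xGT's implementation):

```python
def normal_mode_map(header, schema, row):
    # For each schema property, take the matching header column if one
    # exists; properties with no matching column are filled with null (None),
    # and file columns with no matching property are ignored.
    col_of = {name: i for i, name in enumerate(header)}
    return [row[col_of[prop]] if prop in col_of else None for prop in schema]

header = ['id', 'firstName', 'lastName', 'gender', 'birthday']
schema = ['id', 'lastName', 'firstName', 'birthday']
row = ['01', 'Alice', 'Roberts', 'unspecified', '1965-04-10']

# 'gender' is ignored; lastName and firstName are realigned by name.
normal_mode_map(header, schema, row)  # ['01', 'Roberts', 'Alice', '1965-04-10']
```
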

The column_mapping parameter allows the user to specify which file column should be read into which property of the frame. This is useful if the file has no header or if the header column names do not match the names of the frame’s properties.

For the following examples using column_mapping, assume a frame with the schema id,lastName,firstName,birthday and a file with the following header:

DOB,firstName,surname,location,ID

The names in the header are different than the properties of the schema, but the file can still be loaded correctly into the frame as shown below. column_mapping should be a dictionary whose keys are names of the frame property and whose values indicate which file column should be read into that property. In this example, the file column names from the file header are used:

column_map = { 'id' : 'ID', 'lastName' : 'surname', 'firstName' : 'firstName', 'birthday' : 'DOB' }
a_frame.load('example.csv', header_mode = xgt.HeaderMode.NORMAL, column_mapping = column_map)

Instead of using the name of the file column, the position of the column in the file (0-indexed) can be used:

column_map = { 'id' : 4, 'lastName' : 2, 'firstName' : 1, 'birthday' : 0 }
a_frame.load('example.csv', column_mapping = column_map)

The dictionary can also contain a mix of names and positions:

column_map = { 'id' : 4, 'lastName' : 'surname', 'firstName' : 'firstName', 'birthday' : 0 }
a_frame.load('example.csv', header_mode = xgt.HeaderMode.NORMAL, column_mapping = column_map)

The same data column can be inserted into more than one frame property. Below, the second file column is inserted into both the “lastName” and “firstName” properties.

column_map = { 'id' : 4, 'lastName' : 1, 'firstName' : 1, 'birthday' : 0 }
a_frame.load('example.csv', column_mapping = column_map)

If the column_mapping parameter is used, it should contain all frame properties into which data should be inserted. If any frame column is missing from this dictionary, null values will be inserted into it for all ingested rows. This is the case even if the file header contains a column name that matches a property name. In the example below, the “firstName” property will be null for all ingested rows:

column_map = { 'id' : 'ID', 'lastName' : 'surname', 'birthday' : 'DOB' }
a_frame.load('example.csv', header_mode = xgt.HeaderMode.NORMAL, column_mapping = column_map)

Note that the use of column_mapping with header mode STRICT is not supported. If column_mapping contains any file column names, these are obtained from the header. Therefore, in this case the header mode NORMAL must be used. If only file column positions are used, the header mode can be NORMAL, IGNORE, or NONE.
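When column_mapping contains file column names, they must be resolved against the header, which is why header mode NORMAL is required in that case. The resolution can be sketched with a hypothetical helper (illustrative Python, not part of the xGT API):

```python
def resolve_column_mapping(column_mapping, header=None):
    # Map each frame property to a 0-indexed file column. Integer values
    # are used as-is; string values are looked up in the file header.
    resolved = {}
    for prop, col in column_mapping.items():
        if isinstance(col, int):
            resolved[prop] = col
        elif header is None:
            raise ValueError('column names require a header line')
        else:
            resolved[prop] = header.index(col)
    return resolved

header = ['DOB', 'firstName', 'surname', 'location', 'ID']
mapping = {'id': 4, 'lastName': 'surname', 'firstName': 'firstName', 'birthday': 0}
resolve_column_mapping(mapping, header)
# {'id': 4, 'lastName': 2, 'firstName': 1, 'birthday': 0}
```
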

If column_mapping is not passed in and header mode IGNORE or NONE is used, then xGT will use the order of columns in the file to ingest data. The first file column will be inserted into the first property of the frame’s schema, the second into the second, and so on. The order of the schema is determined when the frame is created and can be seen in schema. Note that this automatic mapping may also be affected if the row_label_columns parameter is used to designate certain file columns as security label because any file column cannot be used as both a frame data column and a security label column (see Section Setting Row Labels for adding security labels).

The expected CSV delimiter is by default a comma, but can be set with the delimiter parameter.

The last thing to note is that load will ignore any empty lines it finds in the file.

2.5.1.6.1. Compressed CSV Files

xGT supports loading CSV files with two types of compression: bzip2 and gzip. The files need to have .bz2 or .gz extensions to identify the file types. The compressed files can be loaded from the server filesystem using the xgtd:// protocol or from other network locations such as the web or S3; however, loading compressed files from the client filesystem is currently not supported. In general, there is limited parallelism when decompressing compressed files. If parallelism is needed, split the data into multiple files and read the files simultaneously.

2.5.1.7. Loading Parquet Files

Parquet is the columnar storage format of the Hadoop ecosystem. Data in this format can be compressed more easily due to its columnar nature, so the format is used by many big-data processing frameworks.

Currently, xGT supports loading Parquet files from the server filesystem, the client filesystem, URLs, and S3. In general, Parquet’s various internal compression codecs are supported, such as bz2, zlib, lz4, snappy, zstd, and Brotli. These files require the Parquet extension (.parquet) in order to be identified by xGT.

xGT allows for loading Parquet files where the Parquet schema is longer than the xGT frame schema. Any additional columns will be ignored. In cases where the schema is shorter or needs to be mapped, xGT supports using the column_mapping parameter when loading Parquet files to allow mapping columns in a Parquet file to columns in a frame. A detailed explanation of how mapping works can be found in Loading CSV Files. When ingesting Parquet, xGT uses the schema column names to map instead of a CSV header. Since the header modes are ignored, they do not affect the mapping behavior.

xGT supports loading Parquet data into all xGT types except the point types. The table below shows the mapping of Parquet data types to xGT types. All the types and mappings in the table below are also supported for the nested list type. The following Parquet types are not supported: int96 and byte_array. The following Parquet logical types are not supported: enumeration, UUID, JSON, BSON, and maps.

The following table indicates the supported mappings from Parquet extended types (types and logical types) to xGT types:

Parquet to xGT Type Mapping

xGT Type          Parquet Types
----------------  ----------------------------------------------------
Boolean           Boolean
Integer           Signed/Unsigned Integers [5], Decimal (128 bit only)
Unsigned Integer  Signed/Unsigned Integers [6], Decimal (128 bit only)
Float             Float/Double
Date              Date
Time              Time
Datetime          Date Time
Duration          Interval [7]
Text              String
List              List

A Parquet Null value inserts a null for any xGT type.

Parquet files are divided into metadata and data. Data is partitioned into chunks of rows called row groups. To allow for parallel reading of the files, use row groups of at least 1,000 rows. There is a known issue where files over roughly 5 gigabytes fail with Apache Arrow when the metadata becomes too large; split up larger files or increase the row group size to avoid this issue. xGT will validate the schema conversion before ingesting the data. However, in some cases of missing data, individual errors for specific rows may be returned (e.g., empty key columns for an edge). In that case, the line number will be relative to the row group number.

Some parameters such as header mode or delimiter are ignored when reading Parquet files.

2.5.1.8. Encoding Text to Types

xGT supports inserting text input columns into any type. For temporal types the ISO 8601 string formats are supported. The following table gives a summary of the supported text formats for each xGT type.

Supported Text Formats

xGT Type      Text Formats
------------  ------------------------------------------------------------
Boolean       true (case insensitive)
              false (case insensitive)
              1
              0

Unsigned Int  Valid positive integer strings.

Int           Valid integer strings with digits and an optional leading
              positive or negative sign.

Float         Valid floating point strings using digits, an optional
              decimal point, an optional exponent, and an optional leading
              positive or negative sign.
              nan (case insensitive)
              [+][-]inf (case insensitive)

Date          [+][-]YYYY-MM-DD
              [+][-]YYYY-MM-DDTHH:MM[:SS[.ssssss]]
              [+][-]YYYY-MM-DDTHH:MM[:SS[.ssssss]]Z
              [+][-]YYYY-MM-DDTHH:MM[:SS[.ssssss]]+HH[:MM]
              [+][-]YYYY-MM-DDTHH:MM[:SS[.ssssss]]-HH[:MM]
              [+][-]YYYY-MM-DD HH:MM[:SS[.ssssss]]
              [+][-]YYYY-MM-DD HH:MM[:SS[.ssssss]]Z
              [+][-]YYYY-MM-DD HH:MM[:SS[.ssssss]]+HH[:MM]
              [+][-]YYYY-MM-DD HH:MM[:SS[.ssssss]]-HH[:MM]

Time          [T]HH:MM[:SS[.ssssss]]
              [T]HH:MM[:SS[.ssssss]]Z
              [T]HH:MM[:SS[.ssssss]]+HH[:MM]
              [T]HH:MM[:SS[.ssssss]]-HH[:MM]

Datetime      [+][-]YYYY-MM-DD
              [+][-]YYYY-MM-DDTHH:MM[:SS[.ssssss]]
              [+][-]YYYY-MM-DDTHH:MM[:SS[.ssssss]]Z
              [+][-]YYYY-MM-DDTHH:MM[:SS[.ssssss]]+HH[:MM]
              [+][-]YYYY-MM-DDTHH:MM[:SS[.ssssss]]-HH[:MM]
              [+][-]YYYY-MM-DD HH:MM[:SS[.ssssss]]
              [+][-]YYYY-MM-DD HH:MM[:SS[.ssssss]]Z
              [+][-]YYYY-MM-DD HH:MM[:SS[.ssssss]]+HH[:MM]
              [+][-]YYYY-MM-DD HH:MM[:SS[.ssssss]]-HH[:MM]

Duration      [+][-]P[nW][nD][T[nH][nM][nS]]
              [+][-]D day[s][,] [+]H:M:S[.s]
              [+][-]H:M:S[.s]

IP Address    XXX.XXX.XXX.XXX
              0x0x0x0x:0x0x0x0x:0x0x0x0x:0x0x0x0x:0x0x0x0x:0x0x0x0x:0x0x0x0x:0x0x0x0x

2.5.1.8.1. Boolean

Examples of valid boolean strings are TruE, 0, and false.

2.5.1.8.2. Int

Examples of valid integer strings are 42, +56, and -1726.

2.5.1.8.3. Unsigned Int

Examples of valid unsigned integer strings are 0 and 42.

2.5.1.8.4. Float

Float types are 32-bit, so they only have about 7 digits of decimal precision. Strings with more digits of precision are valid, but they will be truncated to the precision able to fit in the 32-bit float.

Examples of valid float strings are +4.726, .5728e10, 0.162E-8, Inf, and NaN.

2.5.1.8.5. Date

A date string is of the form [+][-]YYYY-MM-DD and must use 4 digits for the year and 2 digits for the month and day. A datetime string can be encoded into a date, which results in the time portion being truncated. Datetime strings are described below.

An example of a valid date string is 2018-02-20.

2.5.1.8.6. Time

A time string is of the form [T]HH:MM[:SS[.ssssss]] and must use 2 digits for the hours and minutes. The seconds are optional and must be 2 digits if given. Time strings can contain optional fractional seconds between 1 and 19 digits. For strings with 7 or more digits, the fractional seconds will be rounded to microsecond precision.
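The rounding of long fractional seconds to microseconds can be modeled as follows (a stdlib sketch of the rule described above, not xGT code):

```python
def fraction_to_microseconds(frac):
    # frac is the digit string after the decimal point (1 to 19 digits);
    # scale it to microseconds and round.
    return round(int(frac) * 1_000_000 / 10 ** len(frac))

fraction_to_microseconds('0206')                 # 20600 microseconds
fraction_to_microseconds('1234567890123456789')  # rounds to 123457
```
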

Although times do not store time zones, time strings can contain them. Times with time zones are converted to Coordinated Universal Time (UTC) and stored without time zone information. The time zone string can be Z, indicating UTC. It can also be of the form +HH[:MM] or -HH[:MM]: a positive or negative sign followed by 2 digits for the hour offset and an optional colon and 2 digits for the minute offset. The hour offset must be between -14 and 14, and the minute offset must be in 15-minute increments.

Examples of valid time strings are 06:10:50, T11:42:50.0206, 06:10:50-01, and 11:42:50.0206+01:15.
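The UTC conversion for zoned time strings works like this stdlib sketch, which attaches an arbitrary date because Python's time objects cannot shift zones on their own (xGT performs the equivalent arithmetic server-side):

```python
from datetime import datetime, timezone

# '06:10:50-01' in the formats above denotes 06:10:50 at a -01:00 offset;
# converting to UTC yields 07:10:50.
t = datetime.fromisoformat('2000-01-01T06:10:50-01:00')
utc_time = t.astimezone(timezone.utc).time()
str(utc_time)  # '07:10:50'
```
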

2.5.1.8.7. Datetime

A datetime string is composed of a valid date string and a valid time string separated by either a T or a space. A date string can be encoded to a datetime, which results in a datetime using the given date and a time of 00:00:00.

Examples of valid datetime strings are 2018-12-20T06:10:50, 2018-12-20 06:10:50.02006, and 2018-12-20T06:10:50.02006+01:15.

2.5.1.8.8. Duration

A duration string can be in the ISO 8601 format, the Python timedelta format, or a time format.

The ISO 8601 format is of the form [+][-]P[nW][nD][T[nH][nM][nS]] where n is a value for a component and the component designators are:

  • W: weeks

  • D: days

  • H: hours

  • M: minutes

  • S: seconds

The P is required. Each component is composed of a numerical value followed by a letter designator. The numerical values do not have to fit into standard ranges for the components. For instance, giving 112 for the number of hours is valid. In general all the components are optional, but at least one must be given. If there is a time component, a T must start that section. The seconds component can have fractional seconds between 1 and 19 digits. For strings with 7 or more digits, the fractional seconds will be rounded to microsecond precision. A leading - indicates a negative duration and is applied to the entire duration. xGT does not support giving years or months as part of an ISO duration string as those values lead to ambiguous lengths of time.

Examples of valid ISO duration strings are P1W2D, -PT1H40S, and P100DT1H123M20.001234S.
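A minimal parser for the ISO duration form described above (an illustrative sketch, not xGT's parser; it skips some validation, such as requiring at least one component):

```python
import re
from datetime import timedelta

_ISO = re.compile(
    r'(?P<sign>[+-]?)P(?:(?P<w>\d+)W)?(?:(?P<d>\d+)D)?'
    r'(?:T(?:(?P<h>\d+)H)?(?:(?P<m>\d+)M)?(?:(?P<s>\d+(?:\.\d+)?)S)?)?$')

def parse_iso_duration(text):
    # Supports [+][-]P[nW][nD][T[nH][nM][nS]]; year and month components
    # are rejected, matching the restriction described above.
    match = _ISO.match(text)
    if match is None:
        raise ValueError(f'invalid duration: {text}')
    value = lambda key: float(match.group(key) or 0)
    delta = timedelta(weeks=value('w'), days=value('d'), hours=value('h'),
                      minutes=value('m'), seconds=value('s'))
    return -delta if match.group('sign') == '-' else delta

parse_iso_duration('P1W2D')    # timedelta of 9 days
parse_iso_duration('-PT1H40S')
```
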

The Python timedelta format is of the form [+][-]D day[s][,] [+]H:M:S[.s]. It is basically a days component followed by a colon-separated string giving hour, minute, and second components. The numerical values for the components do not have to fit into standard ranges. For instance, giving 2861 for the number of minutes is valid. A leading - indicates a negative days component, but the time component is always positive. For example, -1 day, 01:00:00 represents a negative duration of 23 hours.

Examples of valid Python timedelta duration strings are -1 day, +01:0:00, 100 days 1:23:20.001234, and +2 days, 1455:5:89.
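This negative-days behavior mirrors Python's own timedelta normalization, which renders a negative duration as a negative days component plus a positive time component:

```python
from datetime import timedelta

# A negative duration of 23 hours normalizes to -1 day plus 1 hour.
d = timedelta(hours=-23)
str(d)  # '-1 day, 1:00:00'
```
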

The time format is of the form [+][-]H:M:S[.s]. It is a colon-separated string giving hour, minute, and second components. The numerical values for the components do not have to fit into standard ranges. For instance, giving 242 for the number of seconds is valid. The seconds component can have fractional seconds between 1 and 19 digits. For strings with 7 or more digits, the fractional seconds will be rounded to microsecond precision.

Examples of valid time format duration strings are +05:25:0.000005 and 1455:5:89.

2.5.1.8.9. IP Address

A valid IPv4 address string is of the form XXX.XXX.XXX.XXX where each dot-separated integer value is between 0 and 255. Numbers representable with fewer than 3 digits do not need all 3 digits, but leading 0’s may be used.

An example of a valid IP address string is 172.16.254.1.

A valid IPv6 address string is in the form of 0x0x0x0x:0x0x0x0x:0x0x0x0x:0x0x0x0x:0x0x0x0x:0x0x0x0x:0x0x0x0x:0x0x0x0x where each colon separated group of hex values is a 16 bit integer. Sets of zeroes can be abbreviated with a double colon. For instance, 1234:abcd:0000:0000:0000:0000:0000:1234 can be abbreviated as 1234:abcd::1234.
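The double-colon abbreviation can be checked with Python's stdlib ipaddress module, which parses both spellings to the same address:

```python
import ipaddress

full = ipaddress.ip_address('1234:abcd:0000:0000:0000:0000:0000:1234')
short = ipaddress.ip_address('1234:abcd::1234')

# Both strings denote the same 128-bit address; str() prints the
# compressed form.
assert full == short
str(full)  # '1234:abcd::1234'
```
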

2.5.1.9. Reading into Edge Frames

Edges created from an edge frame load() or insert() will contain references to their respective source and target vertex endpoints. If any edge references a non-existent endpoint, these source or target endpoints will be automatically created.

If xGT is used with access control, this can impact loading into edge frames. If frame-level access control is used, the user requires read and create access on the source and target vertex frames. This is required regardless of whether new endpoints are created during the ingest. If row-level access control is used, the user will also need row-level access to the endpoint vertices referenced by an ingested edge. For more information on access control, see Access Control.

2.5.1.10. Reading into Vertex Frames

When reading into vertex frames, xGT disallows duplicate vertices. By default, if duplicate vertices are encountered, xGT will raise an exception. This behavior can be configured during load() or insert(), using the on_duplicate_keys parameter. The following values are supported:

  • error - Raise an exception if a duplicate key is found.

  • skip - Skip the row if the key is a duplicate.

  • skip_same - Skip the row if the key is a duplicate, and the row has the same values. Otherwise, raise an exception for non-matching row duplicates.
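The three modes behave like this pure-Python sketch over a list of rows (illustrative only, not xGT's implementation; key_col is the index of the key column):

```python
def apply_on_duplicate_keys(rows, key_col, mode='error'):
    # Keep the first row seen for each key and handle later duplicates
    # according to mode: 'error', 'skip', or 'skip_same'.
    seen, kept = {}, []
    for row in rows:
        key = row[key_col]
        if key not in seen:
            seen[key] = row
            kept.append(row)
        elif mode == 'error':
            raise ValueError(f'duplicate key: {key}')
        elif mode == 'skip_same' and row != seen[key]:
            raise ValueError(f'conflicting duplicate key: {key}')
        # 'skip' (and matching rows under 'skip_same') silently drop the row.
    return kept

apply_on_duplicate_keys([[0, 'a'], [0, 'a'], [1, 'b']], 0, 'skip_same')
# [[0, 'a'], [1, 'b']]
```
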

2.5.1.11. Customizing Data Read Into Frames via TQL Input Filters

xGT supports modifying input data as part of a load() or insert() by passing a TQL Fragment to the row_filter parameter. The fragment allows a user to transform the raw input data and map the input columns to a frame’s columns. A more detailed discussion on TQL fragments, their capabilities and requirements is presented in TQL Fragments.

2.5.1.12. Setting Row Labels

If a frame supports row security labels, they can be attached to the data during ingest. Only a user that has those security labels in their label set will be able to access that vertex, edge, or table frame row when reading data from the frame, including during a TQL query or when egesting from the frame. The label set of a user will be configured by the administrator as described in Configuring Groups and Labels.

For an ingest operation performed with either load() or insert(), it is possible to attach row labels in one of two ways. The first option is to attach the same set of labels to each new row. This is done by passing in the row_labels parameter, which should be a list or set of string labels. These labels must exist in the xgt__Label frame, as described in Configuring Groups and Labels, as well as be part of the row label universe of the frame. The following example shows adding security labels this way.

a_frame.load(
    paths = '/data/path/my_vertices.csv',
    row_labels = ["label2", "label3", "label9"])

The second option is to specify different labels for each row ingested during a single load() or insert() call. This is done by adding security labels as additional columns to the ingested data and using the row_label_columns parameter to indicate which column is a label.

When calling insert(), any non-label column must contain schema data in the appropriate order. In the example below, the vertex with key 0 is inserted with one security label: “label1”. The vertex with key 1 is inserted with two security labels: “label2” and “label4”. The vertex with key 2 is inserted with no security labels.

a_frame = conn.create_vertex_frame(
    name = 'Name',
    schema = [['id', xgt.INT], ['name', xgt.TEXT]],
    key = 'id',
    row_label_universe = ["label1", "label2", "label3", "label4", "label5"])
a_frame.insert(
    data = [[0, "val1", "label1", ""], [1, "val2", "label2", "label4"], [2, "val1", "", ""]],
    row_label_columns = [2, 3])

Note that the number of elements in each list of the data parameter must be the same for each row. To attach up to N security labels to a row for the ingest operation, the length of each list must be equal to the length of the frame schema plus N. The order of the labels in each list does not matter.

When calling load() with the row_label_columns parameter, the CSV file can contain additional non-schema columns with security labels. The number of columns in each row must be the same and if any row has fewer labels, that column element should be empty. The order of the labels in each row of the file does not matter. For example, to ingest into a vertex frame with schema [['id', xgt.INT], ['name', xgt.TEXT]], the CSV file with header might look like this:

id, name, label_col1, label_col2, label_col3
0, val1, label1,,
1, val2, label2, label4,
2, val1,,,
3, val3, label9, label8, label1

If the header mode is xgt.HeaderMode.NONE or xgt.HeaderMode.IGNORE, the row_label_columns parameter, if passed in, must be a list of integer column indices, indicating those columns that contain security labels. For example, the data in the sample file above would be ingested as follows:

a_frame.load(
    '/data/path/my_vertices.csv',
    header_mode = xgt.HeaderMode.IGNORE,
    row_label_columns = [2, 3, 4])

If the header mode is xgt.HeaderMode.NORMAL or xgt.HeaderMode.STRICT, then the header column names help determine the label columns. In this case, the row_label_columns parameter must be a list of strings, indicating the names of columns that contain security labels:

a_frame.load(
    '/data/path/my_vertices.csv',
    header_mode = xgt.HeaderMode.NORMAL,
    row_label_columns = ["label_col1", "label_col2", "label_col3"])

Note that the security label columns in the ingested CSV file can be any columns. They do not need to be placed at the end after schema columns. Only one of row_label_columns and row_labels can be passed into a load() or insert() call.

When ingesting into an edge frame, if a source or target vertex does not exist, it will be automatically created. Row labels can be attached to these automatically created endpoints via the parameters source_vertex_row_labels and target_vertex_row_labels on load() or insert(). The following example shows adding security labels this way.

an_edge_frame.load(
    '/data/path/my_edges.csv',
    source_vertex_row_labels = ["label2", "label3", "label9"],
    target_vertex_row_labels = ["label1", "label3", "label5"])

If the source and target vertex frames are the same for an edge frame, the source and target vertex security labels need to be the same on insert or load.

2.5.1.13. Handling Ingest Errors

There is a wide range of errors that may cause an ingest operation not to complete as intended. One type of error is the inability to find the file. This could occur if an incorrect path is provided, or if the UNIX user running the xGT server (xgtd) does not have read permission on the file when the xgtd:// protocol is specified.

If a correct path is given and the file permissions are satisfied, many possible errors remain, for example:

  • The frame’s schema could be wrong by having too many or too few columns.

  • The frame’s schema could be wrong by having the wrong data type given for one or more of the columns.

  • The CSV file may have a non-standard value delimiter (e.g., '|' rather than ',').

  • The process used to build the CSV file may have incorrectly formatted certain data types (e.g., not surrounding a string containing the delimiter character with double quotes).

With these kinds of ingestion errors based on the CSV format and the frame’s schema, xGT may be able to ingest only some rows of the CSV file. If that occurs, a Python exception will be raised with a message describing the outcome of the ingest operation (e.g., the number of lines that led to ingest errors) and a detailed message with up to 10 errors explaining what xGT perceived went wrong. These error messages are produced based on a limited understanding of what is really wrong. For example, it is not possible to know whether a row in the CSV is ill-formed or whether the frame’s schema was created incorrectly. So, these error messages are intended to give the user as much guidance as possible to quickly determine the cause of the problem.

If an ingest operation encounters no errors, all the ingested data is added to the graph. If an ingest operation does encounter an error, by default xGT raises an exception to inform the user of the error. However, if suppress_errors is set to True on a load and the ingest operation encounters one or more errors, xGT will insert all the rows without errors and raise an exception to inform the user of the errors. In this case of some data being correct and some data having errors, the updated frame is ready to be used; it is just missing rows for the lines that led to ingest errors. If all the ingested data leads to errors, the frame will remain unchanged. This could happen, for example, if a frame was created with an incorrect schema.

2.5.1.14. Digging Deeper into Ingest Errors

If suppress_errors is on and the user wishes to see more than 10 error messages, xGT provides a way to access up to 1,000 messages. Note that for situations such as a bad schema, when trying to ingest a 1,000,000-row CSV file, there is no way to retrieve all 1,000,000 error messages.

To see the first 1,000 errors, the user’s script needs to catch the raised XgtIOError exception. The exception object has a property holding the job object for the ingest operation, and the first 1,000 errors are available from that job object via get_ingest_errors().

The error rows contain the lines that failed together with their error messages. Each row corresponds to one line that had an ingest error:

  • The first column contains the error message.

  • The second column contains the file that contained the error, if available; otherwise it contains an empty string.

  • The third column contains the position of the error: for CSV files, the line number that contained the error (starting at 1); for insert operations, the position in the list passed into insert (starting at 0). If the position or line number is not available, this will be -1.

  • The fourth column contains the text of the line that failed to ingest, if available; otherwise it contains an empty string.

  • The remaining columns contain the data that was encoded before the failure occurred. These are usually empty since most errors occur before encoding.

try:
  job = a_frame.load('test.csv', suppress_errors = True)
except xgt.XgtIOError as e:
  error_rows = e.job.get_ingest_errors()
  # Either print the exception object and the error_rows data structure,
  # or perform some operations over the error_rows data structure.
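Because the error rows are plain Python lists, they can be processed without a server. Below is a sketch using a hypothetical helper (summarize_ingest_errors is not part of the xGT API) and hypothetical row values in the column layout described above:

```python
def summarize_ingest_errors(error_rows):
    # Each error row is laid out as:
    # [message, file, line_or_position, raw_line, encoded_columns...]
    return ["%s:%s: %s" % (row[1], row[2], row[0]) for row in error_rows]

# Hypothetical rows standing in for the output of e.job.get_ingest_errors().
rows = [["failed to parse INT in column 0", "test.csv", 7, "abc,xyz"],
        ["too many columns", "test.csv", 12, "1,2,3,4"]]
for line in summarize_ingest_errors(rows):
    print(line)
```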

2.5.2. Getting Data out of xGT

The save() method of a frame object requests that xGT write frame data to a file on the client filesystem, the server filesystem, or an S3 bucket. Saving currently supports both CSV and Parquet; saving compressed CSVs is not supported. To save as a Parquet file, use the .parquet extension when specifying the path. There are some limitations for Parquet files (see Saving Parquet Files). The signature of the save method is save(path, offset=0, length=None, headers=False, ...), and it is called on a frame object the same way as load. The protocol prefix of the path selects the destination, as with load. When saving, the ordering of data on the server may not be preserved, for performance reasons; to guarantee the ordering, set the preserve_order parameter to true. The CSV delimiter is a comma by default but can be set with the delimiter parameter.

Note that if a frame has row security protection, not all data in a frame may be visible. As described in Security Labels, an authenticated user has security labels that change the visibility of data. A Python user can only see or egest frame rows whose security labels are a subset of the labels in the user’s label set (user labels).

2.5.2.1. Saving to the Client Filesystem

This method of saving data writes frame data to a file on the filesystem local to the client. The user provides an absolute or relative path to the output file on the client’s local filesystem. This is indicated by using either no path prefix or the xgt:// protocol prefix. If there is no protocol supplied, the default behavior is to use the xgt:// protocol.

frame.save('xgt:///data/path/example.csv')
frame.save('../data/path/example.csv')
frame.save('data/interesting.data.csv', offset = 10, length = 100, headers = True)

Note that this sends data from the server to the client before writing the file, which is much slower than having the server write the file to disk, even if the client is running on the same machine as the server. It is recommended that only smaller datasets be saved directly to the client filesystem. The third frame.save() example shows how to limit the number of rows saved: it pulls at most 100 rows starting from row 10. It also demonstrates saving the column names as a header.

A saved file can be read directly into some other analytic tool such as MS Excel.
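For instance, a CSV saved with headers = True can be read straight back into pandas. In this sketch, io.StringIO stands in for the saved file:

```python
import io

import pandas as pd

# Contents equivalent to a small frame saved with headers = True.
saved_csv = io.StringIO("id,name\n0,val1\n1,val2\n")
df = pd.read_csv(saved_csv)
print(df.shape)  # (2, 2)
```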

2.5.2.2. Saving to the Server Filesystem

This method of saving data is a request for the xGT server to write frame data to the remote filesystem where xGT is running. The user provides a relative path on the server’s filesystem using the xgtd:// protocol. As with loads, the actual location on the xGT server’s filesystem must be within the configured sandbox. All paths are relative to the root sandbox directory, and any path that escapes the sandbox is invalid. When saving to the server, you can also split the save into multiple files by using the number_of_files parameter. This inserts a part number between the rest of the file name and the extension.

frame.save('xgtd://data/path/example.csv')
frame.save('xgtd://data/interesting.data.csv', offset = 10, length = 1000000, headers = True)
frame.save('xgtd://data/path/example.parquet', number_of_files = 2)

This method should be used when the data is prohibitively large. The data can be copied elsewhere after saving to the server filesystem; for large datasets it is usually faster to save the file on the server and copy it to the client system than to have xGT save the file directly to the client.

2.5.2.3. Saving to an AWS S3 Bucket

This method of saving data writes frame data to a bucket on S3. This is indicated by using the s3:// protocol prefix.

frame.save('s3://bucket/path/example.csv')
frame.save('s3://bucket/path/example.parquet', number_of_files = 2)

You will need to have S3 credentials set for this (see S3 Credentials).

2.5.2.4. Saving to the Client

Data can be stored in an xgt.TableFrame, xgt.VertexFrame, xgt.EdgeFrame, or xgt.Job. To download data to the Python client, call the get_data() method of a frame or job object. This method returns data as a list of lists, a pandas DataFrame, or an Apache Arrow Table. Data from these methods will always be in the same order as it is stored on the server.

Values are returned as the native Python, pandas, or Arrow types corresponding to the xGT types. Since pandas doesn’t have date, time, or IP address types, the native Python values are returned in pandas Series of ‘object’ dtype for those types. Since Arrow doesn’t have an IP address type, IP addresses are returned as strings.

A list of mappings from xGT types to Python and pandas types is provided below. The pandas type is given as the dtype of the Series. When the dtype is object, the type of the values in the Series is also given as dtype::value_type.

xGT to Python/Pandas Type Mapping

xGT Type    Python Type   Pandas Type           Observations
BOOLEAN     bool          bool
INT         int           int64
FLOAT       float         float32
DATE        date          object::date          In datetime module.
TIME        time          object::time          In datetime module.
DATETIME    datetime      datetime64[ns]        In datetime module.
DURATION    timedelta     timedelta64[ns]       In datetime module.
IPADDRESS   IPv4Address   object::IPv4Address   In ipaddress module.
TEXT        str           object::str
LIST        list          object::array         The element type will use the above mappings when converting and the nesting levels will be equivalent.

See xGT to Arrow Type Mapping for mappings from xGT to Arrow Flight types.
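The object-dtype behavior in the mapping above can be checked with pandas alone. This sketch simply builds Series from the same Python value types that get_data() returns; no xGT server is involved:

```python
import datetime
import ipaddress

import pandas as pd

# pandas has no native date, time, or IP address dtype, so those values
# land in Series of 'object' dtype, while INT maps cleanly to int64.
dates = pd.Series([datetime.date(2024, 1, 1), datetime.date(2024, 1, 2)])
ips = pd.Series([ipaddress.IPv4Address('10.0.0.1')])
ints = pd.Series([1, 2, 3])

print(dates.dtype, ips.dtype, ints.dtype)  # object object int64
```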

2.5.2.5. Saving Parquet Files

xGT supports writing Parquet files to both the server and client filesystems. The headers parameter isn’t supported for Parquet. When writing Parquet files, xGT uses Snappy compression, a row group size of 10,000, and column names corresponding to the names of the columns in the frame.

xGT maps its native data types to Parquet types and logical types in the following manner (See Parquet Logical Types for more information about how typing works in Parquet files.):

xGT to Parquet Type Mapping

xGT Type       Parquet Type                      Observations
BOOLEAN        BOOLEAN                           1-bit boolean.
INT            Logical INT(64, true)             Signed 64-bit integer.
UINT           Logical INT(64, false)            Unsigned 64-bit integer.
FLOAT          FLOAT                             IEEE 32-bit floating point number.
DATE           Logical DATE                      32-bit int representation.
TIME           Logical TIME(true, MICROS)        64-bit time representation with microsecond precision and UTC adjustment.
DATETIME       Logical TIMESTAMP(true, MICROS)   64-bit timestamp representation with microseconds since Unix epoch and UTC adjustment.
DURATION [8]   Logical INT(64, true)             64-bit duration representation as microseconds.
IPADDRESS      Logical STRING                    UTF-8 encoded byte array.
TEXT           Logical STRING                    UTF-8 encoded byte array.
LIST           List                              The element type will use the above mappings when converting and the nesting levels will be equivalent.

2.5.2.6. Saving Row Labels

If a frame supports row labels, those labels can be egested along with the frame data by setting the parameter include_row_labels to true. By default, it is set to false and no labels are egested.

When retrieving data with get_data() passing include_row_labels = True, the labels for each row will be appended as columns following the data columns. Each row will have one column per security label found in row_label_universe. In other words, there will be one column per security label that is both in the frame’s row label universe and in the user’s label set. If the corresponding label is not attached to that row, the column will be the empty string. The example below shows retrieving data from a frame with a row label universe of “label1”, “label2”, “label4”.

# *conn* is a connection to the server.
a_frame = conn.create_vertex_frame(
    name = 'Name',
    schema = [['id', xgt.INT], ['name', xgt.TEXT]],
    key = 'id',
    row_label_universe = ["label1", "label2", "label4"])
a_frame.insert(
    data = [[0, "val1", "label1", ""], [1, "val2", "label2", "label4"], [2, "val1", "", ""]],
    row_label_columns = [2, 3])

data = a_frame.get_data(include_row_labels = True)
# data is:
# [[0, 'val1', 'label1', '', ''], [1, 'val2', '', 'label2', 'label4'], [2, 'val1', '', '', '']]

When egesting with save(), security labels are written as additional columns to the CSV file if the parameter include_row_labels is set to true. Each row in the file will have one column per security label in the frame’s row label universe visible to the user. The parameter row_label_column_header can be passed in to indicate the header column name for any label column in the file. Otherwise, the default label column name is “ROWLABEL”. This example egests a frame with row labels to a file.

# *conn* is a connection to the server.
a_frame = conn.create_vertex_frame(
    name = 'Name',
    schema = [['id', xgt.INT], ['name', xgt.TEXT]],
    key = 'id',
    row_label_universe = ["label1", "label2", "label4"])
a_frame.insert(
    data = [[0, "val1", "label1", ""], [1, "val2", "label2", "label4"], [2, "val1", "", ""]],
    row_label_columns = [2, 3])

a_frame.save("FrameOutput.csv", include_row_labels = True,
             row_label_column_header = "security_label")

The resulting content of “FrameOutput.csv” is shown below:

id, name, security_label, security_label, security_label
0, val1, label1, ,
1, val2, , label2, label4
2, val1, , ,

The include_row_labels parameter can also be passed to get_data() so that the labels are returned as additional columns in the pandas DataFrame. The optional parameter row_label_column_header is used to set the name of the label columns, which is by default “ROWLABEL”.

pandas_data_frame = a_frame.get_data(format = 'pandas',
                                     include_row_labels = True)

2.5.3. Performance Considerations

For filesystem loads and saves, reading or writing from the server is orders of magnitude faster than from the client. This is true even if the client is running on the same machine as the server. The server is able to employ parallelism that isn’t available in the Python client when loading and saving files. Additionally, data must be serialized, sent through a gRPC channel, and unserialized when transferring between client and server.

2.5.4. S3 Credentials

When loading or saving a file using the load() or save() method and an s3: protocol prefix, xGT will check the xgtd user’s home directory on the server for the .aws/credentials file with the variables aws_access_key_id and aws_secret_access_key. If the credentials file doesn’t exist, xGT will check for the configuration variables aws.access_key_id and aws.secret_access_key. The online AWS Access Keys document explains what these keys are and contains references to learn how to create and manage them. (These values may also be specified at runtime in user code by passing them to a Connection object.)

When reading the .aws/credentials file, xGT also supports profile selection via the AWS_PROFILE environment variable. If no environment variable is found, xGT will use the default profile.
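A minimal .aws/credentials file in the xgtd user’s home directory follows the standard AWS layout; the profile name other-profile and the placeholder values below are only examples:

```ini
[default]
aws_access_key_id = DEFAULT_ACCESS_KEY_ID
aws_secret_access_key = DEFAULT_SECRET_ACCESS_KEY

[other-profile]
aws_access_key_id = OTHER_ACCESS_KEY_ID
aws_secret_access_key = OTHER_SECRET_ACCESS_KEY
```

To select the second profile, set AWS_PROFILE=other-profile in the xgtd server’s environment.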

The Connection object also supports temporary credentials and AWS role-based access. To utilize temporary credentials, including the session token, they must be provided to the connection object directly. For role-based access, simply pass the temporary credentials obtained from the role to the xGT connection.

Below is an example demonstrating how to use the AWS Python SDK (‘Boto3’) with a role:

import boto3
import xgt

sts_client = boto3.client('sts')

assumed_role_object = sts_client.assume_role(
  RoleArn="arn:aws:iam::my-account:role/my-role",
  RoleSessionName="AssumeRoleSession"
)

credentials = assumed_role_object['Credentials']

connection = xgt.Connection(
  flags = {
    'aws_access_key_id' : credentials['AccessKeyId'],
    'aws_secret_access_key' : credentials['SecretAccessKey'],
    'aws_session_token' : credentials['SessionToken']
  }
)

2.5.5. Arrow Flight Server Endpoint

In addition to the native data movement capabilities provided by xGT’s client and server, xGT also provides an Arrow Flight Server endpoint. The Flight server endpoint is serviced from the same gRPC port as xGT’s native data movement (4367). Arrow clients that use the Arrow port on the xGT server can be written in different languages, including Python (pyarrow), C++, Java, and many other languages used for data analytics.

Each frame that the user has access to is mapped to an individual Flight object, using a two-level path Flight descriptor corresponding to the namespace and the name of each frame (e.g., graph__VertexFrame is mapped to a path descriptor for (graph, VertexFrame)). Note that frames must be specified using their fully qualified name, that is, including their namespace, when accessed through the Arrow endpoint.

The frame’s xGT schema is mapped to a compatible schema using Arrow data types.

Currently, xGT’s Arrow Flight server implements get and put capabilities for each individual frame/Flight according to the user’s access permissions. It does not currently support exchange or action protocols. Invoking the list flights capability on xGT’s Arrow server is a convenient way to list all frames the user has access to, together with their mappings to Flight paths, their Arrow Flight schemas, and other relevant properties. More information on Arrow’s client and server interfaces is available at: Arrow Python bindings (for Python) and Arrow documentation (for more general documentation and other languages).

Data I/O operations through xGT’s Arrow server utilize the same authentication, access control and transactional backend components as xGT’s native I/O operations. This allows for safe, authenticated, concurrent access to frames via Arrow Flight operations, native I/O operations, as well as read-only and read-write TQL queries.

It is possible to have simultaneous connections to a single xGT server through the Arrow client and through the native xGT client. The following example connects to xGT’s Arrow Flight server using the Arrow Python client pyarrow and the native xGT client xgt.

import pyarrow.flight
import xgt

# Connect to the xgtd server through Arrow Flight.  The default xGT server
# port is 4367.
arrow_conn = pyarrow.flight.FlightClient(("localhost", 4367))
# Supply valid authentication credentials.
# basic_client_auth_handler must be a client authentication handler that
# implements the pyarrow.flight.ClientAuthHandler interface.
arrow_conn.authenticate(basic_client_auth_handler)

# Create a regular connection to the xgtd server, also supplying valid
# credentials.
conn = xgt.Connection()

# Create a frame on the xGT server.
frame = conn.create_table_frame(
  name   = 'graph__Table',
  schema = [['col0', xgt.INT]])

# Insert some data into the frame using xGT's native I/O protocol.
row_data = [ [x] for x in range(0, 100) ]
frame.insert(row_data)

# Now get the data back from xGT in Arrow format from the Arrow Flight server
# endpoint.
xgt_table = arrow_conn.do_get(pyarrow.flight.Ticket(b"graph__Table")).read_all()

# Convert the retrieved Arrow-formatted data into a pandas DataFrame.
pandas_table = xgt_table.to_pandas()

The example presented above illustrates using the native xGT client to create a frame and write data and using the Arrow client to read data. Of particular interest is the fact that the example maintains two separate connections to the server: an Arrow connection and a native xGT client connection.

xGT maps its native data types to Arrow Flight data types in the following manner during a do_get:

xGT to Arrow Type Mapping

xGT Type    Arrow Type      Observations
BOOLEAN     boolean
INT         int64           Signed 64-bit integer.
UINT        uint64          Unsigned 64-bit integer.
FLOAT       float32         32-bit floating point number.
DATE        date32          32-bit date.
TIME        time64[us]      64-bit time with microsecond precision.
DATETIME    timestamp[us]   64-bit timestamp with microsecond precision.
DURATION    duration[us]    64-bit duration with microsecond precision.
IPADDRESS   string          String representation of an IPv4 address.
TEXT        string          utf8 string.
LIST        list            The element type will use the above mappings when converting and the nesting levels will be equivalent.

xGT can map the following Arrow Flight data types to xGT types during a do_put:

Arrow to xGT Type Mapping

Arrow Type                 xGT Type
boolean                    Boolean
int (64, 32, 16, 8)        Integer
uint (64, 32, 16, 8)       Unsigned Integer
float (64, 32)             Float
decimal (256, 128)         Integer, Unsigned Integer (128 bit only)
string/large_string/utf8   Text, Ipaddress
date (64, 32)              Date
time (64, 32)              Time
timestamp                  Datetime
duration                   Duration
null                       Any type (null values)
list                       List

The example below illustrates the use of Arrow’s put capability to send some local data (stored in Arrow format) to xGT’s Arrow endpoint for the frame graph__Table. It also illustrates how xGT types are mapped to Arrow types.

# Create a frame on the xGT server.
frame = conn.create_table_frame(
  name   = 'graph__Table',
  schema = [['col0', xgt.INT], ['col1', xgt.TEXT]])

# Set up a local Arrow table to be sent to the server.

# Note that Arrow data is stored in columnar format ("column-major") instead
# of row format ("row-major").
num_rows = 100
data = [ [ i for i in range(num_rows) ],
         [ "val" + str(i) for i in range(num_rows) ] ]

fields = [ pyarrow.field('col0', pyarrow.int64()),
           pyarrow.field('col1', pyarrow.utf8()) ]

table = pyarrow.Table.from_arrays(data, schema = pyarrow.schema(fields))

writer, _ = arrow_conn.do_put(
  pyarrow.flight.FlightDescriptor.for_path('graph', 'Table'), table.schema)
writer.write_table(table)
writer.close()

Additional options that configure how data transfers through xGT’s Arrow endpoint can be provided as part of the Flight descriptor. When building a Flight descriptor for a put, additional parameters specified at the end of the path supply these configuration options. For example, implicit vertex creation can be disabled when reading in data for an edge frame/Flight:

writer, _ = arrow_conn.do_put(
  pyarrow.flight.FlightDescriptor.for_path('ns', 'e1', '.implicit_vertices=false'),
  e_table.schema)
# Multiple parameters are passed like so:
writer, _ = arrow_conn.do_put(
  pyarrow.flight.FlightDescriptor.for_path('ns', 'e1', '.implicit_vertices=false',
                                           '.label_column_indices=1,4'),
  e_table.schema)

The following I/O options are supported by xGT’s Arrow endpoint when writing frames:

Arrow I/O put options for xGT

.suppress_errors
    If set to true, ingests all valid rows; if errors are encountered, raises an exception after all rows have been ingested, reporting up to 1,000 errors. Otherwise, raises on the first error encountered and does not ingest the data. By default, this is false.

.on_duplicate_keys
    For vertex frames, changes the error behavior when encountering a duplicate key. Allowed values are: ‘error’, raise an exception when a duplicate key is found; ‘skip’, skip duplicate keys without raising; ‘skip_same’, skip duplicate keys without raising if the row is exactly the same. By default, this is ‘error’.

.implicit_vertices
    If set to true, vertex instances are created if they don’t exist when reading in data for an edge frame/Flight. By default, this is true.

.labels
    List of row labels to add for row-level data when reading in data for a frame/Flight. This is optional.

.label_column_names
    List of column names that store labels for row-level data when reading in data for a frame/Flight. Cannot be combined with .labels. This is optional.

.implicit_source_vertex_labels
    List of the default row-level labels used for implicit source vertex instances. This is optional.

.implicit_target_vertex_labels
    List of the default row-level labels used for implicit target vertex instances. This is optional.

.label_column_indices
    List of the column positions that store labels for row-level data when reading in data for a frame/Flight. This is optional.

.map_column_names
    List of mappings in the format key:value, where key is the Arrow column name to map to the frame column name. Example: [arrow_col1:xgt_col2, arrow_col2:xgt_col1]. This is optional.

.map_column_ids
    List of mappings in the format key:value, where key is the Arrow column name to map to the frame column position. Example: [arrow_col1:2, arrow_col2:1]. This is optional.

When creating a ticket for a Flight read, specify the fully qualified name of the frame followed by optional parameters in the format NAME[.OPTION=VALUE]. An example of reading a table with a specified offset and maximum number of rows to return:

xgt_table = arrow_conn.do_get(pyarrow.flight.Ticket(b"graph__Table.offset=10.length=50")).read_all()
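A small hypothetical helper (make_ticket is not part of the xGT or pyarrow APIs) can assemble such tickets from keyword options:

```python
def make_ticket(frame_name, **options):
    # Build a Flight ticket in the NAME[.OPTION=VALUE] format described
    # above, e.g. b'graph__Table.offset=10.length=50'.
    parts = [frame_name] + ["%s=%s" % (k, v) for k, v in options.items()]
    return ".".join(parts).encode()

ticket_bytes = make_ticket("graph__Table", offset=10, length=50)
print(ticket_bytes)  # b'graph__Table.offset=10.length=50'

# Usage against a server would then be (requires an authenticated arrow_conn):
# xgt_table = arrow_conn.do_get(pyarrow.flight.Ticket(ticket_bytes)).read_all()
```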

The following I/O options are supported by xGT’s Arrow endpoint when retrieving frames:

Arrow I/O get options for xGT

.offset
    Index of the first row to retrieve. Default is 0.

.length
    Maximum number of rows to retrieve. If not set, all rows are returned.

.order
    If set to true, the order of the data will be the same as it is stored on the server; otherwise, the order is not guaranteed. Default is false.

.egest_row_labels
    If set to true, the security labels for each row are egested along with the row. Default is false.

.duration_as_interval
    If set to true, Duration values are returned in the Parquet Interval binary type representation. Default is false.

.label_column_header
    If set to a string, egested row label columns use this value as their column name. Default is ‘ROWLABEL’.