2.4. Frame Management

This section describes how to manage frames, which are used to store data in xGT. A detailed description of the frame data model is found in Frames and Namespaces.

Frames in xGT can be created directly by specifying the frame schema. In this case frames are empty after creation and data must be loaded in as a separate step. The API calls are create_vertex_frame(), create_edge_frame(), and create_table_frame().

Frames can also be created implicitly from a data source. In this case, the schema is inferred from the data, the frame is created, and the data is loaded in one step. The API calls are create_vertex_frame_from_data(), create_edge_frame_from_data(), and create_table_frame_from_data().

Frames can be retrieved by name. The API calls are get_frame() and get_frames().

Frames can be deleted by dropping them. The API calls are drop_frame() and drop_frames().

A frame’s schema can be modified by adding, deleting, or reordering columns. The API calls are append_columns(), delete_columns(), and modify_columns().

2.4.1. Direct Frame Creation

Frames in xGT can be created by the client API calls create_vertex_frame(), create_edge_frame() and create_table_frame(). Each of these API calls requires passing in the frame name and schema.

If the name parameter is a fully qualified name, the frame is created in the namespace specified. If it contains only the name of the frame, the frame is created in the default namespace. For more information on namespaces, see Frames and Namespaces. If a frame of the specified name already exists in the namespace, an exception is raised and the frame is not created again.

The schema describes the names and data types of each column of a frame. These columns correspond to the properties of each element of the frame. In the example below, the Employees frame has a schema with three columns and any vertex that will belong to this frame will have these three properties (the value of a property can be null if that column is not a key column):

import xgt

server = xgt.Connection()

person_frame = server.create_vertex_frame(name = 'Employees',
                                          schema = [['person_id', xgt.INT],
                                                    ['name', xgt.TEXT],
                                                    ['start_date', xgt.DATE]],
                                          key = 'person_id')

A list of supported data types is found in Data Movement.

To create a vertex frame, one column of the schema must be specified as the key with the key parameter and will be used to uniquely identify vertices in the frame.

To create an edge frame, additional parameters specifying the source and target vertices must be provided. The source and target parameters indicate which vertex frames the edge frame is connecting. Each of these is either the name of a vertex frame or a VertexFrame object. They can be the same or different frames. The source_key and target_key parameters indicate which columns of the edge frame’s schema are used to identify the source vertex and target vertex, respectively, of the edge. For each row of the edge frame, the value assigned to the source_key column corresponds to the key column of a row in the vertex frame given by the source parameter. Similarly, for each row of the edge frame, the value assigned to the target_key column corresponds to the key column of a row in the vertex frame given by the target parameter. The remaining columns are properties of the edge.

2.4.1.1. Example

import xgt

server = xgt.Connection()

person_frame = server.create_vertex_frame(name = 'Employees',
                                          schema = [['person_id', xgt.INT],
                                                    ['name', xgt.TEXT],
                                                    ['start_date', xgt.DATE]],
                                          key = 'person_id')

company_frame = server.create_vertex_frame(name = 'Companies',
                                           schema = [['company_id', xgt.INT],
                                                     ['name', xgt.TEXT],
                                                     ['profit', xgt.INT]],
                                           key = 'company_id')

friend_frame = server.create_edge_frame(name = 'FriendsWith',
                                        source = 'Employees',
                                        target = 'Employees',
                                        schema = [['source_id', xgt.INT],
                                                  ['target_id', xgt.INT]],
                                        source_key = 'source_id',
                                        target_key = 'target_id')

work_frame = server.create_edge_frame(name = 'WorksFor',
                                      source = 'Employees',
                                      target = 'Companies',
                                      schema = [['source_id', xgt.INT],
                                                ['target_id', xgt.INT],
                                                ['position', xgt.TEXT],
                                                ['years' , xgt.INT]],
                                      source_key = 'source_id',
                                      target_key = 'target_id')

The frame Employees contains vertices representing people. The key column person_id uniquely identifies each vertex, while the columns name and start_date give additional information, which need not be unique. Note that because the frame schema of Employees has three columns, each vertex within it must have three properties, one of them the unique key. The Companies frame contains vertices representing companies, with the company_id column uniquely identifying each vertex.

Frames FriendsWith and WorksFor define two edge frames, each connecting already defined vertex frames. With direct frame creation using create_edge_frame(), the source and target frames must be created before an edge frame is created. Note that an edge frame may connect vertices from the same vertex frame, as FriendsWith does, or may connect vertices from two different frames, as WorksFor does.
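The relationship between an edge frame's key columns and the keys of its endpoint vertex frames can be illustrated without a running server. The sketch below is plain Python and does not call the xGT API: the function find_dangling_edges, the key values, and the sample rows are all hypothetical. It only shows what it means for a source_key or target_key value to correspond to a vertex key.

```python
# Plain-Python illustration; none of these names are part of the xGT API.

def find_dangling_edges(edge_rows, source_keys, target_keys,
                        source_col=0, target_col=1):
    """Return the edge rows whose source or target key has no matching vertex."""
    dangling = []
    for row in edge_rows:
        if row[source_col] not in source_keys or row[target_col] not in target_keys:
            dangling.append(row)
    return dangling

# Hypothetical key values held by the Employees and Companies frames.
employee_ids = {1, 2, 3}
company_ids = {100, 200}

# Rows shaped like the WorksFor schema: source_id, target_id, position, years.
works_for_rows = [
    [1, 100, 'engineer', 4],
    [2, 200, 'analyst', 1],
    [4, 100, 'manager', 7],   # source_id 4 is not an Employees key
]

dangling = find_dangling_edges(works_for_rows, employee_ids, company_ids)
print(dangling)
```

In this sample, only the third row refers to a vertex that is missing from Employees.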

2.4.1.2. List Columns

Lists are supported for schema columns in all types of frames. To specify a list column in a schema the following is used:

['<column name>', xgt.LIST, <base type>, <depth>]

The column name can be any valid xGT column name. The type of the column must be xgt.LIST with the base type corresponding to any of xGT’s non-list types. The depth specifies how many levels of nesting are to be used. The default depth is 1, so specifying a simple list of integers can be done as follows:

['integer_list', xgt.LIST, xgt.INT]

A list of lists of integers is specified as:

['nested_integer_list', xgt.LIST, xgt.INT, 2]

List columns cannot be used as key columns for vertex frames and edge frames.
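As a rough illustration of the entry format described above, the following sketch builds list column entries with a small helper. The helper list_column is hypothetical and not part of xGT, and the constants LIST and INT are plain-string stand-ins for xgt.LIST and xgt.INT so that the snippet runs without the xgt package installed.

```python
# Stand-ins for xgt.LIST and xgt.INT (assumptions for illustration only).
LIST = 'list'
INT = 'int'

def list_column(name, base_type, depth=1):
    """Build a schema entry for a list column (hypothetical helper)."""
    if depth < 1:
        raise ValueError('depth must be at least 1')
    if base_type == LIST:
        raise ValueError('the base type must be a non-list type')
    entry = [name, LIST, base_type]
    # The depth is only written out when it differs from the default of 1.
    if depth > 1:
        entry.append(depth)
    return entry

simple = list_column('integer_list', INT)
nested = list_column('nested_integer_list', INT, depth=2)
print(simple)
print(nested)
```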

2.4.2. Implicit Frame Creation

xGT supports implicit frame creation from input data through the methods: create_vertex_frame_from_data(), create_edge_frame_from_data(), and create_table_frame_from_data(). These methods provide an easier way to create graphs in xGT, combining the frame creation and data loading steps, as well as automatically inferring the schema of the frame.

If the name parameter is a fully qualified name, the frame is created in the namespace specified. Otherwise, it is created in the default namespace. For more information on namespaces, see Frames and Namespaces. If a frame of the specified name already exists in the namespace, an exception is raised and the frame is not created again.

2.4.2.1. Implicit Vertex Frame Creation

To create a vertex frame from data using create_vertex_frame_from_data(), the data source, frame name, and key name must be passed in. The key parameter is the name of the column that contains the unique key identifying each vertex. If the data source is a CSV file with no header, then the key parameter should be an integer representing the position of the key column.

The example below shows creating a vertex frame named my_vertex from a pyarrow table, which must contain a column named id.

a_frame = conn.create_vertex_frame_from_data(pytab, name = 'my_vertex', key = 'id')

2.4.2.2. Implicit Edge Frame Creation

To create an edge frame from data using create_edge_frame_from_data(), the data, frame name, source vertex frame name, target vertex frame name, source key, and target key must be passed in.

The source and target parameters indicate which vertex frames the edge frame is connecting. Each of these is either the name of a vertex frame or a VertexFrame object. If either endpoint vertex frame does not already exist, it is created with a single column named “id”. Otherwise, the existing vertex frame is used if its schema is compatible.

The source_key and target_key parameters indicate which columns of the edge frame’s schema are used to identify the source vertex and target vertex, respectively, of the edge. These parameters should be string column names or in the case of a CSV file with no header, integer column positions. If the source key or target key columns of the data refer to any vertices not already in the source or target vertex frames, these will be implicitly inserted into the vertex frames.

The example below creates an edge frame named WorksFor along with its two endpoint vertex frames. The source vertex frame, Employees, is created beforehand, while the target vertex frame, Companies, is created automatically. For each edge, the employee_id column of data.csv identifies the source vertex in the Employees vertex frame, and the department_id column of data.csv identifies the target vertex in the Companies vertex frame.

conn.create_vertex_frame_from_data('employees.csv', name = 'Employees', key = 'person_id')

conn.create_edge_frame_from_data('data.csv', name = 'WorksFor',
                                 source = 'Employees', target = 'Companies',
                                 source_key = 'employee_id', target_key = 'department_id')
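To make the implicit insertion of vertices more concrete, the sketch below works out which keys would have to be added to each endpoint frame for a small, made-up data set. It is plain Python rather than xGT code: the function implicitly_inserted_keys and the sample rows are hypothetical, and the actual insertion is done by the server.

```python
# Plain-Python illustration; none of these names are part of the xGT API.

def implicitly_inserted_keys(existing_keys, edge_rows, key_column):
    """Return the edge endpoint keys that are not yet in the vertex frame."""
    referenced = {row[key_column] for row in edge_rows}
    return sorted(referenced - set(existing_keys))

# Hypothetical contents: Employees already holds keys 1 and 2,
# while Companies does not exist yet and therefore starts empty.
employee_keys = {1, 2}
company_keys = set()

# Hypothetical rows of data.csv.
edge_rows = [
    {'employee_id': 1, 'department_id': 10},
    {'employee_id': 3, 'department_id': 10},
    {'employee_id': 2, 'department_id': 20},
]

new_employees = implicitly_inserted_keys(employee_keys, edge_rows, 'employee_id')
new_companies = implicitly_inserted_keys(company_keys, edge_rows, 'department_id')
print(new_employees, new_companies)
```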

2.4.2.3. Implicit Table Frame Creation

To create a table frame from data using create_table_frame_from_data(), the data source and table frame name must be passed in. The example below shows creating a table frame named my_table from a Parquet file on the server filesystem.

a_frame = conn.create_table_frame_from_data('xgtd://data.parquet', name = 'my_table')

2.4.2.4. Schema Inference

The methods create_vertex_frame_from_data(), create_edge_frame_from_data(), and create_table_frame_from_data() allow the xGT frame schema, including the column names and data types, to be automatically inferred from the source data.

The schema inference can also be done directly with get_schema_from_data(). This returns an xGT frame schema from the data without creating any frame. This method can be used to see what schema would be used before creating a graph frame from the data. It also allows the user to adjust the schema before it is used to create a frame.

While xGT will try to infer data types, users who want more control over specifying the data types or column names of the schema can pass in a schema.

The example below shows using get_schema_from_data() to get a schema from the data, modifying its third column, and then passing the schema into create_vertex_frame_from_data(). The vertex frame is then created with a schema that includes a column of type IPADDRESS named “source_ip”.

# Get a schema from a CSV file.
my_schema = conn.get_schema_from_data('data.csv')
# Change the name of the third column in the schema.
my_schema[2][0] = 'source_ip'
# Change the data type of the third column of the schema.
my_schema[2][1] = xgt.IPADDRESS

# Create a frame from the data, but pass in the manually adjusted schema.
a_frame = conn.create_vertex_frame_from_data('data.csv', name = 'v', key = 'source_ip',
                                             schema = my_schema)

Any schema can be passed in as long as it is compatible with the data source. It does not have to be a schema returned by get_schema_from_data().

Scalar and list data types are supported in the input data for automatic schema inference. The following xGT data types can be automatically inferred from data:

  • BOOLEAN

  • INT

  • UINT

  • FLOAT

  • DATE

  • TIME

  • DATETIME

  • DURATION

  • TEXT

  • Nested lists of any of these types.

Schema inference will fail and raise an XgtTypeError if the data contains any unsupported types.

2.4.2.5. Supported Data Sources

Frames can be automatically created from the following data sources:

  • Arrow Table

  • Pandas DataFrame

  • CSV file(s)

  • Parquet file(s)

Arrow Tables

To create an xGT frame from an Arrow Table, use pyarrow. The data_source parameter must be a pyarrow Table.

The pyarrow Table schema must contain only the following data types:

  • boolean: maps to xGT BOOLEAN

  • signed integer: maps to xGT INT

  • unsigned integer: maps to xGT UINT

  • float32: maps to xGT FLOAT

  • float64: maps to xGT FLOAT

  • decimal128: maps to xGT FLOAT

  • decimal256: maps to xGT FLOAT

  • time32: maps to xGT TIME

  • time64: maps to xGT TIME

  • date32: maps to xGT DATE

  • date64: maps to xGT DATETIME

  • timestamp: maps to xGT DATETIME

  • duration: maps to xGT DURATION

  • string: maps to xGT TEXT

  • list: maps to an xGT list of the appropriate underlying type.
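The mapping above can be written down as a lookup table. The sketch below is only an illustration of that table in plain Python: the labels are the ones used in the list above (not pyarrow type objects), the function xgt_type_for is hypothetical, and writing a list as a ('list', child) tuple is a convention chosen for this example.

```python
# Arrow type labels and xGT type labels, taken from the list above.
ARROW_TO_XGT = {
    'boolean': 'BOOLEAN',
    'signed integer': 'INT',
    'unsigned integer': 'UINT',
    'float32': 'FLOAT',
    'float64': 'FLOAT',
    'decimal128': 'FLOAT',
    'decimal256': 'FLOAT',
    'time32': 'TIME',
    'time64': 'TIME',
    'date32': 'DATE',
    'date64': 'DATETIME',
    'timestamp': 'DATETIME',
    'duration': 'DURATION',
    'string': 'TEXT',
}

def xgt_type_for(arrow_type):
    """Map an Arrow type label to an xGT type label (hypothetical helper).

    A list is written as ('list', child_type) and maps to
    ('LIST', base_type, depth).
    """
    depth = 0
    while isinstance(arrow_type, tuple) and arrow_type[0] == 'list':
        depth += 1
        arrow_type = arrow_type[1]
    if arrow_type not in ARROW_TO_XGT:
        raise TypeError(f'unsupported Arrow type: {arrow_type!r}')
    base = ARROW_TO_XGT[arrow_type]
    return ('LIST', base, depth) if depth else base

print(xgt_type_for('date64'))
print(xgt_type_for(('list', ('list', 'signed integer'))))
```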

Pandas DataFrames

To create an xGT frame from pandas, the data_source parameter must be a pandas DataFrame. The data types are first translated to pyarrow types, which then map to xGT types as described above.

CSV files

To create an xGT frame from a CSV file, the data source parameter should be a string with the file path. It can also be a list of file names, possibly with wildcards in them. The files can be on the file system local to the client, local to the server, or on the web. As described in Data Movement, a file on the server filesystem is indicated with the xgtd:// protocol. Files on the web use standard web addressing (URLs), such as https:// or s3://.

When creating from a CSV file, the delimiter and header_mode parameters can be passed in to parse the file. Note that if the CSV file has no header, the column names of the inferred schema will be named by default: “f0”, “f1”, etc. Otherwise, the header is used to assign schema column names.
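The column naming rule can be sketched with the Python standard library. This is not how xGT parses files; the function inferred_column_names and its has_header flag are hypothetical stand-ins for the effect of the header_mode parameter, and the sample text is made up.

```python
import csv
import io

def inferred_column_names(csv_text, has_header, delimiter=','):
    """Return the column names a schema would get under the rule above."""
    reader = csv.reader(io.StringIO(csv_text), delimiter=delimiter)
    first_row = next(reader)
    if has_header:
        # The header row supplies the column names.
        return first_row
    # Without a header, columns are named f0, f1, ... by position.
    return [f'f{i}' for i in range(len(first_row))]

with_header = inferred_column_names('person_id,name\n1,Ann\n', has_header=True)
without_header = inferred_column_names('1|Ann|2020-01-01\n', has_header=False,
                                       delimiter='|')
print(with_header)
print(without_header)
```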

The data in the file is first translated to pyarrow types, which then map to xGT types as described above. CSV files with non-uniform columns aren’t supported.

List support for CSV files is limited to lists of depth one, that is, lists whose immediate child elements are of a scalar type.

Parquet files

To create an xGT frame from a Parquet file, the data source parameter should be a string with the file path. It can also be a list of Parquet file names, possibly with wildcards in them. The files can be on the file system local to the client, local to the server, or on the web. A file on the server filesystem is indicated with the xgtd:// protocol. Files on the web use standard web addressing (URLs), such as https:// or s3://. For any additional restrictions on loading into xGT from a Parquet file, see Loading Parquet Files.

The data in the file is first translated to pyarrow types, which then map to xGT types as described above.

Wildcards for file names

Wildcards are supported for file names in both the xgt:// and xgtd:// protocols, as described in the section Getting Data into xGT. Wildcards are expanded using the normal filesystem expansion rules of the operating system on the client or server machine, respectively. For example, xgt://mylocal.csv.* expands to the files matching the prefix mylocal.csv. in the directory where the client script is run. Note that with wildcard expansion, all files must have the same column names if column mapping is used. They should also have the same number of columns.
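For readers unfamiliar with filesystem-style wildcards, the sketch below shows which names a pattern such as mylocal.csv.* matches, using Python's standard fnmatch module on a made-up list of file names. The function expand_wildcard is hypothetical, the protocol prefix is left out, and the expansion xGT performs may differ in details from this illustration.

```python
from fnmatch import fnmatchcase

def expand_wildcard(pattern, available_files):
    """Return the file names that match a shell-style pattern, sorted."""
    return sorted(name for name in available_files if fnmatchcase(name, pattern))

# Hypothetical contents of the directory where the client script runs.
files = ['mylocal.csv.0', 'mylocal.csv.1', 'mylocal.csv', 'other.csv.0']

matches = expand_wildcard('mylocal.csv.*', files)
print(matches)
```

Note that mylocal.csv itself is not matched, because the pattern requires the trailing period of the prefix mylocal.csv. to be present.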

Lists of file names

Lists of file names are supported as a data source on the client, the server, and the web. An example of this would be ['xgtd://workers.parquet', 'xgtd://persons.parquet']. All files in the list must have the same column names if column mapping is used. They should also have the same number of columns. Wildcards can be used in the list elements as well: ['xgtd://myfile.parquet', 'xgtd://mysource.parquet.*']. Mixed client, server, or web file locations are supported, but will run as separate transactions.
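One way to picture a list with mixed locations is to group its entries by where each file lives. The sketch below does this in plain Python. It rests on several assumptions that are not stated above: that a path with no protocol is treated as a client path, that only the https:// and s3:// prefixes mark web locations, and that the separate transactions correspond to these groups. The function group_by_location is hypothetical.

```python
def group_by_location(paths):
    """Group data source paths by location (hypothetical helper)."""
    groups = {'client': [], 'server': [], 'web': []}
    for path in paths:
        if path.startswith('xgtd://'):
            groups['server'].append(path)
        elif path.startswith(('https://', 's3://')):
            groups['web'].append(path)
        else:
            # Assumption: xgt:// paths and bare paths are client-side files.
            groups['client'].append(path)
    return groups

# Hypothetical mixed list of data sources.
sources = [
    'xgtd://workers.parquet',
    'xgt://local_part.parquet',
    'xgtd://persons.parquet',
    'https://example.com/remote_part.parquet',
]

groups = group_by_location(sources)
print(groups)
```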

2.4.3. Retrieving Frames

If a frame has been previously created and already exists on the server, the client can be used to retrieve a proxy object to that frame. This is done with get_frame().

Additionally, get_frames() can be used to list existing table, vertex, and/or edge frames in a running xGT instance or in a given namespace.

2.4.4. Dropping Frames

All frame types can be deleted using the client API calls drop_frame() and drop_frames(). Note that a frame cannot be dropped if doing so would create an invalid graph. To drop a vertex frame, it must not be the source or target of any edge frame, even if the frames are empty. Frame Drop provides additional information about dropping frames.
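The constraint on dropping vertex frames can be pictured as a dependency check. The sketch below is plain Python, not the xGT API: the function blocking_edge_frames and the dictionary describing the graph are hypothetical, with frame names borrowed from the earlier examples.

```python
def blocking_edge_frames(vertex_frame, edge_frames):
    """Return names of edge frames that use vertex_frame as an endpoint."""
    return [name for name, (source, target) in edge_frames.items()
            if vertex_frame in (source, target)]

# Hypothetical graph: edge frame name -> (source frame, target frame).
edge_frames = {
    'FriendsWith': ('Employees', 'Employees'),
    'WorksFor': ('Employees', 'Companies'),
}

# Companies is an endpoint of WorksFor, so it cannot be dropped yet.
print(blocking_edge_frames('Companies', edge_frames))

# Once WorksFor is gone, nothing refers to Companies any more.
del edge_frames['WorksFor']
print(blocking_edge_frames('Companies', edge_frames))
```

Under this picture, edge frames are dropped first and the vertex frames they connect are dropped afterwards.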

2.4.5. Modifying Frame Columns

Existing frames can be modified by adding, deleting, and reordering columns.

2.4.5.1. Appending Columns

The method append_columns() appends new columns to the end of a frame’s rows. It requires a schema for the new columns to be passed in. The schema gives the name and type information for each new column. Adding a column name that is already in the frame’s schema is not allowed and raises an exception. The entries of a new column are initialized to None.

Consider the Employees frame already stored in the variable person_frame. The below example adds the date column end_date and the text column position to the frame.

person_frame.append_columns([['end_date', xgt.DATE],
                             ['position', xgt.TEXT]])

The new schema for the frame would be

[['person_id', xgt.INT],
 ['name', xgt.TEXT],
 ['start_date', xgt.DATE],
 ['end_date', xgt.DATE],
 ['position', xgt.TEXT]]

2.4.5.2. Deleting Columns

Columns can be deleted from a frame using the method delete_columns(). The method has a single parameter of the columns to be deleted. The columns are given as a mixture of names and integer positions. The order the columns are given in doesn’t matter. Invalid column names and out-of-bounds column positions result in an exception being raised. Key columns of vertex frames and source and target key columns of edge frames can’t be deleted, and attempting to do so raises an exception.

Consider the Employees frame as modified above. Each of the following code snippets deletes the start_date and end_date columns from the frame.

person_frame.delete_columns(['end_date', 'start_date'])
person_frame.delete_columns([2, 3])
person_frame.delete_columns([2, 'end_date'])

The new schema for the frame would be

[['person_id', xgt.INT],
 ['name', xgt.TEXT],
 ['position', xgt.TEXT]]
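The rules for delete_columns() can be modeled on a schema held as a plain Python list. The sketch below is such a model, not the xGT implementation: the function delete_columns_model and its protected parameter are hypothetical, and type names are plain strings. It assumes that integer positions refer to the schema as it was before the call, which is consistent with the [2, 3] example above.

```python
def delete_columns_model(schema, columns, protected=()):
    """Return a new schema without the given columns (names or positions)."""
    names = [entry[0] for entry in schema]
    to_delete = set()
    for col in columns:
        if isinstance(col, int):
            if not 0 <= col < len(schema):
                raise IndexError(f'column position out of bounds: {col}')
            name = names[col]
        else:
            if col not in names:
                raise KeyError(f'invalid column name: {col}')
            name = col
        if name in protected:
            raise ValueError(f'key column cannot be deleted: {name}')
        to_delete.add(name)
    return [entry for entry in schema if entry[0] not in to_delete]

employees = [['person_id', 'INT'], ['name', 'TEXT'], ['start_date', 'DATE'],
             ['end_date', 'DATE'], ['position', 'TEXT']]

# All three forms from the text above give the same result.
by_name = delete_columns_model(employees, ['end_date', 'start_date'], ['person_id'])
by_position = delete_columns_model(employees, [2, 3], ['person_id'])
mixed = delete_columns_model(employees, [2, 'end_date'], ['person_id'])
print(by_name)
```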

2.4.5.3. Modifying All Columns

The frame method modify_columns() redefines the schema of a frame allowing one method call to add, delete, and reorder columns. Unlike append_columns(), it allows adding columns anywhere in the schema, not just appending them. The method has a single parameter of the columns defining the new schema of the frame. The columns are given as a mixture of names, integer positions, and schema entries. Columns in the current schema can be referred to by any of name, integer position, or schema entry. New columns can only be given by a schema entry. Columns in the current schema not included in the new columns are deleted.

Giving a schema entry that has the same name as an existing schema column but different type results in an exception being raised. Duplicating a column name is not allowed and raises an exception. Invalid column names and out-of-bounds column positions result in an exception being raised. Key columns of vertex frames and source and target key columns of edge frames can’t be deleted, and attempting to do so raises an exception.

Consider the Employees frame as originally defined. The following code adds the new column position after the name column.

person_frame.modify_columns([0, 1, ['position', xgt.TEXT], 2])

The new schema for the frame would be

[['person_id', xgt.INT],
 ['name', xgt.TEXT],
 ['position', xgt.TEXT],
 ['start_date', xgt.DATE]]

Consider the Employees frame as originally defined. The following code deletes the column name.

person_frame.modify_columns(['person_id', 'start_date'])

The new schema for the frame would be

[['person_id', xgt.INT],
 ['start_date', xgt.DATE]]

Consider the Employees frame as originally defined. The following code snippets all reorder the columns to the same new order.

person_frame.modify_columns([1, 2, 0])
person_frame.modify_columns(['name', 'start_date', 'person_id'])
person_frame.modify_columns(['name', ['start_date', xgt.DATE], 0])

The new schema for the frame would be

[['name', xgt.TEXT],
 ['start_date', xgt.DATE],
 ['person_id', xgt.INT]]

Consider the Employees frame as originally defined. The following code snippets all add the position column, delete the start_date column, and give the same new order for the columns.

person_frame.modify_columns(['name', ['position', xgt.TEXT], 'person_id'])
person_frame.modify_columns([['name', xgt.TEXT], ['position', xgt.TEXT], 0])
person_frame.modify_columns([1, ['position', xgt.TEXT], 'person_id'])

The new schema for the frame would be

[['name', xgt.TEXT],
 ['position', xgt.TEXT],
 ['person_id', xgt.INT]]
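The rules for modify_columns() can be summarized in a small plain-Python model. This is a sketch for reasoning about the examples above, not the xGT implementation: the function modify_columns_model and its protected parameter are hypothetical, type names are plain strings, and a schema entry for an existing column is compared exactly as written.

```python
def modify_columns_model(schema, columns, protected=()):
    """Return the new schema described by a mix of names, positions, entries."""
    existing = {entry[0]: list(entry) for entry in schema}
    new_schema = []
    seen = set()
    for col in columns:
        if isinstance(col, int):
            if not 0 <= col < len(schema):
                raise IndexError(f'column position out of bounds: {col}')
            entry = list(schema[col])
        elif isinstance(col, str):
            if col not in existing:
                raise KeyError(f'invalid column name: {col}')
            entry = list(existing[col])
        else:
            entry = list(col)
            # An entry naming an existing column must keep its type.
            if entry[0] in existing and entry != existing[entry[0]]:
                raise TypeError(f'type change not allowed: {entry[0]}')
        if entry[0] in seen:
            raise ValueError(f'duplicate column: {entry[0]}')
        seen.add(entry[0])
        new_schema.append(entry)
    # Columns left out are deleted, which is not allowed for key columns.
    for key in protected:
        if key not in seen:
            raise ValueError(f'key column cannot be deleted: {key}')
    return new_schema

original = [['person_id', 'INT'], ['name', 'TEXT'], ['start_date', 'DATE']]
inserted = modify_columns_model(original, [0, 1, ['position', 'TEXT'], 2])
reordered = modify_columns_model(original, ['name', ['start_date', 'DATE'], 0])
combined = modify_columns_model(original, [1, ['position', 'TEXT'], 'person_id'],
                                protected=['person_id'])
print(combined)
```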

2.4.5.4. Updating Columns

xGT allows the user to change the values on existing (or newly created) columns on a frame. The method update_columns() enables the user to update columns on a frame with data provided in Python.

The data to use for the updates can be Python native data, pandas data frame columns or pyarrow table columns. The updates are applied to the indicated columns in the stored order of the frame on the xGT server. The stored order of the frame is the same as that returned by the method get_data().

A simple example of updating columns on a frame is as follows:

person_frame.update_columns(['count'], [[i] for i in range(10)])

The call above sets the count column of the first 10 rows of person_frame to the values 0 through 9. If the frame has more than 10 rows, the subsequent rows are not updated.

Multiple columns can be updated at the same time:

person_frame.update_columns(['count', 'updated'], [[i, '2024-01-01'] for i in range(10)])

In this case, the first 10 rows of the frame would have the count and updated columns modified with the values in the example.

The update_columns() method also supports an offset parameter to specify the starting row on which to start the updates.

person_frame.update_columns(['count', 'updated'], [[i, '2024-01-01'] for i in range(10)], offset = 250)

In this case, the ten rows starting at row 250 will be updated with the provided data. As can be seen from the examples, the length of the provided data determines how many rows of the frame are updated.

If the offset is out-of-bounds with respect to the number of rows in the frame or if the length of the provided data exceeds the number of existing rows, an error will be reported.
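The row-wise behavior of an update with an offset can be modeled on an in-memory list of rows. The sketch below is plain Python, not the xGT implementation, and the function update_rows_model is hypothetical. How xGT treats data that starts in bounds but would run past the last row is not stated above; the model raises an error in that case, which is an assumption.

```python
def update_rows_model(rows, column_names, update_names, data, offset=0):
    """Overwrite the named columns of len(data) rows, starting at offset."""
    if not 0 <= offset < len(rows):
        raise IndexError('offset is out of bounds')
    if offset + len(data) > len(rows):
        # Assumption: data running past the last row is treated as an error.
        raise IndexError('data extends past the last row')
    positions = [column_names.index(name) for name in update_names]
    for row, values in zip(rows[offset:], data):
        for position, value in zip(positions, values):
            row[position] = value

# Hypothetical frame contents with two columns: id and count.
column_names = ['id', 'count']
rows = [[i, None] for i in range(5)]

# Update two rows starting at row 2; the other rows are left alone.
update_rows_model(rows, column_names, ['count'], [[10], [11]], offset=2)
print(rows)
```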