2.4. Frame Management
This section describes how to manage frames, which are used to store data in xGT. A detailed description of the frame data model is found in Frames and Namespaces.
Frames in xGT can be created directly by specifying the frame schema. In this case, frames are empty after creation and data must be loaded as a separate step. The API calls are create_vertex_frame(), create_edge_frame(), and create_table_frame().
Frames can also be created implicitly from a data source. In this case, the schema is inferred from the data, the frame is created, and the data is loaded in one step. The API calls are create_vertex_frame_from_data(), create_edge_frame_from_data(), and create_table_frame_from_data().
Frames can be retrieved by name. The API calls are get_frame() and get_frames().
Frames can be deleted by dropping them. The API calls are drop_frame() and drop_frames().
A frame’s schema can be modified by adding, deleting, or reordering columns. The API calls are append_columns(), delete_columns(), and modify_columns().
2.4.1. Direct Frame Creation
Frames in xGT can be created by the client API calls create_vertex_frame(), create_edge_frame(), and create_table_frame().
Each of these API calls requires passing in the frame name and schema.
If the name parameter is a fully qualified name, the frame is created in the namespace specified.
If it contains only the name of the frame, the frame is created in the default namespace.
For more information on namespaces, see Frames and Namespaces.
If a frame of the specified name already exists in the namespace, an exception is raised and the frame is not created again.
The schema describes the names and data types of each column of a frame.
These columns correspond to the properties of each element of the frame.
In the example below, the Employees frame has a schema with three columns, so every vertex in this frame has these three properties (a property value can be null unless its column is the key column):
import xgt

server = xgt.Connection()
person_frame = server.create_vertex_frame(name = 'Employees',
                                          schema = [['person_id', xgt.INT],
                                                    ['name', xgt.TEXT],
                                                    ['start_date', xgt.DATE]],
                                          key = 'person_id')
A list of supported data types is found in Data Movement.
To create a vertex frame, one column of the schema must be specified as the key with the key parameter; it is used to uniquely identify vertices in the frame.
To create an edge frame, additional parameters specifying the source and target vertices must be provided.
The source and target parameters indicate which vertex frames the edge frame is connecting.
Each of these is either the name of a vertex frame or a VertexFrame object.
They can be the same or different frames.
The source_key and target_key parameters indicate which columns of the edge frame’s schema are used to identify the source vertex and target vertex, respectively, of the edge.
For each row of the edge frame, the value assigned to the source_key column corresponds to the key column of a row in the vertex frame given by the source parameter.
Similarly, for each row of the edge frame, the value assigned to the target_key column corresponds to the key column of a row in the vertex frame given by the target parameter.
The remaining columns are properties of the edge.
2.4.1.1. Example
import xgt

server = xgt.Connection()
person_frame = server.create_vertex_frame(name = 'Employees',
                                          schema = [['person_id', xgt.INT],
                                                    ['name', xgt.TEXT],
                                                    ['start_date', xgt.DATE]],
                                          key = 'person_id')
company_frame = server.create_vertex_frame(name = 'Companies',
                                           schema = [['company_id', xgt.INT],
                                                     ['name', xgt.TEXT],
                                                     ['profit', xgt.INT]],
                                           key = 'company_id')
friend_frame = server.create_edge_frame(name = 'FriendsWith',
                                        source = 'Employees',
                                        target = 'Employees',
                                        schema = [['source_id', xgt.INT],
                                                  ['target_id', xgt.INT]],
                                        source_key = 'source_id',
                                        target_key = 'target_id')
work_frame = server.create_edge_frame(name = 'WorksFor',
                                      source = 'Employees',
                                      target = 'Companies',
                                      schema = [['source_id', xgt.INT],
                                                ['target_id', xgt.INT],
                                                ['position', xgt.TEXT],
                                                ['years', xgt.INT]],
                                      source_key = 'source_id',
                                      target_key = 'target_id')
The frame Employees contains vertices representing people.
The key column person_id uniquely identifies each vertex, while the columns name and start_date give additional information, which need not be unique.
Note that because the frame schema of Employees has three columns, each vertex within it must have three properties, one of them the unique key.
The Companies frame contains vertices representing companies, with the company_id column uniquely identifying each vertex.
Frames FriendsWith and WorksFor define two edge frames, each connecting already defined vertex frames.
With direct frame creation using create_edge_frame(), the source and target frames must be created before an edge frame is created.
Note that an edge frame may connect vertices from the same vertex frame, as FriendsWith does, or may connect vertices from two different frames, as WorksFor does.
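The way an edge frame's source_key and target_key values line up with the key columns of its endpoint vertex frames can be pictured with plain Python data. The sketch below does not use the xGT client; the sample rows and the helper dangling_edges are hypothetical and only illustrate which values have to match.

```python
# Plain-Python illustration; no xGT server or client is involved.
# Hypothetical sample rows shaped like the Employees, Companies, and WorksFor schemas.
employees = [(1, 'Ann', '2020-01-06'), (2, 'Bob', '2021-03-15')]  # key: person_id
companies = [(10, 'Acme', 500)]                                   # key: company_id
works_for = [(1, 10, 'engineer', 3), (2, 10, 'analyst', 1), (3, 10, 'intern', 0)]

def dangling_edges(edges, source_rows, target_rows):
    """Return edges whose source or target value matches no vertex key.

    Assumes the key is the first column of each vertex row and that the
    source and target key values are the first two columns of each edge row.
    """
    source_keys = {row[0] for row in source_rows}
    target_keys = {row[0] for row in target_rows}
    return [e for e in edges if e[0] not in source_keys or e[1] not in target_keys]

unmatched = dangling_edges(works_for, employees, companies)
print(unmatched)
```

Here the third WorksFor row refers to person_id 3, which has no matching row in the sample Employees data.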
2.4.1.2. List Columns
Lists are supported for schema columns in all types of frames. A list column is specified in a schema with an entry of the following form:
['<column name>', xgt.LIST, <base type>, <depth>]
The column name can be any valid xGT column name.
The type of the column must be xgt.LIST, with the base type corresponding to any of xGT’s non-list types.
The depth specifies how many levels of nesting are to be used.
The default depth is 1, so specifying a simple list of integers can be done as follows:
['integer_list', xgt.LIST, xgt.INT]
A list of lists of integers is specified as:
['nested_integer_list', xgt.LIST, xgt.INT, 2]
List columns cannot be used as key columns for vertex frames and edge frames.
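As a rough illustration of how the depth relates to the shape of the values, the sketch below builds list-column schema entries and measures the nesting of a Python value. It runs without the xGT client: LIST and INT are stand-in strings for xgt.LIST and xgt.INT, and both helper functions are hypothetical.

```python
# Stand-ins for xgt.LIST and xgt.INT so the sketch runs without the xGT client.
LIST, INT = 'list', 'int'

def list_column(name, base_type, depth=1):
    """Build a schema entry for a list column; the default depth of 1 is left implicit."""
    if depth < 1:
        raise ValueError('depth must be at least 1')
    if depth == 1:
        return [name, LIST, base_type]
    return [name, LIST, base_type, depth]

def nesting_depth(value):
    """Nesting level of a value: 0 for a scalar, 1 for [1, 2], 2 for [[1], [2]]."""
    depth = 0
    while isinstance(value, list):
        depth += 1
        value = value[0] if value else None
    return depth

simple = list_column('integer_list', INT)
nested = list_column('nested_integer_list', INT, 2)
print(simple, nested, nesting_depth([[1, 2], [3]]))
```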
2.4.2. Implicit Frame Creation
xGT supports implicit frame creation from input data through the methods create_vertex_frame_from_data(), create_edge_frame_from_data(), and create_table_frame_from_data().
These methods provide an easier way to create graphs in xGT by combining the frame creation and data loading steps and automatically inferring the schema of the frame.
If the name parameter is a fully qualified name, the frame is created in the namespace specified.
Otherwise, it is created in the default namespace.
For more information on namespaces, see Frames and Namespaces.
If a frame of the specified name already exists in the namespace, an exception is raised and the frame is not created again.
2.4.2.1. Implicit Vertex Frame Creation
To create a vertex frame from data using create_vertex_frame_from_data(), the data source, frame name, and key name must be passed in.
The key parameter is the name of the column that contains the unique key identifying each vertex.
If the data source is a CSV file with no header, then the key parameter should be an integer representing the position of the key column.
The example below shows creating a vertex frame named my_vertex from a pyarrow table, which must contain a column named id.
a_frame = conn.create_vertex_frame_from_data(pytab, name = 'my_vertex', key = 'id')
2.4.2.2. Implicit Edge Frame Creation
To create an edge frame from data using create_edge_frame_from_data(), the data, frame name, source vertex frame name, target vertex frame name, source key, and target key must be passed in.
The source and target parameters indicate which vertex frames the edge frame is connecting.
Each of these is either the name of a vertex frame or a VertexFrame object.
If either endpoint vertex frame does not already exist, it is created with a single column named “id”.
Otherwise, the existing vertex frame is used if its schema is compatible.
The source_key and target_key parameters indicate which columns of the edge frame’s schema are used to identify the source vertex and target vertex, respectively, of the edge.
These parameters should be string column names or, in the case of a CSV file with no header, integer column positions.
If the source key or target key columns of the data refer to any vertices not already in the source or target vertex frames, these vertices are implicitly inserted into the vertex frames.
The example below shows creating an edge frame named “WorksFor” along with its two endpoint vertex frames.
The source vertex frame named Employees is created beforehand, but the target vertex frame named Companies is automatically created.
For each edge, the column of data.csv named employee_id identifies the source vertex in the Employees vertex frame and the column of data.csv named department_id identifies the target vertex in the Companies vertex frame.
conn.create_vertex_frame_from_data('employees.csv', name = 'Employees', key = 'person_id')
conn.create_edge_frame_from_data('data.csv', name = 'WorksFor',
                                 source = 'Employees', target = 'Companies',
                                 source_key = 'employee_id', target_key = 'department_id')
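The implicit insertion of endpoint vertices described above can be modeled with plain Python sets. This is only a sketch of the described behavior, not the xGT implementation; the function load_edges and the sample keys are hypothetical.

```python
# Plain-Python model of the behavior described above; not the xGT implementation.
def load_edges(edge_rows, source_vertices, target_vertices):
    """Add any endpoint keys missing from the vertex key sets, then return the edges."""
    for source_id, target_id in edge_rows:
        source_vertices.add(source_id)  # no-op when the vertex already exists
        target_vertices.add(target_id)
    return list(edge_rows)

employees = {1, 2}   # keys already present in the Employees frame
companies = set()    # Companies did not exist, so it starts out empty
edges = load_edges([(1, 10), (2, 10), (3, 20)], employees, companies)
print(sorted(employees), sorted(companies))
```

After the load, the source set contains the new key 3 and the target set contains every company key that appeared in the edge data.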
2.4.2.3. Implicit Table Frame Creation
To create a table frame from data using create_table_frame_from_data(), the data source and table frame name must be passed in.
The example below shows creating a table frame named my_table from a Parquet file on the server filesystem.
a_frame = conn.create_table_frame_from_data('xgtd://data.parquet', name = 'my_table')
2.4.2.4. Schema Inference
The methods create_vertex_frame_from_data(), create_edge_frame_from_data(), and create_table_frame_from_data() allow the xGT frame schema, including the column names and data types, to be automatically inferred from the source data.
The schema inference can also be done directly with get_schema_from_data().
This returns an xGT frame schema from the data without creating any frame.
This method can be used to see what schema would be used before creating a graph frame from the data.
It also allows the user to adjust the schema before it is used to create a frame.
While xGT will try to infer data types, users who want more control over specifying the data types or column names of the schema can pass in a schema.
The example below shows using get_schema_from_data() to get a schema from the data, modifying its third column, and then passing the schema into create_vertex_frame_from_data().
The vertex frame is then created with a schema that includes a column of type IPADDRESS named “source_ip”.
# Get a schema from a CSV file.
my_schema = conn.get_schema_from_data('data.csv')
# Change the name of the third column in the schema.
my_schema[2][0] = 'source_ip'
# Change the data type of the third column of the schema.
my_schema[2][1] = xgt.IPADDRESS
# Create a frame from the data, but pass in the manually adjusted schema.
a_frame = conn.create_vertex_frame_from_data('data.csv', name = 'v', key = 'source_ip',
                                             schema = my_schema)
Any schema can be passed in as long as it is compatible with the data source.
It does not have to be a schema returned by get_schema_from_data().
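The example above adjusts the schema by position. When the column order is not known in advance, a small helper that looks the column up by name may be more convenient. The sketch below assumes only that a schema is a list of [name, type, ...] entries, as in the example; override_column is a hypothetical helper, and the type strings stand in for constants such as xgt.TEXT and xgt.IPADDRESS.

```python
def override_column(schema, old_name, new_name=None, new_type=None):
    """Return a copy of schema with one column renamed and/or retyped."""
    if old_name not in [column[0] for column in schema]:
        raise KeyError(old_name)
    result = []
    for column in schema:
        column = list(column)  # copy so the caller's schema is left untouched
        if column[0] == old_name:
            column[0] = new_name or old_name
            if new_type is not None:
                column[1] = new_type
        result.append(column)
    return result

# Hypothetical schema as it might be inferred from a header-less CSV file.
inferred = [['f0', 'int'], ['f1', 'text'], ['f2', 'text']]
adjusted = override_column(inferred, 'f2', new_name='source_ip', new_type='ipaddress')
print(adjusted)
```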
Scalar and list data types are supported in the input data for automatic schema inference. The following xGT data types can be automatically inferred from data:
BOOLEAN
INT
UINT
FLOAT
DATE
TIME
DATETIME
DURATION
TEXT
Nested lists of any of these types.
Schema inference fails and raises an XgtTypeError if the data contains any unsupported types.
2.4.2.5. Supported Data Sources
Frames can be automatically created from the following data sources:
Arrow Table
Pandas DataFrame
CSV file(s)
Parquet file(s)
Arrow Tables
To create an xGT frame from an Arrow Table, use pyarrow.
The data_source parameter must be a pyarrow Table.
The pyarrow Table schema must contain only the following data types:
boolean: maps to xGT BOOLEAN
signed integer: maps to xGT INT
unsigned integer: maps to xGT UINT
float32: maps to xGT FLOAT
float64: maps to xGT FLOAT
decimal128: maps to xGT FLOAT
decimal256: maps to xGT FLOAT
time32: maps to xGT TIME
time64: maps to xGT TIME
date32: maps to xGT DATE
date64: maps to xGT DATETIME
timestamp: maps to xGT DATETIME
duration: maps to xGT DURATION
string: maps to xGT TEXT
list: maps to an xGT list of the appropriate underlying type.
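The mapping above can be written down as a lookup table, which is handy for checking a column list before attempting to create a frame. The sketch below is plain Python: the keys are the descriptive category names used in this section rather than pyarrow type objects, lists are left out because they map through their element type, and unsupported_columns is a hypothetical helper.

```python
# Arrow type category (as named in this section) -> xGT type name.
ARROW_TO_XGT = {
    'boolean': 'BOOLEAN',
    'signed integer': 'INT',
    'unsigned integer': 'UINT',
    'float32': 'FLOAT',
    'float64': 'FLOAT',
    'decimal128': 'FLOAT',
    'decimal256': 'FLOAT',
    'time32': 'TIME',
    'time64': 'TIME',
    'date32': 'DATE',
    'date64': 'DATETIME',
    'timestamp': 'DATETIME',
    'duration': 'DURATION',
    'string': 'TEXT',
}

def unsupported_columns(columns):
    """Return the names of columns whose Arrow category has no xGT mapping."""
    return [name for name, category in columns if category not in ARROW_TO_XGT]

# Hypothetical column list; 'binary' is not among the supported categories.
bad = unsupported_columns([('id', 'signed integer'), ('when', 'date64'), ('blob', 'binary')])
print(bad)
```

Note that date32 and date64 map to different xGT types: DATE and DATETIME, respectively.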
Pandas DataFrames
To create an xGT frame from pandas, the data_source parameter must be a pandas DataFrame.
The data types are first translated to pyarrow types, which then map to xGT types as described above.
CSV files
To create an xGT frame from a CSV file, the data_source parameter should be a string with the file path.
It can also be a list of file names, possibly with wildcards in them.
The files can be on the file system local to the client, local to the server, or on the web.
As described in Data Movement, a file on the server filesystem is indicated with the xgtd:// protocol.
Files on the web are accessed with standard web addressing (a URL), such as https:// or s3://.
When creating from a CSV file, the delimiter and header_mode parameters can be passed in to control how the file is parsed.
Note that if the CSV file has no header, the columns of the inferred schema are named “f0”, “f1”, etc. by default.
Otherwise, the header is used to assign schema column names.
The data in the file is first translated to pyarrow types, which then map to xGT types as described above.
CSV files with non-uniform columns aren’t supported.
List support for CSV files is limited to lists of depth one. That is, lists that have a scalar type as their immediate child element.
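The default naming for header-less CSV files can be reproduced with the standard csv module. This sketch only mimics the naming rule stated above and does not call xGT; the function column_names and its has_header flag are hypothetical and are not the header_mode parameter.

```python
import csv
import io

def column_names(csv_text, has_header, delimiter=','):
    """Return the header row, or f0, f1, ... when the file has no header."""
    first_row = next(csv.reader(io.StringIO(csv_text), delimiter=delimiter))
    if has_header:
        return first_row
    return [f'f{i}' for i in range(len(first_row))]

no_header = column_names('1,Ann,2020-01-06\n2,Bob,2021-03-15\n', has_header=False)
with_header = column_names('person_id,name\n1,Ann\n', has_header=True)
print(no_header, with_header)
```

With such a file, a key column is then referred to either by one of these generated names in an explicit schema or, as described for the key parameter, by its integer position.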
Parquet files
To create an xGT frame from a Parquet file, the data_source parameter should be a string with the file path.
It can also be a list of Parquet file names, possibly with wildcards in them.
The files can be on the file system local to the client, local to the server, or on the web.
A file on the server filesystem is indicated with the xgtd:// protocol.
Files on the web are accessed with standard web addressing (a URL), such as https:// or s3://.
For any additional restrictions on loading into xGT from a Parquet file, see Loading Parquet Files.
The data in the file is first translated to pyarrow types, which then map to xGT types as described above.
Wildcards for file names
Wildcards are supported for file names in both the xgt:// and xgtd:// protocols as described in section Getting Data into xGT.
Wildcards are expanded using the normal filesystem expansion rules of the respective operating systems of the client and server machines.
An example of a wildcard in a file name is xgt://mylocal.csv.*, which expands to the files whose names start with mylocal.csv. in the directory from which the client script is run.
Note that in the case of wildcard expansion, all files must have the same column names if column mapping is used.
They should also have the same number of columns.
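To get a feel for which files a pattern would select, the matching can be approximated with Python's fnmatch module. This is only an approximation under stated assumptions: the actual expansion is done by xGT using the filesystem rules of the client or server machine, and the expand helper and file names below are hypothetical.

```python
from fnmatch import fnmatchcase

def expand(pattern, available_files, protocol='xgt://'):
    """Return the available file names matched by a protocol-prefixed wildcard pattern."""
    if not pattern.startswith(protocol):
        raise ValueError(f'expected a {protocol} path')
    glob = pattern[len(protocol):]
    return sorted(name for name in available_files if fnmatchcase(name, glob))

# Hypothetical directory listing on the client machine.
files = ['mylocal.csv.0', 'mylocal.csv.1', 'mylocal.csv', 'other.csv.0']
matches = expand('xgt://mylocal.csv.*', files)
print(matches)
```

The file named mylocal.csv is not selected because the pattern requires the trailing dot of the prefix mylocal.csv. to be present.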
Lists of file names
Lists of file names are supported as a data source on client, server, and the web.
An example of this would be [ xgtd://workers.parquet, xgtd://persons.parquet ].
All files in the list must have the same column names if column mapping is used.
They should also have the same number of columns.
Wildcards can be used in the list elements as well: [ xgtd://myfile.parquet, xgtd://mysource.parquet.* ].
Mixed client, server, or web file locations are supported, but will run as separate transactions.
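To see how a mixed list would split up, the entries can be grouped by their location prefix. The sketch below is a plain-Python illustration, not xGT's own logic: group_by_location is a hypothetical helper, only the prefixes named in this section are recognized, and treating every other path as client-side is an assumption.

```python
def group_by_location(paths):
    """Group file paths by where they live: client, server, or web."""
    groups = {'client': [], 'server': [], 'web': []}
    for path in paths:
        if path.startswith('xgtd://'):
            groups['server'].append(path)
        elif path.startswith(('https://', 's3://')):
            groups['web'].append(path)
        else:
            # Assumption: xgt:// paths and bare paths refer to the client filesystem.
            groups['client'].append(path)
    return {location: files for location, files in groups.items() if files}

grouped = group_by_location(['xgtd://a.parquet', 'xgt://b.parquet',
                             'https://example.com/c.parquet', 'xgtd://d.parquet'])
print(grouped)
```

A list that yields more than one group here is a mixed-location list in the sense described above.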
2.4.3. Retrieving Frames
If a frame has been previously created and already exists on the server, the client can be used to retrieve a proxy object to that frame.
This is done with get_frame().
Additionally, get_frames() can be used to list existing table, vertex, and/or edge frames in a running xGT instance or in a given namespace.
2.4.4. Dropping Frames
All frame types can be deleted using the client API calls drop_frame() and drop_frames().
Note that a frame cannot be dropped if doing so creates an invalid graph.
In order to drop a vertex frame, it must not be the source or target of any edge frame.
This is the case even if the frames are empty.
Frame Drop provides additional information about dropping frames.
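The constraint on vertex frames implies an ordering when frames are dropped one at a time: edge frames first, then the vertex frames they connect. The sketch below models that rule with plain Python data and does not call xGT; both helpers are hypothetical, and the frame names come from the example in Direct Frame Creation.

```python
# Edge frame name -> (source vertex frame, target vertex frame).
edge_frames = {'FriendsWith': ('Employees', 'Employees'),
               'WorksFor': ('Employees', 'Companies')}

def blocking_edge_frames(vertex_frame, edge_frames):
    """Edge frames that use vertex_frame as source or target and so block dropping it."""
    return sorted(name for name, (source, target) in edge_frames.items()
                  if vertex_frame in (source, target))

def safe_drop_order(vertex_frames, edge_frames):
    """Drop every edge frame before any vertex frame."""
    return sorted(edge_frames) + sorted(vertex_frames)

blockers = blocking_edge_frames('Companies', edge_frames)
order = safe_drop_order(['Employees', 'Companies'], edge_frames)
print(blockers, order)
```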
2.4.5. Modifying Frame Columns
Existing frames can be modified by adding, deleting, and reordering columns.
2.4.5.1. Appending Columns
The method append_columns() appends new columns to the end of a frame’s rows.
It requires a schema for the new columns to be passed in.
The schema gives the name and type information for each new column.
Adding a column name that is already in the frame’s schema is not allowed and raises an exception.
The entries of a new column are initialized to None.
Consider the Employees frame already stored in the variable person_frame.
The example below adds the date column end_date and the text column position to the frame.
person_frame.append_columns([['end_date', xgt.DATE],
                             ['position', xgt.TEXT]])
The new schema for the frame would be
[['person_id', xgt.INT],
['name', xgt.TEXT],
['start_date', xgt.DATE],
['end_date', xgt.DATE],
['position', xgt.TEXT]]
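The effect of appending columns on both the schema and the existing rows can be modeled in plain Python. This is a sketch of the rules stated above, not the xGT implementation: the two functions are hypothetical, the type strings stand in for xgt constants, and ValueError stands in for whatever exception xGT raises.

```python
def append_columns(schema, new_columns):
    """Return schema plus new_columns, rejecting names that already exist."""
    names = {column[0] for column in schema}
    for column in new_columns:
        if column[0] in names:
            raise ValueError(f'column {column[0]!r} already exists')
        names.add(column[0])
    return [list(c) for c in schema] + [list(c) for c in new_columns]

def widen_rows(rows, added):
    """Existing rows get None for each newly appended column."""
    return [tuple(row) + (None,) * added for row in rows]

schema = [['person_id', 'int'], ['name', 'text'], ['start_date', 'date']]
new_schema = append_columns(schema, [['end_date', 'date'], ['position', 'text']])
rows = widen_rows([(1, 'Ann', '2020-01-06')], added=2)
print(new_schema, rows)
```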
2.4.5.2. Deleting Columns
Columns can be deleted from a frame using the method delete_columns().
The method has a single parameter: the columns to be deleted.
The columns can be given as any mixture of names and integer positions.
The order in which the columns are given doesn’t matter.
Invalid column names and out-of-bounds column positions result in an exception being raised.
Key columns of vertex frames and source and target key columns of edge frames can’t be deleted, and attempting to do so raises an exception.
Consider the Employees frame as modified above.
Each of the following code snippets deletes the start_date and end_date columns from the frame.
person_frame.delete_columns(['end_date', 'start_date'])
person_frame.delete_columns([2, 3])
person_frame.delete_columns([2, 'end_date'])
The new schema for the frame would be
[['person_id', xgt.INT],
['name', xgt.TEXT],
['position', xgt.TEXT]]
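The way names and positions resolve to the same set of columns can be checked with a small plain-Python model. It follows the rules stated in this subsection but is not the xGT implementation: the function and its protected argument are hypothetical, the type strings stand in for xgt constants, and the exception types are stand-ins.

```python
def delete_columns(schema, columns, protected=()):
    """Return schema without the columns given by name or integer position."""
    names = [column[0] for column in schema]
    doomed = set()
    for ref in columns:
        if isinstance(ref, int):
            if not 0 <= ref < len(schema):
                raise IndexError(f'column position {ref} is out of bounds')
            position = ref
        else:
            if ref not in names:
                raise KeyError(ref)
            position = names.index(ref)
        if names[position] in protected:
            raise ValueError(f'key column {names[position]!r} cannot be deleted')
        doomed.add(position)
    return [list(c) for i, c in enumerate(schema) if i not in doomed]

schema = [['person_id', 'int'], ['name', 'text'], ['start_date', 'date'],
          ['end_date', 'date'], ['position', 'text']]
by_name = delete_columns(schema, ['end_date', 'start_date'], protected=('person_id',))
by_position = delete_columns(schema, [2, 3], protected=('person_id',))
mixed = delete_columns(schema, [2, 'end_date'], protected=('person_id',))
print(by_name == by_position == mixed, by_name)
```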
2.4.5.3. Modifying All Columns
The frame method modify_columns() redefines the schema of a frame, allowing one method call to add, delete, and reorder columns.
Unlike append_columns(), it allows adding columns anywhere in the schema, not just appending them.
The method has a single parameter: the columns defining the new schema of the frame.
The columns are given as a mixture of names, integer positions, and schema entries.
Columns in the current schema can be referred to by any of name, integer position, or schema entry.
New columns can only be given by a schema entry.
Columns in the current schema not included in the new columns are deleted.
Giving a schema entry that has the same name as an existing schema column but a different type results in an exception being raised.
Duplicating a column name is not allowed and raises an exception.
Invalid column names and out-of-bounds column positions result in an exception being raised.
Key columns of vertex frames and source and target key columns of edge frames can’t be deleted, and attempting to do so raises an exception.
Consider the Employees frame as originally defined.
The following code adds the new column position after the name column.
person_frame.modify_columns([0, 1, ['position', xgt.TEXT], 2])
The new schema for the frame would be
[['person_id', xgt.INT],
['name', xgt.TEXT],
['position', xgt.TEXT],
['start_date', xgt.DATE]]
Consider the Employees frame as originally defined.
The following code deletes the column name.
person_frame.modify_columns(['person_id', 'start_date'])
The new schema for the frame would be
[['person_id', xgt.INT],
['start_date', xgt.DATE]]
Consider the Employees frame as originally defined.
The following code snippets all reorder the columns to the same new order.
person_frame.modify_columns([1, 2, 0])
person_frame.modify_columns(['name', 'start_date', 'person_id'])
person_frame.modify_columns(['name', ['start_date', xgt.DATE], 0])
The new schema for the frame would be
[['name', xgt.TEXT],
['start_date', xgt.DATE],
['person_id', xgt.INT]]
Consider the Employees frame as originally defined.
The following code snippets all add the position column, delete the start_date column, and give the same new order for the columns.
person_frame.modify_columns(['name', ['position', xgt.TEXT], 'person_id'])
person_frame.modify_columns([['name', xgt.TEXT], ['position', xgt.TEXT], 0])
person_frame.modify_columns([1, ['position', xgt.TEXT], 'person_id'])
The new schema for the frame would be
[['name', xgt.TEXT],
['position', xgt.TEXT],
['person_id', xgt.INT]]
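The resolution rules for modify_columns() can be summarized in a plain-Python model, which makes it easy to check that differently written column lists produce the same schema. This is a sketch of the rules described in this subsection, not the xGT implementation: the function and its protected argument are hypothetical, the type strings stand in for xgt constants, and the exception types are stand-ins.

```python
def modify_columns(schema, columns, protected=()):
    """Build a new schema from names, integer positions, and schema entries."""
    names = [column[0] for column in schema]
    result = []
    for ref in columns:
        if isinstance(ref, int):
            if not 0 <= ref < len(schema):
                raise IndexError(f'column position {ref} is out of bounds')
            entry = list(schema[ref])
        elif isinstance(ref, str):
            if ref not in names:
                raise KeyError(ref)
            entry = list(schema[names.index(ref)])
        else:
            # A schema entry: either a new column or an exact match of an existing one.
            entry = list(ref)
            if entry[0] in names and entry != list(schema[names.index(entry[0])]):
                raise TypeError(f'column {entry[0]!r} exists with a different type')
        if entry[0] in [column[0] for column in result]:
            raise ValueError(f'column {entry[0]!r} is duplicated')
        result.append(entry)
    kept = {column[0] for column in result}
    for name in protected:
        if name not in kept:
            raise ValueError(f'key column {name!r} cannot be deleted')
    return result

employees = [['person_id', 'int'], ['name', 'text'], ['start_date', 'date']]
a = modify_columns(employees, ['name', ['position', 'text'], 'person_id'], ('person_id',))
b = modify_columns(employees, [['name', 'text'], ['position', 'text'], 0], ('person_id',))
c = modify_columns(employees, [1, ['position', 'text'], 'person_id'], ('person_id',))
print(a == b == c, a)
```

Columns of the current schema that are not mentioned, such as start_date here, are simply absent from the result, which corresponds to their deletion.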