2.20. TQL Fragments¶
TQL fragments provide a mechanism to evaluate TQL expressions as part of another operation.
load()
and insert()
use TQL fragments to transform raw input data and map input columns to frame columns using the input_filter
parameter.
The fragment must be passed as a Python string:
frame.load(file, input_filter = '<fragment code>')
frame.insert([ [ 0, 1 ], [ 1, 2 ]], input_filter = '<fragment code>')
A TQL fragment supports a subset of TQL’s clauses, expressions and functions that can be evaluated in a stateless manner.
A TQL fragment can only contain the clauses WHERE
and RETURN
to support filtering and output mapping respectively.
Other TQL clauses are not allowed in a fragment.
For input operations, a TQL fragments is applied independently to each row of the input data to produce the final output data stored in the frame.
The example below shows a simple TQL fragment that filters raw input data based on the value of one of the columns and computes the output columns of the target frame. The first column has ten added to it, while the second column is passed through unmodified.
WHERE a.value > 0
RETURN a.value + 10, a.text
The input to a TQL fragment is the raw data coming from the input operation (file or Python client data) organized in a single row. All columns present in the input are available to the fragment for evaluation. A schema is derived for the raw input data using type inference to map the input data types to xGT’s types. Column names are automatically computed from the schema.
TQL requires a bound variable to refer to columns that are part of the row.
TQL fragments can use any name for this implied bound variable, but only a single name is allowed.
In the example above, the bound variable a
is the only name allowed to refer to the input row.
An expression b.property
in the same TQL fragment will produce an error as it refers to a different bound variable.
2.20.1. Interaction with Different Data Sources¶
TQL fragments are supported for all of xGT’s input data sources:
Parquet files.
CSV files with and without headers.
Client data as Python list of lists, pandas DataFrames, or pyarrow Tables.
While the behavior of TQL fragments is uniform across all input sources, there are slight differences between the input sources with respect to schema data types and column names.
2.20.1.1. Parquet Files¶
Parquet files are the simplest case, as they contain a schema as part of their metadata. The column names contained in the file’s schema are used to refer to the input data’s columns. All xGT data types are inferred automatically from the native Parquet data types except for the IPADDRESS type. Columns that potentially contain IP address data are presented to the fragment as strings (or lists of strings for list types) and the TQL fragment must treat them as such. xGT supports passing those strings directly to the output frame’s IP address column if no processing is required.
frame = conn.create_table_frame(
name = 'graph__Table',
schema = [['col0', xgt.INT],
['col1', xgt.IPADDRESS]])
WHERE input.value > 0
RETURN input.value + 10, input.ipaddress_as_string
2.20.1.2. CSV Files¶
To process CSV files using a TQL fragment, xGT automatically infers a schema from the raw data. All xGT types (including IP addresses and lists of IP addresses) are inferred from the CSV input.
CSV files that include headers will use the column names declared in the header as the column names available in the TQL fragment.
CSV files without headers will use automatically generated column names starting with the prefix f
.
This code shows the frame’s schema as declared on creation.
frame = conn.create_table_frame(
name = 'graph__Table',
schema = [['column_a', xgt.INT],
['column_b', xgt.IPADDRESS]])
The raw CSV input data is:
0, 128.0.0.1
1, 128.0.0.2
2, 128.0.0.3
Because there are no headers in the CSV file, the inferred schema is:
schema = [['f0', xgt.INT],
['f1', xgt.IPADDRESS]]
The fragment must be written using those column names:
WHERE b.f0 > 0
RETURN b.f0 + 10, b.f1
2.20.1.3. Python Data¶
Python data is treated in a similar manner to Parquet data but without column names (as there is no schema metadata). Columns that contain IP address data will be inferred as columns of strings. Column names are automatically generated in the same manner as for CSV files without headers.
The code below shows the frame’s schema as declared on creation.
frame = conn.create_table_frame(
name = 'graph__Table',
schema = [['column_a', xgt.INT],
['column_b', xgt.IPADDRESS],
['column_c', xgt.LIST, xgt.TEXT]])
The raw Python input data is:
[[ 1, 128.0.0.1, [ "one" ]],
[ 2, 128.0.0.2, [ "one", "two" ]],
[ 3, 128.0.0.3, [ "one", "two", "three" ]]]
The inferred schema is:
schema = [['f0', xgt.INT],
['f1', xgt.TEXT],
['f2', xgt.LIST, xgt.TEXT]]
The fragment must be written using those column names:
WHERE python_data.f0 > 0
RETURN python_data.f0 + 10, python_data.f1, tail(python_data.f2)
The resulting data inserted into the frame is:
11, 128.0.0.1, [ ]
12, 128.0.0.2, [ "two" ]
13, 128.0.0.3, [ "two", "three" ]
Note that this example also illustrates how to process columns that contain lists. In this case, lists of strings.
2.20.2. Requirements for Output Columns¶
The RETURN clause in a TQL fragment specifies how the target frame’s columns will be populated for each row of raw input. The RETURN clause must have one entry per target frame column. Each entry must be of a type that is compatible with the exact type of the column via text encoding (see Encoding Text to Types).
If an output column does not matter or the fragment cannot compute it, then use the value NULL
for that column.
The exception is that key columns in vertex and edge frames cannot be NULL and must have a value.
Consider the following frame schema:
frame = conn.create_vertex_frame(
name = 'graph__Table',
schema = [['col0', xgt.INT],
['col1', xgt.IPADDRESS],
['col2', xgt.LIST, xgt.TEXT]],
key = 'col0')
Some examples of acceptable fragment output for this frame are:
RETURN a.f0, a.f1, a.f2
RETURN a.f0 + 2, "128.0.0.1", NULL
RETURN a.f0 / 4, a.f2, [ "one", "two" ]
2.20.3. Putting It All Together¶
Here is a full example of using TQL fragments to process input data as it is being inserted into a target frame.
Start with creating an empty vertex frame:
frame = conn.create_vertex_frame(
name = 'Systems',
schema = [['id', xgt.INT],
['ipaddr', xgt.IPADDRESS],
['users', xgt.LIST, xgt.TEXT]],
key = 'id')
We want to use IPv6 addresses for the entries in the Systems
vertex frame.
The source data has two 64-bit integer columns that will be used to construct the IPv6 address:
data = [[ 0, 1, 1, [ "alice" ] ],
[ 1, 2, 2, [ "alice", "bob" ] ],
[ 2, 3, 3, [ "bob", "charlie", "diana" ] ]]
This code inserts the data into the frame and process it into its target form using a TQL fragment:
frame.insert(data, input_filter = """
WHERE size(input.f3) > 1
RETURN input.f0, ipaddress(input.f1, input.f2), tail(input.f3)
""")
The resulting entries in the Systems
frame are:
1, ::2:0:0:0:2, [ "bob" ]
2, ::3:0:0:0:3, [ "charlie", "diana" ]
This example created rows with IPv6 addresses from the numeric entries as well as the tail of the input lists.
Consider another example using the same target frame and a CSV file with headers:
machine_id, ipaddr, users
10, '255.255.255.0', [ "u" ]
11, '255.255.255.1', [ "u", "v", "w" ]
12, '255.255.255.2', [ "x", "y", "z" ]
13, '255.255.255.3', [ "u", "a" ]
This code loads the data into the frame and filters it using a TQL fragment:
frame.load('xgtd://file.csv', input_filter = """
WHERE "u" IN input.users
RETURN input.machine_id, input.ipaddr, input.users
""")
The resulting entries in the Systems
frame are:
10, 255.255.255.0, [ "u" ]
11, 255.255.255.1, [ "u", "v", "w" ]
13, 255.255.255.3, [ "u", "a" ]