2.3. Enterprise Data Movement
Moving data to or from a system other than the one running the xGT server can be an important component of an enterprise solution that uses the xGT graph product. One example involves running data cleansing and transformation on a scale-out cluster prior to bringing that data into xGT. Another example is pulling data from repositories, such as a data lake, where the user selects a specific slice of data to be analyzed with xGT.
Whenever the xGT server moves data between the filesystem and RAM, the location on the filesystem must be within the sandbox. The sandbox has a root directory and comprises all files in that directory and all of its subdirectories. A configuration setting controls whether to follow symbolic links (which could point to locations outside of the sandbox). Configuring the sandbox is described in Configuring the Data Sandbox.
2.3.1. Common Usage Patterns
For purposes of these examples, we will use the $SANDBOX term to refer to the root directory of the sandbox.
2.3.1.1. Extract, Transform and Load (ETL)
A large SMP platform is not always the best platform for every job. ETL operations, for example, are well suited to running on a scale-out cluster. A typical outcome of running ETL this way is that the task ends with part-files on each node of the cluster. There is no need to combine these files before bringing them to the xGT server platform.
For a specific ETL job, $ETLJOB, the data can be brought to the filesystem with this layout:
$SANDBOX/$ETLJOB/prefix_part00.csv
$SANDBOX/$ETLJOB/prefix_part01.csv
...
The Data Movement section describes how a user can request the movement of data to/from a file or a client Python environment.
With a single request, the xGT server can be instructed to ingest all files in a specific directory, for example: xgtd://$ETLJOB/. In addition, the xGT server will look into all subdirectories one level deep to find files to ingest. Note that the xGT server will ingest all of the part files concurrently.
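As an illustration, a minimal client-side sketch of such a request might look like the following. The frame name and connection defaults are assumptions for this example, and the frame lookup call can vary by xGT version; the load() method and its paths parameter are described in Data Movement.

```python
# A minimal sketch of a directory ingest request (frame name and
# connection settings are assumptions for illustration).
import xgt

conn = xgt.Connection()                 # connect to the xGT server
frame = conn.get_frame('Transactions')  # lookup call may vary by version

# One directory path ingests every file it holds (plus files one
# subdirectory level down); the server processes them concurrently.
frame.load(paths=['xgtd://$ETLJOB/'])
```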
It is also possible to leave the files compressed (either gzip or bzip2). This will use less disk space on the platform running the xGT server, but it adds ingest time in two ways. First, the computation required to decompress is added to the ingest processing. Second, the amount of parallelism is limited, since a compressed file must be decompressed sequentially. The example below shows bzip2 compression:
$SANDBOX/$ETLJOB/prefix_part00.csv.bz2
$SANDBOX/$ETLJOB/prefix_part01.csv.bz2
...
The request to ingest a collection of compressed files is no different from the request for uncompressed files. In fact, the xGT server can read files from a directory (e.g., xgtd://$ETLJOB/) that holds a mix of flat CSV files and compressed CSV files.
The paths parameter of the load() method requires a Python sequence of file path names. A Python script that builds a lengthy sequence of individual files or directories is another way to achieve parallelism during the xGT server ingest operation. The collection of files to be ingested during a single call to the load() method is constructed by combining:

- all of the file path elements pointing to individual files
- all of the files residing in directories pointed to by file path elements that are directories

This entire collection of files is processed concurrently, as shown in the sketch below.
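For example, a mixed sequence of individual files and directories can be passed in one call. The paths below are hypothetical, and frame is assumed to be a frame object obtained as in the earlier sketch.

```python
# A minimal sketch: combine individual files and whole directories in
# one load() call; all resolved files are ingested concurrently.
paths = ['xgtd://$ETLJOB/extra_rows.csv',  # an individual file
         'xgtd://$ETLJOB/day1/',           # every file in this directory
         'xgtd://$ETLJOB/day2/']           # ... and in this one
frame.load(paths=paths)
```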
2.3.1.2. Periodic Data Feed
If an enterprise collects data on an on-going basis, it makes sense to automatically and routinely bring the newest data to the xGT data sandbox.
Assuming that new data arrives every hour, it makes sense to use a directory structure to organize historical data. One way to do this is a directory hierarchy for each data source with the year at the top, the day of the year (1 through 365 or 366) at the next level, and 24 hourly files in each leaf directory.
For a specific $DATASOURCE, the data collected at 10PM on January 14, 2020 would be found here:
$SANDBOX/$DATASOURCE/2020/14/22.csv
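To make the layout concrete, here is a small sketch of how a client script might compute the sandbox-relative path for an hourly drop; the $DATASOURCE placeholder stands in for a real data source directory.

```python
# A minimal sketch: map a timestamp onto the year/day-of-year/hour
# layout described above.
from datetime import datetime

ts = datetime(2020, 1, 14, 22)  # 10PM on January 14, 2020
path = f'$DATASOURCE/{ts.year}/{ts.timetuple().tm_yday}/{ts.hour}.csv'
print(path)                     # prints: $DATASOURCE/2020/14/22.csv
```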
One convenient way to automatically ingest newly arriving data is to write a Python script to monitor the directory structure under $SANDBOX/$DATASOURCE. Whenever a new file arrives, the script can act as an xGT client and request that the xGT server ingest the new data.
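A minimal polling sketch of such a monitor is shown below. The watch directory, frame name, and poll interval are assumptions, and the frame lookup call may vary by xGT version; a production script might rely on inotify or a scheduler instead.

```python
# A minimal polling sketch: ingest each newly arrived hourly file.
import time
import pathlib
import xgt

WATCH_DIR = pathlib.Path('/srv/xgtd/data/mysource')  # $SANDBOX/$DATASOURCE
conn = xgt.Connection()
frame = conn.get_frame('Events')  # hypothetical frame; call may vary

seen = set()
while True:
    for f in WATCH_DIR.glob('*/*/*.csv'):          # year/day-of-year/hour.csv
        if f not in seen:
            rel = f.relative_to(WATCH_DIR.parent)  # path inside the sandbox
            frame.load(paths=[f'xgtd://{rel}'])    # server-side ingest
            seen.add(f)
    time.sleep(60)  # poll once a minute
```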
2.3.2. Configuring the Data Sandbox
The section Configuring the xGT Server has more details on the system.io_directory and system.io_follow_symbolic_links settings.
There is no way to run the xGT server with the sandbox disabled. Every ingest/egest operation validates that the target file is within the sandbox boundary. If no value is provided for the system.io_directory setting, a reasonable default is used (see Configuring the xGT Server).
Some sites may want to allow following symbolic links to locations outside of the system.io_directory root. To allow that, the system.io_follow_symbolic_links setting must be set to true.
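As a sketch, the two settings might appear together in the server configuration like this. The snippet assumes a JSON-style configuration file; see Configuring the xGT Server for the authoritative file location and syntax.

```json
{
  "system.io_directory": "/srv/xgtd/data",
  "system.io_follow_symbolic_links": true
}
```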
2.3.3. Granting Data Sandbox Access to Users
The default configuration does not allow any users (other than root or xgtd) access to the $SANDBOX portion of the host filesystem.
There are a couple of strategies to set up a user to be able to read and write data in the sandbox:

- Add the user to the xgtd group. This strategy requires that the permissions on the /srv/xgtd/data directory be modified to allow group members to create files.
- Add a symbolic link in the /srv/xgtd/data/ directory to somewhere inside the user's home directory structure. This strategy requires that the xGT server be configured to allow following symbolic links (see Configuring the Data Sandbox). The user may also need to adjust file permissions to allow the xgtd user to read their data files.