2.3. Enterprise Data Movement

Moving data to or from a system other than the one running the xGT server can be an important component of an enterprise solution that uses the xGT graph product. One example involves running data cleansing and transformation on a scale-out cluster prior to bringing that data into xGT. Another example is pulling data from repositories, such as a data lake, where the user selects a specific slice of data to be analyzed with xGT.

Whenever the xGT server moves data between the filesystem and RAM, the location on the filesystem must be within the sandbox. The sandbox has a root directory and contains all files in that directory and in all of its subdirectories. A configuration setting controls whether the server follows symbolic links (which could point to locations outside the sandbox). Configuring the sandbox is described in Configuring the Data Sandbox.

2.3.1. Common Usage Patterns

For the purposes of these examples, we use $SANDBOX to refer to the root directory of the sandbox.

2.3.1.1. Extract, Transform and Load (ETL)

A large SMP platform is not the best fit for every job; ETL operations, in particular, are well-suited to running on a scale-out cluster. A typical outcome of running ETL this way is a set of part-files on each node of the cluster. There is no need to combine these files before bringing them to the xGT server platform.

For a specific ETL job, $ETLJOB, the data can be brought to the filesystem with this layout:

$SANDBOX/$ETLJOB/prefix_part00.csv
$SANDBOX/$ETLJOB/prefix_part01.csv
  .
  .
  .

The Data Movement section describes how a user can request the movement of data to or from a file or a client Python environment. With a single request, the xGT server can be instructed to ingest all files in a specific directory, for example: xgtd://$ETLJOB/. In addition, the xGT server looks into all subdirectories one level deep to find files to ingest. Note that the xGT server ingests all the part files concurrently.
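
As a sketch of this pattern, the following Python client code asks the xGT server to ingest everything under the ETL job directory in one call. The frame name, schema, and connection details are illustrative; substitute values appropriate for your data.

import xgt

server = xgt.Connection()  # connection details are site-specific

# An illustrative frame whose schema matches the columns of the part files.
frame = server.create_table_frame(
    name='EtlResults',
    schema=[['id', xgt.INT], ['description', xgt.TEXT]])

# One request ingests every file in the directory (and in subdirectories
# one level deep); the part files are read concurrently.
frame.load(['xgtd://$ETLJOB/'])  # substitute the actual job directory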

It is also possible to leave the files compressed (either gzip or bzip2). This uses less disk space on the platform running the xGT server, but adds ingest time in two ways. First, the computation required to decompress is added to the ingest processing. Second, parallelism is limited, since a compressed file must be decompressed sequentially. The example below shows bzip2 compression.

$SANDBOX/$ETLJOB/prefix_part00.csv.bz2
$SANDBOX/$ETLJOB/prefix_part01.csv.bz2
  .
  .
  .

The request to ingest a collection of compressed files is no different from one for uncompressed files. In fact, the xGT server can read files from a directory (e.g., xgtd://$ETLJOB/) that holds a mix of flat CSV files and compressed CSV files.

The paths parameter of the load() method requires a Python sequence of file path names. A Python script that builds a lengthy sequence of individual files or directories is another way to achieve parallelism during the xGT server ingest operation. The collection of files to be ingested during a single call to the load() method is constructed by combining:

  • all the file path elements that point to individual files;

  • all the files residing in directories named by the file path elements that are directories.

This entire collection of files is processed concurrently.
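
For instance, a single load() call can mix individual files with whole directories, and everything they resolve to is ingested concurrently. This sketch assumes the frame from the earlier example; the path names are illustrative.

import xgt

server = xgt.Connection()
frame = server.get_table_frame('EtlResults')  # illustrative frame name

paths = [
    'xgtd://$ETLJOB/prefix_part00.csv',      # an individual file
    'xgtd://$ETLJOB/prefix_part01.csv.bz2',  # compressed files mix freely
    'xgtd://another_etl_job/',               # every file in this directory
]
frame.load(paths)  # the whole collection is processed concurrently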

2.3.1.2. Periodic Data Feed

If an enterprise collects data on an ongoing basis, it makes sense to automatically and routinely bring the newest data to the xGT data sandbox. Assuming new data arrives every hour, a directory structure helps organize the historical data. One way to do this is a directory hierarchy for each data source with the year near the top, the 365 (or 366) days of the year at the next level, and 24 hourly files in each leaf directory. For a specific $DATASOURCE, the data collected at 10PM on January 14, 2020 (day 14 of the year, hour 22) would be found here:

$SANDBOX/$DATASOURCE/2020/14/22.csv

One convenient way to automatically ingest newly arriving data is to write a Python script to monitor the directory structures under $SANDBOX/$DATASOURCE. Whenever a new file arrives, the script can act as an xGT client and request that the xGT server ingest the new data.
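
A minimal polling sketch appears below. It assumes the default sandbox root of /srv/xgtd/data, an existing table frame, and the hourly year/day/hour.csv layout shown above; a production script would likely use filesystem notifications and handle errors.

import time
from pathlib import Path

import xgt

SANDBOX = Path('/srv/xgtd/data')   # the $SANDBOX root
SOURCE = SANDBOX / 'DATASOURCE'    # the $DATASOURCE directory

server = xgt.Connection()
frame = server.get_table_frame('DataSource')  # illustrative frame name

seen = set()
while True:
    # Look for year/day/hour.csv files that have not yet been ingested.
    for csv_file in SOURCE.glob('*/*/*.csv'):
        if csv_file not in seen:
            # xgtd:// paths are interpreted relative to the sandbox root.
            frame.load(['xgtd://' + str(csv_file.relative_to(SANDBOX))])
            seen.add(csv_file)
    time.sleep(60)  # poll once a minute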

2.3.2. Configuring the Data Sandbox

The section Configuring the xGT Server has more details on the system.io_directory and system.io_follow_symbolic_links settings. There is no way to run the xGT server with the sandbox disabled: every ingest/egest operation validates that the target file is within the sandbox boundary. If no value is provided for the system.io_directory setting, a reasonable default is used (see Configuring the xGT Server).

Some sites may want to allow following symbolic links to locations outside the tree rooted at system.io_directory. To allow that, the system.io_follow_symbolic_links setting must be set to true.
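
As a sketch, the two settings might appear together in the server configuration file like this (the file format and location are described in Configuring the xGT Server):

{
  "system.io_directory": "/srv/xgtd/data",
  "system.io_follow_symbolic_links": true
}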

2.3.3. Granting Data Sandbox Access to Users

The default configuration does not allow any user (other than root or xgtd) to access the $SANDBOX portion of the host filesystem. There are a couple of strategies for giving a user read and write access to data in the sandbox:

  • Add the user to the xgtd group. This strategy requires that the permissions on the /srv/xgtd/data directory be modified to allow group members to create files.

  • Add a symbolic link in the /srv/xgtd/data/ directory to somewhere inside the user’s home directory structure. This strategy requires that the xGT server be configured to allow following symbolic links (see Configuring the Data Sandbox). The user may also need to adjust file permissions to allow the xgtd user to read their data files.
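
The shell commands below sketch both strategies for a hypothetical user alice, using the default /srv/xgtd/data sandbox root; the exact permissions required will vary by site.

# Strategy 1: add alice to the xgtd group and allow group members
# to create files under the sandbox root.
sudo usermod -a -G xgtd alice
sudo chmod g+w /srv/xgtd/data

# Strategy 2: link part of alice's home directory into the sandbox
# (requires system.io_follow_symbolic_links to be set to true).
sudo ln -s /home/alice/xgt_data /srv/xgtd/data/alice_data
chmod -R a+rX /home/alice/xgt_data   # let the xgtd user read the files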