4.1. An Introduction to Graph Analysis with xGT

4.1.1. Setting up the xGT Environment

For this demonstration, we’ll be running the xGT Server on an Amazon EC2 instance and interacting with it over a secure SSH tunnel from our laptop. More details and alternative methods for getting xGT running can be found at docs.trovares.com.

4.1.1.1. Install `xgt` on Our Local Machine

!pip install --quiet --upgrade xgt

4.1.1.2. Checking the xGT Server Is Up

# Assume "cloud" is the name of the server platform
!ssh cloud ps -e | grep xgtd

4.1.1.3. Establishing a Secure Tunnel

# Assume "ec2-user" is a username on the server platform
!ssh -fNL 4367:localhost:4367 ec2-user@cloud
!ps -A | grep 'ssh[ ]-fNL'
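
As a complement to the `ps` check, the local end of the tunnel can also be probed from Python. The sketch below is a minimal illustration using only the standard library; the helper name `port_is_open` is hypothetical and not part of `xgt`. It only confirms that something is listening on the local port 4367 forwarded above, not that the xGT server behind the tunnel is healthy.

```python
import socket

def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        # create_connection raises OSError on refusal or timeout
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# 4367 is the local port forwarded by the ssh command above
print("local tunnel port reachable:", port_is_open("localhost", 4367))
```

If this prints `False`, the `ssh -fNL` process has most likely exited or never started.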

4.1.2. Getting Started with xGT

First, let’s import the package and ensure our connection is good.

import xgt
conn = xgt.Connection()
conn.set_default_namespace('lanl')
conn

  <xgt.connection.Connection at 0x11073cfd0>

4.1.3. Loading Data

We’ll be using the LANL Unified Host and Network Dataset, a set of netflow and host event data collected on an internal Los Alamos National Lab network.

Our goal is to turn about 100 GB of CSV files into a single connected graph.

4.1.3.1. Step 1: Create a Graph

Vertices: Los Alamos uses a free-form, anonymized string called a “device” as a host identifier (analogous to an IP address). We’ll use these as vertices in our graph:

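# Drop any frames left over from a previous run
# (edge frames first, then the vertex frame they reference)
[conn.drop_frame(_) for _ in ['Netflow','HostEvents','Devices']]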

  [True, True, True]

devices = conn.create_vertex_frame(
    name='Devices',
    schema=[['device', xgt.TEXT]],
    key='device')

devices

  <xgt.graph.VertexFrame at 0x11073ca90>

Edges: The LANL dataset contains two types of data: netflow and Windows log events. Some of the recorded log events describe activity within a single host/device (e.g., reboots), and some describe authentication events that may span two devices (e.g., a login from device A to device B). We’ll call the authentication events AuthEvents and the rest HostEvents. In this notebook we load only the Netflow data and the HostEvents.

netflow = conn.create_edge_frame(
    name='Netflow',
    schema=[['epoch_time', xgt.INT],
            ['duration', xgt.INT],
            ['src_device', xgt.TEXT],
            ['dst_device', xgt.TEXT],
            ['protocol', xgt.INT],
            ['src_port', xgt.INT],
            ['dst_port', xgt.INT],
            ['src_packets', xgt.INT],
            ['dst_packets', xgt.INT],
            ['src_bytes', xgt.INT],
            ['dst_bytes', xgt.INT]],
    source=devices,
    target=devices,
    source_key='src_device',
    target_key='dst_device')

netflow

  <xgt.graph.EdgeFrame at 0x11073c610>

host_events = conn.create_edge_frame(
    name='HostEvents',
    schema=[['epoch_time', xgt.INT],
            ['event_id', xgt.INT],
            ['log_host', xgt.TEXT],
            ['user_name', xgt.TEXT],
            ['domain_name', xgt.TEXT],
            ['logon_id', xgt.INT],
            ['process_name', xgt.TEXT],
            ['process_id', xgt.INT],
            ['parent_process_name', xgt.TEXT],
            ['parent_process_id', xgt.INT]],
    source=devices,
    target=devices,
    source_key='log_host',
    target_key='log_host')

host_events

  <xgt.graph.EdgeFrame at 0x11073ccd0>
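
To make the role of the schema and the key columns concrete, here is a minimal plain-Python sketch of how one CSV line could be turned into a typed record and an edge between two devices. It is an illustration only, not how xGT ingests data: the sample line is made up, the column order is assumed to match the schema order, and Python's `int`/`str` stand in for `xgt.INT`/`xgt.TEXT`.

```python
import csv
import io

# Column names copied from the Netflow schema above;
# int/str stand in for xgt.INT/xgt.TEXT.
NETFLOW_SCHEMA = [
    ("epoch_time", int), ("duration", int),
    ("src_device", str), ("dst_device", str),
    ("protocol", int), ("src_port", int), ("dst_port", int),
    ("src_packets", int), ("dst_packets", int),
    ("src_bytes", int), ("dst_bytes", int),
]

def parse_row(row, schema=NETFLOW_SCHEMA):
    """Convert a list of CSV fields into a dict typed according to the schema."""
    if len(row) != len(schema):
        raise ValueError("expected {} fields, got {}".format(len(schema), len(row)))
    return {name: cast(value) for (name, cast), value in zip(schema, row)}

# Made-up sample line (not taken from the LANL files)
sample = "1000,3700,DeviceA,DeviceB,6,3128,443,10,12,2048,4096\n"
record = parse_row(next(csv.reader(io.StringIO(sample))))

# source_key and target_key pick out the edge endpoints
edge = (record["src_device"], record["dst_device"])
print(edge)  # ('DeviceA', 'DeviceB')
```

For HostEvents, both keys point at `log_host`, so every such edge starts and ends at the same device.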

4.1.3.2. Step 2: Load the Data

With the frame schemas defined, we can load data that fits those schemas directly from the data files into the corresponding edge or vertex frames.

# Utility to pretty-print the data currently in xGT
def print_data_summary():
  print('Devices (vertices): {:,}'.format(devices.num_vertices))
  print('Netflow (edges): {:,}'.format(netflow.num_edges))
  print('Host event (edges): {:,}'.format(host_events.num_edges))

print_data_summary()

  Devices (vertices): 0
  Netflow (edges): 0
  Host event (edges): 0

Load the one-sided host event data (events whose source and target are the same device):

%%time
urls = ["https://datasets.trovares.com/LANL/xgt/wls_day-02_1v.csv"]
host_events.load(urls)
print_data_summary()

  Devices (vertices): 10,086
  Netflow (edges): 0
  Host event (edges): 16,781,773
  CPU times: user 13.9 ms, sys: 18.2 ms, total: 32.1 ms
  Wall time: 51.3 s
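
The summary shows the Devices count rising from 0 to 10,086 even though only edge data was loaded, which suggests that a vertex is created for each distinct key value seen at an edge endpoint. A toy plain-Python sketch of that idea follows; it is not xGT's implementation, and the helper name and sample events are hypothetical.

```python
def implied_vertices(edges, source_key, target_key):
    """Collect the distinct key values that appear as edge endpoints."""
    keys = set()
    for edge in edges:
        keys.add(edge[source_key])
        keys.add(edge[target_key])
    return keys

# Made-up host events; only the fields needed here are shown
toy_events = [
    {"epoch_time": 10, "event_id": 4608, "log_host": "DeviceA"},
    {"epoch_time": 12, "event_id": 4688, "log_host": "DeviceA"},
    {"epoch_time": 15, "event_id": 4688, "log_host": "DeviceB"},
]

# HostEvents use log_host as both source and target key
devices_seen = implied_vertices(toy_events, "log_host", "log_host")
print(len(devices_seen), "devices from", len(toy_events), "events")
# 2 devices from 3 events
```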

Load the netflow data:

%%time
urls = ["https://datasets.trovares.com/LANL/xgt/nf_day-02.csv"]
netflow.load(urls)
print_data_summary()

  Devices (vertices): 31,324
  Netflow (edges): 115,949,436
  Host event (edges): 16,781,773
  CPU times: user 46.3 ms, sys: 73.8 ms, total: 120 ms
  Wall time: 3min 56s

4.1.4. Querying Our Graph

We’ll be looking for a mock pattern, similar to one that might be used to detect botnet behavior. The pattern reflects an infected host (a) connecting to a botnet command-and-control node (b), which in turn has an exfiltration connection to a collection node (c).

Some device A boots up and, within a short amount of time, starts up a program.
Shortly afterwards, device A sends a message to some other device B.
Device B has a long-standing connection to another device C, which has been open for at least an hour, started before A booted, and remained open after A sent a message to B.

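# Query helper function
import time

def run_query_and_count(query, parameters=None):
    """Run a query that writes a count INTO 'Answers' and return that count."""
    # Remove results left over from a previous query
    conn.drop_frame('Answers')
    # Block until the server's metrics are ready, timing how long that takes
    start_time = time.time()
    conn.wait_for_metrics()
    wait_time = time.time() - start_time
    if wait_time > 30:
        print('Time to wait for metrics: {:3,.2f}'.format(wait_time))
    conn.run_job(query, parameters=parameters)
    # Retrieve the count from the first cell of the result table
    table = conn.get_table_frame('Answers')
    count = table.get_data()[0][0]
    return count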

4.1.4.1. Query 1: Search for Just a Boot Event

Normally, we might not know exactly which pattern to look for, and refining that pattern would be part of the analytic process. We’ll start by just finding devices with boot events.

%%time
q = """
MATCH (a)-[boot:HostEvents]->(a)
WHERE boot.event_id = 4608
RETURN count(*)
INTO Answers
"""
count = run_query_and_count(q)
print('Number of boot events: ' + '{:,}'.format(count))

  Number of boot events: 1,891
  CPU times: user 4.23 ms, sys: 2.98 ms, total: 7.2 ms
  Wall time: 454 ms

4.1.4.2. Query 2: Boot Event Followed by Program Start Event

Now we’ll refine the query by looking for programs launched within 4 seconds of a boot event.

%%time
q = """
MATCH (a)-[boot:HostEvents]->(a)-[program:HostEvents]->(a)
WHERE boot.event_id = 4608
  AND program.event_id = 4688
  AND program.epoch_time >= boot.epoch_time
  AND program.epoch_time - boot.epoch_time < $max_time_window
RETURN count(*)
INTO Answers
"""

count = run_query_and_count(q, parameters = {'max_time_window':4})
print('Number of boot & program start events: ' + '{:,}'.format(count))

  Number of boot & program start events: 547,759
  CPU times: user 6.12 ms, sys: 4.09 ms, total: 10.2 ms
  Wall time: 1.06 s
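
The count here is far larger than the number of boot events because each boot can pair with every program start on the same device that falls inside the window. The plain-Python sketch below reproduces the same time-window condition on a handful of made-up events; the function name and the data are hypothetical, and this brute-force loop is only an illustration, not how xGT evaluates the query.

```python
from collections import defaultdict

BOOT, PROGRAM_START = 4608, 4688  # event IDs used in the query above

def count_boot_program_pairs(events, max_time_window):
    """Count (boot, program start) pairs on the same host with
    0 <= program_time - boot_time < max_time_window."""
    by_host = defaultdict(lambda: {BOOT: [], PROGRAM_START: []})
    for host, event_id, epoch_time in events:
        if event_id in (BOOT, PROGRAM_START):
            by_host[host][event_id].append(epoch_time)
    count = 0
    for times in by_host.values():
        for boot_t in times[BOOT]:
            for prog_t in times[PROGRAM_START]:
                if 0 <= prog_t - boot_t < max_time_window:
                    count += 1
    return count

# Made-up events: (host, event_id, epoch_time)
toy = [
    ("DeviceA", 4608, 100),
    ("DeviceA", 4688, 100),
    ("DeviceA", 4688, 101),
    ("DeviceA", 4688, 103),
    ("DeviceA", 4688, 104),  # exactly 4 s after boot: excluded by the strict <
    ("DeviceB", 4688, 101),  # no boot event on this host
]
print(count_boot_program_pairs(toy, 4))  # 3
```

One boot event yields three matches here, which is the same multiplicative effect seen in the query result.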

4.1.4.3. Query 3: Boot, Program Start, and Connection

Now we add a Netflow edge between A and B with source port 3128 (in our mock example, this is a signature port for the bot in question). We restrict the match so that all three pieces (boot, program start, and message) happen within 4 seconds.

%%time
q = """
MATCH (a)-[boot:HostEvents]->(a)-[program:HostEvents]->(a)
         <-[nf1:Netflow]-(b)
WHERE a <> b
  AND nf1.src_port = 3128
  AND boot.event_id = 4608
  AND program.event_id = 4688
  AND program.epoch_time >= boot.epoch_time
  AND nf1.epoch_time >= program.epoch_time
  AND nf1.epoch_time - boot.epoch_time < $max_time_window
RETURN count(*)
INTO Answers
"""
# Note the overall time limit on the sequence of the three events

count = run_query_and_count(q, parameters = {'max_time_window':4})
print('Number of boot, program start, & nf1 events: ' + '{:,}'.format(count))

  Number of boot, program start, & nf1 events: 75
  CPU times: user 7.48 ms, sys: 4.97 ms, total: 12.4 ms
  Wall time: 3.97 s

4.1.4.4. Query 4: Full Zombie Reboot Pattern

Finally, we add in the last network connection and match the full pattern.

%%time
q = """
MATCH (a)-[boot:HostEvents]->(a)-[program:HostEvents]->(a)
         <-[nf1:Netflow]-(b)-[nf2:Netflow]->(c)
WHERE a <> b AND b <> c AND a <> c
  AND nf1.src_port = 3128
  AND boot.event_id = 4608
  AND program.event_id = 4688
  AND program.epoch_time >= boot.epoch_time
  AND nf1.epoch_time >= program.epoch_time
  AND nf1.epoch_time - boot.epoch_time < $max_time_window
  AND nf2.duration >= $min_session_duration
  AND nf2.epoch_time < nf1.epoch_time
  AND nf2.epoch_time + nf2.duration >= nf1.epoch_time
RETURN count(*)
INTO Answers
"""

count = run_query_and_count(q, parameters = {'max_time_window':4, 'min_session_duration':3600})
print('Number of zombie reboot events: ' + '{:,}'.format(count))

  Number of zombie reboot events: 11,925
  CPU times: user 6.35 ms, sys: 5.89 ms, total: 12.2 ms
  Wall time: 6.92 s
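
The timing conditions in the WHERE clause above can be checked by hand for a single candidate match. The sketch below restates them as a plain-Python predicate; the function name and the sample timestamps are hypothetical, and the device-inequality and port/event-ID filters are left out to keep the focus on the temporal logic.

```python
def satisfies_timing(boot_t, program_t, nf1_t, nf2_t, nf2_duration,
                     max_time_window=4, min_session_duration=3600):
    """Mirror the temporal conditions of the Query 4 WHERE clause."""
    return (
        program_t >= boot_t                           # program starts at or after boot
        and nf1_t >= program_t                        # message follows program start
        and nf1_t - boot_t < max_time_window          # whole sequence within the window
        and nf2_duration >= min_session_duration      # long-standing B-to-C session
        and nf2_t < nf1_t                             # session opened before the message
        and nf2_t + nf2_duration >= nf1_t             # and still open when it was sent
    )

# Session opened at t=500 and lasting 4000 s covers the message at t=1003
print(satisfies_timing(1000, 1001, 1003, 500, 4000))  # True
# Same events, but the session lasts only 400 s
print(satisfies_timing(1000, 1001, 1003, 500, 400))   # False
```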