An Introduction to Graph Analysis with xGT

Download this Jupyter notebook for an interactive experience.


Setting up the xGT environment

For this demonstration, we run the xGT server on an Amazon EC2 instance and interact with it from our laptop over a secure SSH tunnel. More details, and alternative ways to get xGT running, can be found at docs.trovares.com.

Connection Setup

Install xgt on our local machine

!pip install --quiet --upgrade xgt

Checking that the xGT server is up

!ssh cloud ps -e | grep xgtd
 1377 ?        00:00:00 xgtd

Establishing a secure tunnel

!ssh -fNL 4367:localhost:4367 ec2-user@cloud
!ps -A | grep 'ssh[ ]-fNL'
44348 ??         0:00.00 ssh -fNL 4367:localhost:4367 ec2-user@cloud
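
Once the tunnel is running, it can be useful to confirm from Python that the forwarded port accepts connections before creating an xGT connection. The following is a minimal sketch using only the standard library; `port_is_open` is a hypothetical helper, not part of xgt, and 4367 is the local port forwarded by the ssh command above.

```python
import socket

def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable
        return False

# 4367 is the local end of the ssh tunnel set up above
print("tunnel reachable:", port_is_open("localhost", 4367))
```

A result of False usually means the tunnel process is not running or the port number differs from the one forwarded.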

Getting started with xGT

First, let's import the package and ensure our connection is good.

import xgt
conn = xgt.Connection()
conn
<xgt.connection.Connection at 0x110634390>

Loading data

We'll be using the LANL Unified Host and Network Dataset, a set of netflow and host event data collected on an internal Los Alamos National Lab network.

Our goal will be to turn about 100 GB of CSV files into a single connected graph.

Step 1: Create a Graph

Data model

Vertices: Los Alamos uses a free-form, anonymized string called a "device" as a host identifier (analogous to an IP address). We'll use these as vertices in our graph:

[conn.drop_frame(_) for _ in ['Netflow','HostEvents','Devices']]
[True, True, True]
devices = conn.create_vertex_frame(
    name='Devices',
    schema=[['device', xgt.TEXT]],
    key='device')

devices
<xgt.graph.VertexFrame at 0x1042a7110>

Edges: The LANL dataset contains two types of data: netflow and Windows log events. Of the log events recorded, some describe events within a single host/device (e.g., reboots), and some describe authentication events that may span two devices (e.g., a login from device A to device B). We'll call the authentication events AuthEvents and the others HostEvents. In this notebook we load only the Netflow data and the HostEvents.

netflow = conn.create_edge_frame(
    name='Netflow',
    schema=[['epoch_time', xgt.INT],
            ['duration', xgt.INT],
            ['src_device', xgt.TEXT],
            ['dst_device', xgt.TEXT],
            ['protocol', xgt.INT],
            ['src_port', xgt.INT],
            ['dst_port', xgt.INT],
            ['src_packets', xgt.INT],
            ['dst_packets', xgt.INT],
            ['src_bytes', xgt.INT],
            ['dst_bytes', xgt.INT]],
    source=devices,
    target=devices,
    source_key='src_device',
    target_key='dst_device')

netflow
<xgt.graph.EdgeFrame at 0x10427c4d0>
host_events = conn.create_edge_frame(
    name='HostEvents',
    schema=[['epoch_time', xgt.INT],
            ['event_id', xgt.INT],
            ['log_host', xgt.TEXT],
            ['user_name', xgt.TEXT],
            ['domain_name', xgt.TEXT],
            ['logon_id', xgt.INT],
            ['process_name', xgt.TEXT],
            ['process_id', xgt.INT],
            ['parent_process_name', xgt.TEXT],
            ['parent_process_id', xgt.INT]],
    source=devices,
    target=devices,
    source_key='log_host',
    target_key='log_host')

host_events
<xgt.graph.EdgeFrame at 0x104301ed0>

Step 2: Load the data

With the frames and their schemas defined, we can load data that fits those schemas directly from data files. Each file is loaded into the edge or vertex frame it corresponds to.

# Utility to pretty-print the data currently in xGT
def print_data_summary():
    print('Devices (vertices): {:,}'.format(devices.num_vertices))
    print('Netflow (edges): {:,}'.format(netflow.num_edges))
    print('Host event (edges): {:,}'.format(host_events.num_edges))

print_data_summary()
Devices (vertices): 0
Netflow (edges): 0
Host event (edges): 0

Load the one-sided host event data (events that occur within a single device, stored as self-loop edges):

%%time
urls = ["https://datasets.trovares.com/LANL/xgt/wls_day-%02d_1v.csv" % int(i) for i in range(4,5)]
#urls = ["xgtd://wls_day-%02d_1v.csv" % int(i) for i in range(4,5)]
host_events.load(urls)
print_data_summary()
Devices (vertices): 10170
Netflow (edges): 0
Host event (edges): 16402438
CPU times: user 51.2 ms, sys: 15.3 ms, total: 66.5 ms
Wall time: 31.1 s
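
The URL list above is built with a `%02d` format, so `range(4,5)` produces a single file, the one for day 4. The following minimal sketch shows how the same pattern expands over a range of days; `day_urls` is a hypothetical helper, and whether files for days other than day 4 exist at this location is an assumption not confirmed here.

```python
def day_urls(template, first_day, last_day):
    """Expand a template containing %02d for days first_day..last_day inclusive."""
    return [template % day for day in range(first_day, last_day + 1)]

HOST_EVENTS_TEMPLATE = "https://datasets.trovares.com/LANL/xgt/wls_day-%02d_1v.csv"

# The same single-day list the notebook builds with range(4,5)
print(day_urls(HOST_EVENTS_TEMPLATE, 4, 4))

# A multi-day list; the days are zero-padded to two digits
for url in day_urls(HOST_EVENTS_TEMPLATE, 4, 6):
    print(url)
```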

Load the netflow data:

%%time
urls = ["https://datasets.trovares.com/LANL/xgt/nf_day-%02d.csv" % int(i) for i in range(4,5)]
#urls = ["xgtd://nf_day-%02d.csv" % int(i) for i in range(4,5)]
netflow.load(urls)
print_data_summary()
Devices (vertices): 157949
Netflow (edges): 222323503
Host event (edges): 16402438
CPU times: user 137 ms, sys: 40.8 ms, total: 178 ms
Wall time: 4min 25s
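
Only edge files were loaded, yet the Devices count rose from 0 to 10,170 after the host events and to 157,949 after the netflow data. This suggests that loading an edge also adds any endpoint that is missing from the vertex frame. The pure-Python sketch below mimics that behavior on made-up rows; the device names and the `load_edges` helper are hypothetical, and the code does not use the xgt API.

```python
def load_edges(vertices, edges, rows, source_idx, target_idx):
    """Append rows to edges, adding unseen endpoint keys to vertices."""
    for row in rows:
        vertices.add(row[source_idx])
        vertices.add(row[target_idx])
        edges.append(row)

devices = set()
host_events, netflow = [], []

# Host events are self-loops: log_host is both the source and the target key
load_edges(devices, host_events,
           [(100, 4608, "Comp1"), (102, 4688, "Comp1"), (105, 4608, "Comp2")],
           source_idx=2, target_idx=2)
print(len(devices), len(host_events))   # 2 3

# Netflow rows connect two devices, either of which may be new
load_edges(devices, netflow,
           [(103, "Comp9", "Comp1"), (110, "Comp2", "Comp7")],
           source_idx=1, target_idx=2)
print(len(devices), len(netflow))       # 4 2
```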

Querying our graph

We'll be looking for a mock pattern, similar to one that might be used to detect botnet behavior. The pattern reflects an infected host (a) communicating with a botnet command-and-control node (b), which in turn holds an exfiltration connection to a collection node (c).

zombie-reboot-final.png

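# Query helper function
import time
def run_query_and_count(query):
    """Run a query that writes one count INTO Answers and return that count."""
    # Remove results left over from a previous query
    conn.drop_frame('Answers')
    # Wait for xGT's metrics to be ready; report the wait only if it was long
    start_time = time.time()
    conn.wait_for_metrics()
    wait_time = time.time() - start_time
    if wait_time > 30:
        print('Time to wait for metrics: {:3,.2f}'.format(wait_time))
    conn.run_job(query)
    # Retrieve count
    table = conn.get_table_frame('Answers')
    count = table.get_data()[0][0]
    return count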

Query 1: Search for just a boot event

Normally, we might not know exactly which pattern to look for, and refining that pattern would be part of the analytic process. We'll start by just finding devices with boot events.

zombie-reboot-boot.png

%%time
q = """
MATCH (a)-[boot:HostEvents]->(a)
WHERE boot.event_id = 4608
RETURN count(*)
INTO Answers
"""
count = run_query_and_count(q)
print('Number of boot events: ' + '{:,}'.format(count))
Number of boot events: 1,712
CPU times: user 10.9 ms, sys: 4.06 ms, total: 15 ms
Wall time: 446 ms

Query 2: Boot event followed by program start event

Now we'll refine the query by looking for programs launched within 4 seconds of a boot event.

zombie-reboot-program.png

%%time
q = """
MATCH (a)-[boot:HostEvents]->(a)-[program:HostEvents]->(a)
WHERE boot.event_id = 4608
  AND program.event_id = 4688
  AND program.epoch_time >= boot.epoch_time
  AND program.epoch_time - boot.epoch_time < 4
RETURN count(*)
INTO Answers
"""

count = run_query_and_count(q)
print('Number of boot & program start events: ' + '{:,}'.format(count))
Number of boot & program start events: 253,009
CPU times: user 12.8 ms, sys: 4.08 ms, total: 16.8 ms
Wall time: 691 ms
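
The count here (253,009) is far larger than the number of boot events (1,712) because the query counts every (boot, program) pair on the same device that satisfies the time constraints, not distinct devices or boots. The following pure-Python sketch applies the same filter to a handful of hypothetical events; the rows and the `count_boot_program_pairs` helper are made up for illustration.

```python
from collections import defaultdict

BOOT, PROGRAM_START = 4608, 4688  # event_id values used in the query

def count_boot_program_pairs(events, window=4):
    """Count (boot, program) pairs on one device with 0 <= dt < window."""
    by_device = defaultdict(list)
    for epoch_time, event_id, log_host in events:
        by_device[log_host].append((epoch_time, event_id))

    pairs = 0
    for device_events in by_device.values():
        boots = [t for t, e in device_events if e == BOOT]
        programs = [t for t, e in device_events if e == PROGRAM_START]
        pairs += sum(1 for b in boots for p in programs if 0 <= p - b < window)
    return pairs

events = [
    (100, BOOT, "Comp1"),
    (100, PROGRAM_START, "Comp1"),  # dt = 0, counts
    (103, PROGRAM_START, "Comp1"),  # dt = 3, counts
    (104, PROGRAM_START, "Comp1"),  # dt = 4, outside the window
    (101, PROGRAM_START, "Comp2"),  # different device with no boot event
]
print(count_boot_program_pairs(events))  # 2
```

A single boot followed by many program starts within the window therefore contributes many matches.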

Query 3: Boot, program start, and connection

Now we add a network connection edge between a and b that uses port 3128 (in our mock example, this is a signature port for the bot in question). In the query below, this is a Netflow edge from b to a whose source port is 3128. We'll restrict it such that all the pieces (boot, program start, and network connection) happen within 4 seconds.

zombie-reboot-flow1.png

%%time
q = """
MATCH (a)-[boot:HostEvents]->(a)-[program:HostEvents]->(a)<-[nf1:Netflow]-(b)
WHERE a <> b
  AND nf1.src_port = 3128
  AND boot.event_id = 4608
  AND program.event_id = 4688
  AND program.epoch_time >= boot.epoch_time
  AND nf1.epoch_time >= program.epoch_time
  AND nf1.epoch_time - boot.epoch_time < 4
RETURN count(*)
INTO Answers
"""
# Note the overall time limit on the sequence of the three events

count = run_query_and_count(q)
print('Number of boot, program start, & nf1 events: ' + '{:,}'.format(count))
Number of boot, program start, & nf1 events: 109
CPU times: user 19.8 ms, sys: 6.28 ms, total: 26.1 ms
Wall time: 3.35 s
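
Because the three timestamps are chained (boot, then program start, then netflow), bounding the gap between the netflow and the boot also bounds the gap between the program start and the boot, so Query 2's separate 4-second condition is no longer needed. A minimal sketch of the combined condition, with a hypothetical `in_sequence` helper and made-up timestamps:

```python
def in_sequence(boot_t, program_t, nf1_t, window=4):
    """True if boot <= program <= nf1 and nf1 falls within `window` seconds of boot."""
    return boot_t <= program_t <= nf1_t and nf1_t - boot_t < window

print(in_sequence(100, 101, 103))  # True
print(in_sequence(100, 103, 104))  # False: nf1 is 4 seconds after boot
print(in_sequence(100, 102, 101))  # False: nf1 precedes the program start
```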

Query 4: Full zombie reboot pattern

Finally, we add in the last network connection and match the full pattern.

zombie-reboot-final.png

%%time
q = """
MATCH (a)-[boot:HostEvents]->(a)-[program:HostEvents]->(a)
         <-[nf1:Netflow]-(b)-[nf2:Netflow]->(c)
WHERE a <> b AND b <> c AND a <> c
  AND nf1.src_port = 3128
  AND boot.event_id = 4608
  AND program.event_id = 4688
  AND program.epoch_time >= boot.epoch_time
  AND nf1.epoch_time >= program.epoch_time
  AND nf1.epoch_time - boot.epoch_time < 4
  AND nf2.duration >= 3600
  AND nf2.epoch_time < nf1.epoch_time
  AND nf2.epoch_time + nf2.duration >= nf1.epoch_time
RETURN count(*)
INTO Answers
"""

count = run_query_and_count(q)
print('Number of zombie reboot events: ' + '{:,}'.format(count))
Number of zombie reboot events: 981
CPU times: user 34.2 ms, sys: 10.5 ms, total: 44.8 ms
Wall time: 14.4 s
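
The three conditions on nf2 select a long-lived flow (at least one hour) that started before nf1 and was still open when nf1 began. The sketch below expresses that test as a plain Python predicate; `nf2_active_during_nf1` is a hypothetical helper and the timestamps are made up.

```python
def nf2_active_during_nf1(nf2_start, nf2_duration, nf1_start, min_duration=3600):
    """True if nf2 lasts >= min_duration, starts before nf1, and is still open at nf1."""
    return (nf2_duration >= min_duration
            and nf2_start < nf1_start
            and nf2_start + nf2_duration >= nf1_start)

print(nf2_active_during_nf1(0, 7200, 5000))     # True
print(nf2_active_during_nf1(0, 3600, 5000))     # False: nf2 ended at 3600
print(nf2_active_during_nf1(6000, 7200, 5000))  # False: nf2 starts after nf1
print(nf2_active_during_nf1(4000, 1800, 5000))  # False: nf2 is shorter than an hour
```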