An Introduction to Graph Analysis with xGT

Download this Jupyter notebook for an interactive experience.

Setting up the xGT environment

For this demonstration, we'll run the xGT server on an Amazon EC2 instance and interact with it from our laptop over a secure SSH tunnel. More details, and alternative ways of getting xGT running, can be found at docs.trovares.com.

Connection Setup

Install `xgt` on our local machine

!pip install --quiet --upgrade http://developer.trovares.com/download/python/latest/xgt-latest.tar.gz

Checking the xGT server is up

!ssh cloud ps -e | grep xgtd

 1377 ?        00:00:00 xgtd

Establishing a secure tunnel

!ssh -fNL 4367:localhost:4367 ec2-user@cloud
!ps -A | grep 'ssh[ ]-fNL'

44348 ??         0:00.00 ssh -fNL 4367:localhost:4367 ec2-user@cloud
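
Before creating a connection, it can be useful to confirm from Python that the local end of the tunnel is accepting TCP connections. The following is a minimal sketch using only the standard library; the helper name `port_is_open` is made up for this example and is not part of xGT, and the port number 4367 is taken from the tunnel command above.

```python
import socket

def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False

# 4367 is the local port forwarded by the ssh command above.
print("tunnel reachable:", port_is_open("localhost", 4367))
```

A successful check only shows that something is listening on the port; it does not prove that the listener is the xGT server.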

Getting started with xGT

First, let's import the package and ensure our connection is good.

import xgt
conn = xgt.Connection()
conn

<xgt.connection.Connection at 0x110634390>

Loading data

We'll be using the LANL Unified Host and Network Dataset, a set of netflow and host event data collected on an internal Los Alamos National Lab network.

Our goal will be to turn about 100 GB of CSV files into a single connected graph.

Step 1: Create a Graph

Data model

Vertices: Los Alamos uses a free-form, anonymized string called a "device" as a host identifier (analogous to an IP address). We'll use these as vertices in our graph:

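# Drop any frames left over from a previous run. Edge frames are dropped before the vertex frame they reference.
[conn.drop_frame(_) for _ in ['Netflow','Events','Devices']]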

[True, True, True]

devices = conn.create_vertex_frame(
            name='Devices',
            schema=[['device', xgt.TEXT]],
            key='device')
devices

<xgt.graph.VertexFrame at 0x1042a7110>

Edges: The LANL dataset contains two types of data: netflow and host events. Of the host events recorded, some describe events within a device (e.g., reboots), and some describe events between devices (e.g., login attempts). We'll only be loading the netflow data and the in-device events. We call the in-device events "one-sided", since we model each of them as a graph edge from one vertex to itself.

netflow = conn.create_edge_frame(
            name='Netflow',
            schema=[['epochtime', xgt.INT],
                    ['duration', xgt.INT],
                    ['srcDevice', xgt.TEXT],
                    ['dstDevice', xgt.TEXT],
                    ['protocol', xgt.INT],
                    ['srcPort', xgt.INT],
                    ['dstPort', xgt.INT],
                    ['srcPackets', xgt.INT],
                    ['dstPackets', xgt.INT],
                    ['srcBytes', xgt.INT],
                    ['dstBytes', xgt.INT]],
            source=devices,
            target=devices,
            source_key='srcDevice',
            target_key='dstDevice')
netflow

<xgt.graph.EdgeFrame at 0x10427c4d0>

events = conn.create_edge_frame(
           name='Events',
           schema=[['epochtime', xgt.INT],
                   ['eventID', xgt.INT],
                   ['logHost', xgt.TEXT],
                   ['userName', xgt.TEXT],
                   ['domainName', xgt.TEXT],
                   ['logonID', xgt.INT],
                   ['processName', xgt.TEXT],
                   ['processID', xgt.INT],
                   ['parentProcessName', xgt.TEXT],
                   ['parentProcessID', xgt.INT]],
           source=devices,
           target=devices,
           source_key='logHost',
           target_key='logHost')
events

<xgt.graph.EdgeFrame at 0x104301ed0>
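
To make the "one-sided" modeling concrete, here is a small plain-Python sketch that is independent of xGT. It turns host-event rows into edges whose source and target are both the logging host, which is what using `logHost` for both `source_key` and `target_key` expresses. The function name and the sample rows (including the device name) are invented for illustration.

```python
from typing import Dict, List, Tuple

def events_to_self_loops(rows: List[Dict]) -> List[Tuple[str, str, Dict]]:
    """Turn host-event rows into (source, target, properties) edges.

    Each event is one-sided: source and target are both the logging host.
    """
    edges = []
    for row in rows:
        host = row["logHost"]
        edges.append((host, host, row))
    return edges

# Made-up sample rows using column names from the Events schema.
sample = [
    {"epochtime": 100, "eventID": 4608, "logHost": "Comp000001"},
    {"epochtime": 102, "eventID": 4688, "logHost": "Comp000001"},
]
edges = events_to_self_loops(sample)

# The set of vertices touched by these edges: one device, two self-loops.
vertices = {v for src, dst, _ in edges for v in (src, dst)}
print(len(edges), "edges over", len(vertices), "vertex")
```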

Step 2: Load the data

With the frames defined, we can load data matching those schemas directly from data files. Each file is loaded into the frame it corresponds to. Loading edges also creates any device vertices they reference, as the vertex counts in the summaries below show.

# Utility to pretty-print the data currently in xGT
def print_data_summary():
  print('Devices (vertices): {:,}'.format(devices.num_vertices))
  print('Netflow (edges): {:,}'.format(netflow.num_edges))
  print('Host event (edges): {:,}'.format(events.num_edges))

print_data_summary()

Devices (vertices): 0
Netflow (edges): 0
Host event (edges): 0

Load the one-sided host event data:

%%time
urls = ["http://datasets.trovares.com/LANL/xgt/wls_day-%02d_1v.csv" % int(i) for i in range(4,5)]
#urls = ["xgtd://wls_day-%02d_1v.csv" % int(i) for i in range(4,5)]
events.load(urls)
print_data_summary()

Devices (vertices): 10170
Netflow (edges): 0
Host event (edges): 16402438
CPU times: user 51.2 ms, sys: 15.3 ms, total: 66.5 ms
Wall time: 31.1 s
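
Note that `range(4,5)` produces only the value 4, so the cell above loads a single day of host events. The sketch below shows one way to build the same list of URLs for an inclusive range of days. The helper `day_urls` and the constant `BASE_URL` are names introduced here for illustration; whether files for days other than those loaded in this notebook exist at that location is an assumption you would need to check.

```python
BASE_URL = "http://datasets.trovares.com/LANL/xgt"

def day_urls(pattern: str, first_day: int, last_day: int) -> list:
    """Build one URL per day, from first_day through last_day inclusive."""
    return [
        f"{BASE_URL}/{pattern.format(day=day)}"
        for day in range(first_day, last_day + 1)
    ]

# Same single-day list as the cell above: day 4 only.
event_urls = day_urls("wls_day-{day:02d}_1v.csv", 4, 4)
print(event_urls)
```

Making the range inclusive avoids the off-by-one surprise of `range`, where the end value itself is never produced.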

Load the netflow data:

%%time
urls = ["http://datasets.trovares.com/LANL/xgt/nf_day-%02d.csv" % int(i) for i in range(4,5)]
#urls = ["xgtd://nf_day-%02d.csv" % int(i) for i in range(4,5)]
netflow.load(urls)
print_data_summary()

Devices (vertices): 157949
Netflow (edges): 222323503
Host event (edges): 16402438
CPU times: user 137 ms, sys: 40.8 ms, total: 178 ms
Wall time: 4min 25s
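
The two timings above can be turned into rough ingest rates. This is simple arithmetic on the numbers printed by the load cells, not a benchmark: the wall time includes fetching the CSV files over HTTP, and the function name is made up for this example.

```python
def edges_per_second(edges_loaded: int, wall_seconds: float) -> float:
    """Average number of edges ingested per second of wall time."""
    return edges_loaded / wall_seconds

# Edge counts and wall times copied from the load cells above.
event_rate = edges_per_second(16_402_438, 31.1)
netflow_rate = edges_per_second(222_323_503, 4 * 60 + 25)

print(f"Host events: {event_rate:,.0f} edges/s")
print(f"Netflow:     {netflow_rate:,.0f} edges/s")
```

Both rates come out at several hundred thousand edges per second.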

Querying our graph

We'll be looking for a mock pattern, similar to one that might be used to detect botnet behavior. The pattern reflects an infected host (A) that connects to a botnet command-and-control node (B), which in turn has an exfiltration connection to a collection node (C).

Some device A boots up and, within a short amount of time, starts up a program.
Shortly afterwards, device A sends a message to some other device B.
Device B has a long-standing connection to another device C, which has been open for at least an hour, started before A booted, and remained open after A sent a message to B.
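
The timing side of this pattern can be written as a plain-Python predicate over one candidate set of events. This is a sketch for intuition only, not how xGT evaluates anything. The thresholds (4 seconds, 3600 seconds) and the comparisons mirror the WHERE clause of Query 4 later in this notebook, which compares the start of the long-standing connection against the time of the message. The class and field names are invented here, and the checks on event IDs, ports, and distinct devices are left out.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    boot_time: int        # epoch time of the boot event on A
    program_time: int     # epoch time of the program start on A
    msg_time: int         # epoch time of the short connection between A and B
    long_start: int       # epoch time at which the B-to-C connection started
    long_duration: int    # duration of the B-to-C connection, in seconds

def matches_pattern(c: Candidate, window: int = 4, min_duration: int = 3600) -> bool:
    """Check only the temporal constraints of the pattern."""
    ordered = c.boot_time <= c.program_time <= c.msg_time
    within_window = c.msg_time - c.boot_time < window
    long_lived = c.long_duration >= min_duration
    # The long connection starts before the message and is still open at that time.
    spans_message = c.long_start < c.msg_time <= c.long_start + c.long_duration
    return ordered and within_window and long_lived and spans_message

print(matches_pattern(Candidate(1000, 1001, 1002, 0, 4000)))
```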

# Query helper function
def run_query_and_count(query):
    conn.drop_frame('answers')
    conn.run_job(query)
    # Retrieve count
    table = conn.get_table_frame('answers')
    count = table.get_data()[0][0]
    return count

Query 1: Search for just a boot event

Normally, we might not know exactly which pattern to look for, and refining that pattern would be part of the analytic process. We'll start by just finding devices with boot events.

%%time
q = """
MATCH (A)-[boot:Events]->(A)
WHERE boot.eventID = 4608
RETURN COUNT(*)
INTO answers
"""
count = run_query_and_count(q)
print('Number of boot events: ' + '{:,}'.format(count))

Number of boot events: 1,712
CPU times: user 10.9 ms, sys: 4.06 ms, total: 15 ms
Wall time: 446 ms

Query 2: Boot event followed by program start event

Now we'll refine the query by looking for programs launched within 4 seconds of a boot event.

%%time
q = """
MATCH (A)-[boot:Events]->(A)-[program:Events]->(A)
WHERE boot.eventID = 4608
  AND program.eventID = 4688
  AND program.epochtime >= boot.epochtime
  AND program.epochtime-boot.epochtime < 4
RETURN COUNT(*)
INTO answers
"""

count = run_query_and_count(q)
print('Number of boot & program start events: ' + '{:,}'.format(count))

Number of boot & program start events: 253,009
CPU times: user 12.8 ms, sys: 4.08 ms, total: 16.8 ms
Wall time: 691 ms
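
The count here is much larger than the number of boot events because the query counts matching (boot, program start) pairs, and a single boot can pair with every program start that falls inside its window. The following plain-Python sketch reproduces that counting logic on made-up data; it is an illustration of the semantics, not of how xGT executes the query, and the function name and sample events are invented.

```python
from collections import defaultdict

BOOT, PROGRAM_START = 4608, 4688  # event IDs used in the queries above

def count_boot_program_pairs(events, window=4):
    """Count (boot, program start) pairs on the same device.

    events: iterable of (device, epochtime, event_id) tuples.
    A pair matches when 0 <= program_time - boot_time < window.
    """
    boots = defaultdict(list)
    starts = defaultdict(list)
    for device, t, event_id in events:
        if event_id == BOOT:
            boots[device].append(t)
        elif event_id == PROGRAM_START:
            starts[device].append(t)
    return sum(
        1
        for device, boot_times in boots.items()
        for b in boot_times
        for p in starts.get(device, ())
        if 0 <= p - b < window
    )

# Made-up events: one boot on d1 followed by four program starts,
# plus a program start on d2 that has no boot to pair with.
sample_events = [
    ("d1", 100, BOOT),
    ("d1", 100, PROGRAM_START),
    ("d1", 101, PROGRAM_START),
    ("d1", 103, PROGRAM_START),
    ("d1", 104, PROGRAM_START),  # exactly 4 seconds later, outside the window
    ("d2", 100, PROGRAM_START),
]
pair_count = count_boot_program_pairs(sample_events)
print(pair_count)
```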

Query 3: Boot, program start, and connection

Now we add the network connection edge between A and B, with source port 3128 (in our mock example, this is a signature port for the bot in question). We'll restrict the match so that all three pieces (boot, program start, and message) happen within 4 seconds.

%%time
q = """
MATCH (B)-[nf1:Netflow]->(A)-[boot:Events]->(A)-[program:Events]->(A)
WHERE A <> B
  AND nf1.srcPort = 3128
  AND boot.eventID = 4608
  AND program.eventID = 4688
  AND program.epochtime >= boot.epochtime
  AND nf1.epochtime >= program.epochtime
  AND nf1.epochtime-boot.epochtime < 4
RETURN COUNT(*)
INTO answers
"""
# Note the overall time limit on the sequence of the three events

count = run_query_and_count(q)
print('Number of boot, program start, & nf1 events: ' + '{:,}'.format(count))

Number of boot, program start, & nf1 events: 109
CPU times: user 19.8 ms, sys: 6.28 ms, total: 26.1 ms
Wall time: 3.35 s

Query 4: Full zombie reboot pattern

Finally, we add in the last network connection and match the full pattern.

%%time
q = """
MATCH (B)-[nf1]->(A)-[boot:Events]->(A)-[program:Events]->(A),
      (B)-[nf2]->(C)
WHERE A <> B AND B <> C AND A <> C
  AND nf1.srcPort = 3128
  AND boot.eventID = 4608
  AND program.eventID = 4688
  AND program.epochtime >= boot.epochtime
  AND nf1.epochtime >= program.epochtime
  AND nf1.epochtime-boot.epochtime < 4
  AND nf2.duration >= 3600
  AND nf2.epochtime < nf1.epochtime
  AND nf2.epochtime+nf2.duration >= nf1.epochtime
RETURN COUNT(*)
INTO answers
"""

count = run_query_and_count(q)
print('Number of zombie reboot events: ' + '{:,}'.format(count))

Number of zombie reboot events: 981
CPU times: user 34.2 ms, sys: 10.5 ms, total: 44.8 ms
Wall time: 14.4 s
