4.1. An Introduction to Graph Analysis with xGT
4.1.1. Setting up the xGT Environment
For this demonstration, we’ll be running the xGT Server on an Amazon EC2 instance and interacting with it over a secure SSH tunnel from our laptop. More details and alternative methods for getting xGT running can be found at docs.trovares.com.
4.1.1.1. Install xgt on Our Local Machine
!pip install --quiet --upgrade xgt
4.1.1.2. Checking the xGT Server Is up
# Assume "cloud" is the name of the server platform
!ssh cloud ps -e | grep xgtd
4.1.1.3. Establishing a Secure Tunnel
# Assume "ec2-user" is a username on the server platform
!ssh -fNL 4367:localhost:4367 ec2-user@cloud
!ps -A | grep 'ssh[ ]-fNL'
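Before connecting, it can help to confirm that the forwarded port actually accepts connections. The sketch below uses only Python's standard library. `port_is_open` is a hypothetical helper name, and 4367 is the local port from the tunnel command above.

```python
import socket

def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        # create_connection performs the TCP handshake and raises OSError on failure
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# 4367 is the local end of the SSH tunnel set up above
print("tunnel reachable:", port_is_open("localhost", 4367))
```

A result of `False` only says that nothing is listening locally on that port. It does not distinguish a missing tunnel from a stopped server.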
4.1.2. Getting Started with xGT
First, let’s import the package and ensure our connection is good.
import xgt
conn = xgt.Connection()
conn.set_default_namespace('lanl')
conn
<xgt.connection.Connection at 0x11073cfd0>
4.1.3. Loading Data
We’ll be using the LANL Unified Host and Network Dataset, a set of netflow and host event data collected on an internal Los Alamos National Lab network.
Our goal will be to turn about 100GB of CSV files into a single connected graph.
4.1.3.1. Step 1: Create a Graph
Vertices: Los Alamos uses a free-form, anonymized string called a “device” as a host identifier (analogous to an IP address). We’ll use these as vertices in our graph:
[conn.drop_frame(_) for _ in ['Netflow', 'HostEvents', 'Devices']]
[True, True, True]
devices = conn.create_vertex_frame(
    name='Devices',
    schema=[['device', xgt.TEXT]],
    key='device')
devices
<xgt.graph.VertexFrame at 0x11073ca90>
Edges: The LANL dataset contains two types of data: netflow and Windows log events. Of the log events recorded, some describe events within a single host/device (e.g., reboots), and some describe authentication events that may span two devices (e.g., a login from device A to device B). We’ll call the authentication events AuthEvents and the others HostEvents. In this notebook we load only the Netflow and HostEvents data.
netflow = conn.create_edge_frame(
    name='Netflow',
    schema=[['epoch_time', xgt.INT],
            ['duration', xgt.INT],
            ['src_device', xgt.TEXT],
            ['dst_device', xgt.TEXT],
            ['protocol', xgt.INT],
            ['src_port', xgt.INT],
            ['dst_port', xgt.INT],
            ['src_packets', xgt.INT],
            ['dst_packets', xgt.INT],
            ['src_bytes', xgt.INT],
            ['dst_bytes', xgt.INT]],
    source=devices,
    target=devices,
    source_key='src_device',
    target_key='dst_device')
netflow
<xgt.graph.EdgeFrame at 0x11073c610>
host_events = conn.create_edge_frame(
    name='HostEvents',
    schema=[['epoch_time', xgt.INT],
            ['event_id', xgt.INT],
            ['log_host', xgt.TEXT],
            ['user_name', xgt.TEXT],
            ['domain_name', xgt.TEXT],
            ['logon_id', xgt.INT],
            ['process_name', xgt.TEXT],
            ['process_id', xgt.INT],
            ['parent_process_name', xgt.TEXT],
            ['parent_process_id', xgt.INT]],
    source=devices,
    target=devices,
    source_key='log_host',
    target_key='log_host')
host_events
<xgt.graph.EdgeFrame at 0x11073ccd0>
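To make the roles of `schema`, `source_key`, and `target_key` concrete, here is a small pure-Python sketch of how one CSV row could be interpreted as an edge. It is only an illustration, not xGT's actual loader: `row_to_edge` is a hypothetical helper, the sample row is made up, and the schema is trimmed to three columns, with plain `int` and `str` standing in for `xgt.INT` and `xgt.TEXT`.

```python
import csv
import io

# Trimmed stand-in for the HostEvents schema: (column name, Python type)
SCHEMA = [("epoch_time", int), ("event_id", int), ("log_host", str)]

def row_to_edge(row, schema, source_key, target_key):
    """Convert one CSV row (list of strings) into (source, target, properties)."""
    props = {name: cast(value) for (name, cast), value in zip(schema, row)}
    return props[source_key], props[target_key], props

# Made-up sample line; real data comes from the LANL files
sample = io.StringIO("86400,4608,Comp123456\n")
for row in csv.reader(sample):
    src, dst, props = row_to_edge(row, SCHEMA, "log_host", "log_host")
    # Same column for both keys, so the edge is a self-loop on one device
    print(src, "->", dst, props["event_id"])
```

Because HostEvents uses `log_host` for both keys, every host event becomes a self-loop on its device, which is why the queries later in this section match it as `(a)-[:HostEvents]->(a)`.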
4.1.3.2. Step 2: Load the Data
With the vertex and edge schemas defined, we can load data that fits those schemas directly from data files into the corresponding frames.
# Utility to pretty-print the data currently in xGT
def print_data_summary():
    print('Devices (vertices): {:,}'.format(devices.num_vertices))
    print('Netflow (edges): {:,}'.format(netflow.num_edges))
    print('Host Events (edges): {:,}'.format(host_events.num_edges))
print_data_summary()
Devices (vertices): 0
Netflow (edges): 0
Host Events (edges): 0
Load the one-sided host event data (events whose source and target are the same device):
%%time
urls = ["https://datasets.trovares.com/LANL/xgt/wls_day-02_1v.csv"]
host_events.load(urls)
print_data_summary()
Devices (vertices): 10,086
Netflow (edges): 0
Host Events (edges): 16,781,773
CPU times: user 13.9 ms, sys: 18.2 ms, total: 32.1 ms
Wall time: 51.3 s
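Note that the Devices count rose from 0 to 10,086 even though only edge data was loaded, so the load evidently created a vertex for each previously unseen endpoint key. A rough pure-Python picture of that bookkeeping follows, using made-up device names and a hypothetical `load_edges` helper (this is not how xGT is implemented):

```python
def load_edges(vertices, edges, rows):
    """Append (source, target) rows to edges, registering unseen endpoints as vertices."""
    for source, target in rows:
        vertices.add(source)
        vertices.add(target)
        edges.append((source, target))

vertices, edges = set(), []
# Self-loop host events followed by netflow between devices (made-up names)
load_edges(vertices, edges, [("CompA", "CompA"), ("CompB", "CompB"), ("CompA", "CompA")])
load_edges(vertices, edges, [("CompA", "CompC"), ("CompC", "CompB")])
print('Devices (vertices): {:,}'.format(len(vertices)))  # 3
print('Edges: {:,}'.format(len(edges)))                  # 5
```

The second call adds one new device (`CompC`) while reusing the two already present, which is the same effect seen when a later load introduces devices that did not appear in earlier data.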
Load the netflow data:
%%time
urls = ["https://datasets.trovares.com/LANL/xgt/nf_day-02.csv"]
netflow.load(urls)
print_data_summary()
Devices (vertices): 31,324
Netflow (edges): 115,949,436
Host Events (edges): 16,781,773
CPU times: user 46.3 ms, sys: 73.8 ms, total: 120 ms
Wall time: 3min 56s
4.1.4. Querying Our Graph
We’ll be looking for a mock pattern, similar to one that might be used to detect botnet behavior. The pattern reflects an infected host (a) connecting to a botnet command-and-control node (b), which in turn has an exfiltration connection to a collection node (c). The pattern has three parts:
Some device A boots up and, within a short amount of time, starts up a program.
Shortly afterwards, device A sends a message to some other device B.
Device B has a long-standing connection to another device C, which has been open for at least an hour, started before A booted, and remained open after A sent a message to B.
# Query helper function
import time
def run_query_and_count(query, parameters=None):
    # Drop any previous result frame; the query's INTO clause recreates it
    conn.drop_frame('Answers')
    start_time = time.time()
    conn.wait_for_metrics()
    wait_time = time.time() - start_time
    if wait_time > 30:
        print('Time to wait for metrics: {:3,.2f}'.format(wait_time))
    conn.run_job(query, parameters=parameters)
    # Retrieve count
    table = conn.get_frame('Answers')
    count = table.get_data()[0][0]
    return count
4.1.4.1. Query 1: Search for Just a Boot Event
Normally, we might not know exactly which pattern to look for, and refining that pattern would be part of the analytic process. We’ll start by just finding devices with boot events.
%%time
q = """
MATCH (a)-[boot:HostEvents]->(a)
WHERE boot.event_id = 4608
RETURN count(*)
INTO Answers
"""
count = run_query_and_count(q)
print('Number of boot events: ' + '{:,}'.format(count))
Number of boot events: 1,891
CPU times: user 4.23 ms, sys: 2.98 ms, total: 7.2 ms
Wall time: 454 ms
4.1.4.2. Query 2: Boot Event Followed by Program Start Event
Now we’ll refine the query by looking for programs launched within 4 seconds of a boot event.
%%time
q = """
MATCH (a)-[boot:HostEvents]->(a)-[program:HostEvents]->(a)
WHERE boot.event_id = 4608
AND program.event_id = 4688
AND program.epoch_time >= boot.epoch_time
AND program.epoch_time - boot.epoch_time < $max_time_window
RETURN count(*)
INTO Answers
"""
count = run_query_and_count(q, parameters = {'max_time_window':4})
print('Number of boot & program start events: ' + '{:,}'.format(count))
Number of boot & program start events: 547,759
CPU times: user 6.12 ms, sys: 4.09 ms, total: 10.2 ms
Wall time: 1.06 s
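For intuition, the same boot-then-program-start condition can be written as a nested loop over a handful of events. This is a toy sketch with made-up data and a hypothetical `count_boot_then_program` function. It mirrors the WHERE clause above but says nothing about how xGT evaluates the query.

```python
BOOT, PROGRAM_START = 4608, 4688  # event IDs used in the query above

def count_boot_then_program(events, max_time_window):
    """Count (boot, program) pairs on the same host within the time window.

    events is a list of (epoch_time, event_id, log_host) tuples.
    """
    count = 0
    for boot_time, boot_id, boot_host in events:
        if boot_id != BOOT:
            continue
        for prog_time, prog_id, prog_host in events:
            if (prog_id == PROGRAM_START
                    and prog_host == boot_host
                    and 0 <= prog_time - boot_time < max_time_window):
                count += 1
    return count

events = [
    (100, BOOT, "CompA"),
    (101, PROGRAM_START, "CompA"),   # 1 s after boot: counted
    (103, PROGRAM_START, "CompA"),   # 3 s after boot: counted
    (104, PROGRAM_START, "CompA"),   # 4 s after boot: outside the window
    (101, PROGRAM_START, "CompB"),   # different host: not counted
]
print(count_boot_then_program(events, max_time_window=4))  # 2
```

Because each boot can pair with several program starts, the number of matches can far exceed the number of boot events, as the counts for Query 1 and Query 2 show.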
4.1.4.3. Query 3: Boot, Program Start, and Connection
Now we add a Netflow edge between A and B with source port 3128 (in our mock example, this is a signature port for the bot in question). Note that the pattern below writes this edge as directed from b to a. We’ll restrict the match so that all three pieces (boot, program start, and message) happen within 4 seconds.
%%time
q = """
MATCH (a)-[boot:HostEvents]->(a)-[program:HostEvents]->(a)
<-[nf1:Netflow]-(b)
WHERE a <> b
AND nf1.src_port = 3128
AND boot.event_id = 4608
AND program.event_id = 4688
AND program.epoch_time >= boot.epoch_time
AND nf1.epoch_time >= program.epoch_time
AND nf1.epoch_time - boot.epoch_time < $max_time_window
RETURN count(*)
INTO Answers
"""
# Note the overall time limit on the sequence of the three events
count = run_query_and_count(q, parameters = {'max_time_window':4})
print('Number of boot, program start, & nf1 events: ' + '{:,}'.format(count))
Number of boot, program start, & nf1 events: 75
CPU times: user 7.48 ms, sys: 4.97 ms, total: 12.4 ms
Wall time: 3.97 s
4.1.4.4. Query 4: Full Zombie Reboot Pattern
Finally, we add in the last network connection and match the full pattern.
%%time
q = """
MATCH (a)-[boot:HostEvents]->(a)-[program:HostEvents]->(a)
<-[nf1:Netflow]-(b)-[nf2:Netflow]->(c)
WHERE a <> b AND b <> c AND a <> c
AND nf1.src_port = 3128
AND boot.event_id = 4608
AND program.event_id = 4688
AND program.epoch_time >= boot.epoch_time
AND nf1.epoch_time >= program.epoch_time
AND nf1.epoch_time - boot.epoch_time < $max_time_window
AND nf2.duration >= $min_session_duration
AND nf2.epoch_time < nf1.epoch_time
AND nf2.epoch_time + nf2.duration >= nf1.epoch_time
RETURN count(*)
INTO Answers
"""
count = run_query_and_count(q, parameters = {'max_time_window':4, 'min_session_duration':3600})
print('Number of zombie reboot events: ' + '{:,}'.format(count))
Number of zombie reboot events: 11,925
CPU times: user 6.35 ms, sys: 5.89 ms, total: 12.2 ms
Wall time: 6.92 s
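The three conditions on `nf2` amount to an interval test: the session from b to c must last at least `min_session_duration` seconds and must span the moment of `nf1`. A minimal sketch of that test in plain Python follows, where `session_spans` is a hypothetical helper and the numbers are made up:

```python
def session_spans(nf2_start, nf2_duration, nf1_time, min_session_duration):
    """True if the nf2 session is long enough, starts before nf1, and is still open at nf1."""
    long_enough = nf2_duration >= min_session_duration
    started_before = nf2_start < nf1_time
    still_open = nf2_start + nf2_duration >= nf1_time
    return long_enough and started_before and still_open

# Session opened at t=0 and lasted two hours; nf1 observed at t=5000
print(session_spans(0, 7200, 5000, min_session_duration=3600))   # True
# Long enough, but the session closed at t=3600, before nf1 at t=5000
print(session_spans(0, 3600, 5000, min_session_duration=3600))   # False
```

Keeping the thresholds as arguments, like the `$max_time_window` and `$min_session_duration` query parameters, makes it easy to tighten or relax the pattern without rewriting it.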