File formats

The framework currently supports the following flow record formats:

  • pipe – nfdump pipe format

  • nfcapd – nfdump binary format

  • csv_flow – comma-separated values text format (see below)

  • binary – separate binary array file for each field (see below)

Additionally, the framework currently supports the following formats for derived data (histograms and time series):

  • csv_hist – comma-separated values flow histogram text format (see below)

  • csv_series – line-separated time series of packets and bytes (see below)

csv_flow

The file contains the following fields:

af, prot, inif, outif, sa0, sa1, sa2, sa3, da0, da1, da2, da3, sp, dp, first, first_ms, last, last_ms, packets, octets, aggs
  • af – address family

  • prot – IP protocol number

  • inif – input interface number

  • outif – output interface number

  • sa0 - sa3 – consecutive 32-bit words forming source IP address

  • da0 - da3 – consecutive 32-bit words forming destination IP address

  • sp – source transport layer port

  • dp – destination transport layer port

  • first – timestamp of first packet (seconds component)

  • first_ms – timestamp of first packet (milliseconds component)

  • last – timestamp of last packet (seconds component)

  • last_ms – timestamp of last packet (milliseconds component)

  • packets – number of packets (flow length)

  • octets – number of octets (bytes) (flow size)

  • aggs – number of aggregated flow records forming this record
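
As a rough illustration of the field list above, the following sketch parses csv_flow lines into dictionaries and derives a flow duration from the split second/millisecond timestamps. It assumes every field is written as a decimal integer and that a header line, if present, can be skipped; neither is stated above. The function names and the sample record are made up for the example.

```python
import csv
import io

# Field order as listed above.
FIELDS = ["af", "prot", "inif", "outif",
          "sa0", "sa1", "sa2", "sa3",
          "da0", "da1", "da2", "da3",
          "sp", "dp", "first", "first_ms", "last", "last_ms",
          "packets", "octets", "aggs"]


def parse_csv_flow(text):
    """Yield one dict of integers per csv_flow record (hypothetical helper)."""
    for row in csv.reader(io.StringIO(text)):
        # Assumption: anything that does not start with a number is a
        # header or blank line and can be skipped.
        if not row or not row[0].strip().isdigit():
            continue
        yield dict(zip(FIELDS, (int(value) for value in row)))


def duration_ms(rec):
    """Flow duration in milliseconds from the seconds and milliseconds parts."""
    return (rec["last"] - rec["first"]) * 1000 + (rec["last_ms"] - rec["first_ms"])


# Made-up record, for illustration only.
sample = ("2, 6, 1, 2, 0, 0, 0, 3232235777, 0, 0, 0, 3232235778, "
          "51234, 443, 1420070400, 900, 1420070402, 150, 10, 5200, 1\n")

records = list(parse_csv_flow(sample))
print(duration_ms(records[0]))  # 1250
```

Note that the milliseconds difference can be negative (here 150 - 900), which is why both components have to be combined before interpreting the result.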

binary

The binary file format is used as an efficient internal on-disk format for exchanging data between the tools included in the framework. Each flow trace is a directory that contains several binary files. Each binary file stores one field as an array of binary values of a specified type.

Files are named according to the following scheme:

{field_name}.{dtype}

The dtype suffix specifies the type of the binary values stored in the file, using the type codes of Python's array module:

Type code    C type
b            signed char
B            unsigned char
h            signed short
H            unsigned short
i            signed int
I            unsigned int
l            signed long
L            unsigned long
q            signed long long
Q            unsigned long long
f            float
d            double
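
A minimal sketch of how the naming scheme and the type codes fit together: the suffix is split off the filename and used to look up the item size, from which the number of records in a file follows. It uses Python's standard array module; the helper names are hypothetical, and native byte order is an assumption, since the byte order is not specified above.

```python
import array
import os
import tempfile


def split_field_filename(filename):
    """Split '{field_name}.{dtype}' into its two parts."""
    name, dtype = filename.rsplit(".", 1)
    return name, dtype


def count_records(path):
    """Number of values in a field file, derived from file size and item size."""
    _, dtype = split_field_filename(os.path.basename(path))
    itemsize = array.array(dtype).itemsize  # size in bytes on this platform
    size = os.path.getsize(path)
    if size % itemsize:
        raise ValueError(f"{path}: size is not a multiple of {itemsize} bytes")
    return size // itemsize


# Write a small made-up field file and count its records.
with tempfile.TemporaryDirectory() as trace_dir:
    path = os.path.join(trace_dir, "sp.H")
    with open(path, "wb") as f:
        array.array("H", [80, 443, 53]).tofile(f)
    n = count_records(path)

print(split_field_filename("first_ms.H"))  # ('first_ms', 'H')
print(n)  # 3
```

The item size of some codes (notably l and L) depends on the platform, so files using them may not be portable between systems.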

Such a storage schema has several advantages:

  • fields can be distributed independently (for example one can share flow records without sa* and da* address fields for privacy reasons)

  • fields can be compressed/uncompressed selectively (important when processing data that barely fits on disk)

  • additional or custom fields can be trivially added or removed

  • any field can be stored using any value type (signedness, precision)

  • files can be memory-mapped as numerical arrays (unlike IPFIX, nfcapd or any other structured/TLV format)

  • the format is simple enough that files can be memory-mapped by virtually any big data processing software, which also makes it future-proof

  • memory-mapping is IO and cache efficient (columnar memory layout allows applications to avoid unnecessary IO and accelerate analytical processing performance on modern CPUs and GPUs)

Example:

agh_2015/
└── day-01
    ├── af.B             ─┐
    ├── da0.I             │
    ├── da1.I             │
    ├── da2.I             │
    ├── da3.I             │
    ├── dp.H              │
    ├── inif.H            │ key
    ├── outif.H           │ fields
    ├── prot.B            │
    ├── sa0.I             │
    ├── sa1.I             │
    ├── sa2.I             │
    ├── sa3.I             │
    ├── sp.H             ─┘
    ├── first.I          ─┐
    ├── first_ms.H        │
    ├── last.I            │ value
    ├── last_ms.H         │ fields
    ├── octets.Q          │
    └── packets.Q        ─┘
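
A directory laid out like the one above could be opened along the following lines. This is a sketch rather than the framework's own loader: it relies on numpy accepting the single-character type codes as dtype strings and on the files being in native byte order (an assumption). The open_trace helper and the sample values are made up for the example.

```python
import os
import tempfile

import numpy as np


def open_trace(directory):
    """Memory-map every {field_name}.{dtype} file in a trace directory (read-only)."""
    fields = {}
    for filename in sorted(os.listdir(directory)):
        name, dtype = filename.rsplit(".", 1)
        path = os.path.join(directory, filename)
        # Nothing is read here; pages are loaded lazily when accessed.
        fields[name] = np.memmap(path, dtype=dtype, mode="r")
    return fields


with tempfile.TemporaryDirectory() as trace_dir:
    # Two made-up value fields with three flow records each.
    np.array([1, 10, 100], dtype="Q").tofile(os.path.join(trace_dir, "packets.Q"))
    np.array([40, 5200, 150000], dtype="Q").tofile(os.path.join(trace_dir, "octets.Q"))

    trace = open_trace(trace_dir)
    total_packets = int(trace["packets"].sum())
    total_octets = int(trace["octets"].sum())
    del trace  # release the mappings before the directory is removed

print(total_packets, total_octets)  # 111 155240
```

Because each field is a separate array, a computation like the one above touches only the packets and octets files; the address and port files are never read.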

csv_hist

The file contains the following fields:

bin_lo, bin_hi, flows_sum, packets_sum, octets_sum, duration_sum, rate_sum, aggs_sum
  • bin_lo – lower edge of a bin (inclusive)

  • bin_hi – upper edge of a bin (exclusive)

  • flows_sum – number of flows within a particular bin

  • packets_sum – sum of packets of all flows within a bin

  • octets_sum – sum of octets of all flows within a bin

  • duration_sum – sum of duration of all flows within a bin (in milliseconds)

  • rate_sum – sum of rates of all flows within a bin (in bps)

  • aggs_sum – sum of aggregated flows of all flows within a bin
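
To show how these columns can be used, the sketch below computes an empirical CDF of flows from a csv_hist file. It assumes the file starts with a header row carrying the field names listed above, which is not stated explicitly; the sample rows are invented for illustration.

```python
import csv
import io

# Made-up histogram with three bins, for illustration only.
sample = """bin_lo,bin_hi,flows_sum,packets_sum,octets_sum,duration_sum,rate_sum,aggs_sum
1,2,50,50,3000,0,0,50
2,3,30,60,6000,1500,90000,30
3,4,20,60,9000,4000,120000,20
"""


def flows_cdf(text):
    """Return (bin_hi, fraction of flows below bin_hi) for each bin."""
    rows = list(csv.DictReader(io.StringIO(text), skipinitialspace=True))
    total = sum(float(row["flows_sum"]) for row in rows)
    cdf, running = [], 0.0
    for row in rows:
        running += float(row["flows_sum"])
        # bin_hi is exclusive, so this is the share of flows strictly below it.
        cdf.append((int(row["bin_hi"]), running / total))
    return cdf


print(flows_cdf(sample))  # [(2, 0.5), (3, 0.8), (4, 1.0)]
```

Replacing flows_sum with packets_sum or octets_sum gives the corresponding CDF of packets or octets over the same bins.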

Histograms can be calculated using the hist or hist_np modules. The former is a pure Python implementation that takes advantage of Python's arbitrary-precision integers to perform more accurate calculations. The latter uses the numpy package to perform binning, which can utilise SIMD instructions and multiple threads; it is therefore many orders of magnitude faster, but it requires more memory and can introduce rounding errors because doubles have limited precision. Both tools output a CSV file that can be used directly to plot a histogram, CDF or PDF of a particular flow feature.
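
The precision trade-off can be illustrated with plain Python, whose float type is an IEEE 754 double. The snippet shows the general effect of accumulating large counters in doubles; it is not taken from hist or hist_np, and the values are chosen only to trigger the rounding.

```python
# Three made-up counter values; the first one is at the limit of exact
# integer representation in a double (2**53).
values = [2**53, 1, 1]

# Arbitrary-precision integer sum: exact.
exact = sum(values)

# The same sum accumulated in doubles: each "+ 1" is rounded away.
as_double = 0.0
for value in values:
    as_double += float(value)

print(exact)           # 9007199254740994
print(int(as_double))  # 9007199254740992
```

Sums such as octets_sum over a large trace are where this kind of error is most likely to matter.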

The framework user can specify a parameter b, the exponent of the power of two at which logarithmic binning starts. For example, b = 12 means that bin widths start increasing for values above 2^12 = 4096 (for lower values the bin width is equal to one). Values between 4096 and 8192 are therefore binned into bins of width 2, values between 8192 and 16384 into bins of width 4, and so on.
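
The binning rule described above can be sketched as a small function that returns the bin edges for a given integer value. This is one reading of the description, not the framework's implementation: it assumes the value 2^b itself already falls into the first bin of width 2, and the name bin_edges is hypothetical.

```python
def bin_edges(value, b):
    """Return (bin_lo, bin_hi) for a non-negative integer value.

    Values below 2**b get bins of width 1; from 2**b upwards the bin
    width doubles with every power of two.
    """
    if value < 0:
        raise ValueError("value must be non-negative")
    if value < 2**b:
        width = 1
    else:
        k = value.bit_length() - 1   # floor(log2(value))
        width = 2 ** (k - b + 1)     # 2 for [2**b, 2**(b+1)), 4 for the next range, ...
    bin_lo = (value // width) * width
    return bin_lo, bin_lo + width    # lower edge inclusive, upper edge exclusive


print(bin_edges(4095, 12))   # (4095, 4096)
print(bin_edges(4097, 12))   # (4096, 4098)
print(bin_edges(8195, 12))   # (8192, 8196)
```

With this rule each range between consecutive powers of two above 2^b is covered by 2^(b-1) bins, so the total number of bins grows only logarithmically with the largest value.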

csv_series

The format consists of files that give the number of flows, packets, and bytes transmitted on a specific link in each second since the beginning of the day. Each line in a file corresponds to one consecutive second, and the filenames are based on the number of days elapsed since the Unix epoch.
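
The mapping from a file and a line to a point in time can be sketched as follows. It assumes that the day number is taken directly from the filename, that days start at midnight UTC, and that the first line corresponds to second 0; none of these details is specified above, and line_timestamp is a hypothetical helper.

```python
from datetime import datetime, timezone

SECONDS_PER_DAY = 86400


def line_timestamp(day_number, line_index):
    """UTC time of a line, given days since the Unix epoch and a 0-based line index."""
    if not 0 <= line_index < SECONDS_PER_DAY:
        raise ValueError("line_index must be within one day")
    return datetime.fromtimestamp(
        day_number * SECONDS_PER_DAY + line_index, tz=timezone.utc
    )


# Day 16436 after the epoch is 2015-01-01; line 3661 is 01:01:01 on that day.
print(line_timestamp(16436, 3661).isoformat())  # 2015-01-01T01:01:01+00:00
```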