hist_np

Calculates histograms using multiple threads (requires numpy, much faster, but uses more memory).

usage: python3 -m flow_models.hist_np [-h] [-i {binary}]
                                      [-o {csv_hist,append,none}] [-O OUTPUT]
                                      [--skip-in SKIP_IN]
                                      [--count-in COUNT_IN]
                                      [--skip-out SKIP_OUT]
                                      [--count-out COUNT_OUT]
                                      [--filter-expr FILTER_EXPR] [-b BIN_EXP]
                                      [-x {length,size}]
                                      [-c ADDITIONAL_COLUMNS]
                                      [--measure-memory]
                                      in_files [in_files ...]

Positional Arguments

in_files: input files or directories

Named Arguments

-i, --in-format

Possible choices: binary

format of input files

Default: 'binary'

-o, --out-format

Possible choices: csv_hist, append, none

format of output

Default: 'csv_hist'

-O, --output

file or directory for output

Default: '-'

--skip-in

number of flows to skip at the beginning of input

Default: 0

--count-in

limit for number of flows to read from input

--skip-out

number of flows to skip after filtering

Default: 0

--count-out

limit for number of flows to output after filtering

--filter-expr

expression of filter

-b, --bin-exp

bin width exponent of 2

Default: 0

-x, --x-value

Possible choices: length, size

x axis value

Default: 'length'

-c, --additional-columns

additional column to sum

Default: []

--measure-memory

collect and print memory statistics

Default: False

Use this tool to calculate a histogram of flow features.

The output is a histogram of a selected feature in csv_hist format.

The feature is selected with the -x parameter. The -b parameter can additionally be specified to bin the histogram logarithmically, which helps reduce its size.
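
As a rough illustration of how a power-of-two exponent can drive logarithmic binning, the sketch below keeps unit-width bins below 2**b and splits each power-of-two range above that into 2**b equal bins. The scheme and the function name log_bin are assumptions made for illustration; they are not taken from the flow_models source, and the tool's actual bin edges may differ.

```python
def log_bin(x: int, b: int) -> tuple[int, int]:
    """Return the half-open bin [lo, hi) containing x (hypothetical scheme)."""
    if x < 2 ** b:
        # Small values keep full resolution: one bin per integer.
        return x, x + 1
    # x lies in [2**k, 2**(k+1)) with k = bit_length - 1;
    # that range is split into 2**b bins of equal width.
    width = 2 ** (x.bit_length() - 1 - b)
    lo = (x // width) * width
    return lo, lo + width


print(log_bin(3, 2))     # (3, 4)
print(log_bin(9, 2))     # (8, 10)
print(log_bin(1000, 2))  # (896, 1024)
```

With this scheme the number of bins grows with the logarithm of the largest value rather than with the value itself, which is the size reduction the -b parameter aims for.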

To filter flow records, specify a filter expression with --filter-expr. The expression uses Python syntax, but the bitwise operators (&, |, ~) must be used instead of the logical ones (and, or, not). The following fields are available:

af, prot, inif, outif, sa0, sa1, sa2, sa3, da0, da1, da2, da3, sp, dp, first, first_ms, last, last_ms, packets, octets, aggs
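
The following sketch shows why bitwise operators are needed when an expression is evaluated against whole NumPy columns at once. The sample values and the use of eval here are assumptions for illustration only; how hist_np evaluates the expression internally is not stated in this document.

```python
import numpy as np

# Hypothetical sample columns, named after fields from the list above.
fields = {
    "prot": np.array([6, 17, 6, 1]),
    "dp": np.array([443, 53, 80, 0]),
    "packets": np.array([10, 1, 250, 2]),
}

# Each comparison must be parenthesized because & and | bind
# more tightly than == and > in Python.
filter_expr = "(prot == 6) & ((dp == 443) | (packets > 100))"

# Evaluate element-wise; the result is one boolean per flow record.
mask = eval(filter_expr, {"__builtins__": {}}, fields)
print(mask.tolist())  # [True, False, True, False]
```

Writing the same expression with and/or would raise a ValueError, because Python cannot reduce a multi-element array to a single truth value.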

Flow records can be skipped with the --skip-in and --count-in parameters, which specify how many flow records to skip (--skip-in) and then how many to read (--count-in) from the input.
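
The skip and count semantics can be pictured as slicing a stream of records, first on input and then, for --skip-out and --count-out, again after filtering. The helper name window and the toy records below are hypothetical; this is a sketch of the described behavior, not the tool's code.

```python
from itertools import islice


def window(records, skip=0, count=None):
    """Skip `skip` records, then yield at most `count` (all if None)."""
    stop = None if count is None else skip + count
    return islice(records, skip, stop)


flows = range(10)                           # stand-in for flow records
read = window(flows, skip=2, count=5)       # like --skip-in 2 --count-in 5
kept = (f for f in read if f % 2 == 0)      # stand-in for a filter expression
out = list(window(kept, skip=1, count=1))   # like --skip-out 1 --count-out 1
print(out)  # [4]
```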

Example (calculates a logarithmically binned histogram of flow length from the sorted directory):

python3 -m flow_models.hist_np -i binary -x length -b 12 sorted

Fitting of mixture models does not have to be performed on complete flow records. Instead, it can be performed on histograms, calculated by binning flow records into buckets according to the selected parameter (e.g. flow length or size). Histogram files can also be easily published as they are many orders of magnitude smaller and, unlike flow records, do not contain private information such as IP addresses.

The tool takes flow records in any supported format as input and outputs a histogram file in CSV format. The user specifies the parameter to be binned (flow length, size, duration or rate) and any additional columns to be summed in the histogram (packets and octets are counted by default; rate and duration can be added). The user can also specify a power-of-two exponent that defines the starting point for logarithmic binning. Logarithmic binning significantly reduces the size of histogram files without noticeably affecting the quality of the fitting process.
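
To make the binning-and-summing idea concrete, here is a minimal NumPy sketch that counts flows per flow length and sums packets and octets within each bin. The sample data and the variable names flows_sum, packets_sum and octets_sum are assumptions for illustration; the actual column layout of the csv_hist format is not specified here.

```python
import numpy as np

# Hypothetical flow records: flow length in packets and size in octets.
length = np.array([1, 2, 2, 5, 2, 1])
octets = np.array([60, 120, 3000, 400, 90, 40])

# Index i of each result corresponds to flow length i (unit-width bins).
flows_sum = np.bincount(length)                    # number of flows per bin
packets_sum = np.bincount(length, weights=length)  # packets per bin
octets_sum = np.bincount(length, weights=octets)   # octets per bin

for x in np.nonzero(flows_sum)[0]:
    print(x, flows_sum[x], packets_sum[x], octets_sum[x])
```

Note that np.bincount returns float64 sums when weights are given, which is one place where double-precision rounding can enter a NumPy-based implementation.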

Two implementations of the tool are available: hist and hist_np. The former is a pure Python implementation that takes advantage of Python's arbitrary-precision integers to perform more accurate calculations. The latter uses the NumPy package to perform binning, which can utilize SIMD instructions and multiple threads and is many orders of magnitude faster, but it requires more memory and can introduce rounding errors because it operates on doubles, which have limited precision.
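
The precision trade-off can be seen with plain Python numbers. The sketch below is only an illustration of integer versus double arithmetic; the value 2**53 is chosen because it is the point beyond which doubles can no longer represent every integer, not because it reflects any particular dataset.

```python
LIMIT = 2 ** 53  # above this, not every integer fits in a double

# Python integers are arbitrary-precision, so the sum is exact.
int_sum = LIMIT + 1

# In double precision, 2**53 + 1 is not representable and rounds back to 2**53.
float_sum = float(LIMIT) + 1.0

print(int_sum - LIMIT)           # 1
print(float_sum - float(LIMIT))  # 0.0
```

Sums of octets over very large traces are the kind of quantity that could approach this range, which is when the pure Python hist implementation would give the more accurate result.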