hist
Calculates histograms of flow length, size, duration or rate.
usage: python3 -m flow_models.hist [-h] [-i {csv_flow,pipe,nfcapd,binary}]
[-o {csv_hist,append,none}] [-O OUTPUT]
[--skip-in SKIP_IN] [--count-in COUNT_IN]
[--skip-out SKIP_OUT]
[--count-out COUNT_OUT]
[--filter-expr FILTER_EXPR] [-b BIN_EXP]
[-x {length,size,duration,rate}]
[-c ADDITIONAL_COLUMNS]
in_files [in_files ...]
Positional Arguments
- in_files
input files or directories
Named Arguments
- -i, --in-format
Possible choices: csv_flow, pipe, nfcapd, binary
format of input files
Default: 'nfcapd'
- -o, --out-format
Possible choices: csv_hist, append, none
format of output
Default: 'csv_hist'
- -O, --output
file or directory for output
Default: '-'
- --skip-in
number of flows to skip at the beginning of input
Default: 0
- --count-in
limit for number of flows to read from input
- --skip-out
number of flows to skip after filtering
Default: 0
- --count-out
limit for number of flows to output after filtering
- --filter-expr
expression of filter
- -b, --bin-exp
bin width exponent of 2
Default: 0
- -x, --x-value
Possible choices: length, size, duration, rate
x axis value
Default: 'length'
- -c, --additional-columns
additional column to sum
Default: []
Use this tool to calculate a histogram of a flow feature.
The output is a histogram of the selected feature in csv_hist format.
The feature is selected with the -x parameter. The -b parameter can additionally be specified to bin the histogram logarithmically, which reduces its size.
To filter flow records, specify a filter expression with --filter-expr. The expression must use Python syntax. Use the bitwise operators (&, |, ~) instead of the logical ones (and, or, not). The following fields are available:
af, prot, inif, outif, sa0, sa1, sa2, sa3, da0, da1, da2, da3, sp, dp, first, first_ms, last, last_ms, packets, octets, aggs
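As a rough illustration of the filter syntax, the sketch below evaluates an expression against a few hypothetical flow records, treating the field names as Python variables. The records and the `matches` helper are assumptions made for this example and do not show how the tool evaluates filters internally. The point it demonstrates is plain Python precedence: `&` binds more tightly than `==`, so each comparison needs its own parentheses.

```python
# Hypothetical flow records that use field names from the list above.
flows = [
    {"prot": 6, "sp": 51000, "dp": 443, "packets": 12, "octets": 9000},
    {"prot": 17, "sp": 53, "dp": 40000, "packets": 1, "octets": 120},
    {"prot": 6, "sp": 52000, "dp": 80, "packets": 3, "octets": 400},
]


def matches(expr, flow):
    """Evaluate expr with the flow's fields as variables (illustration only)."""
    return bool(eval(expr, {"__builtins__": {}}, flow))


# Parenthesized comparisons combined with the bitwise operator.
tcp_https = "(prot == 6) & (dp == 443)"
selected = [f for f in flows if matches(tcp_https, f)]

# Without parentheses Python parses this as prot == (6 & dp) == 443,
# which is a different condition and matches none of the records here.
unparenthesized = "prot == 6 & dp == 443"
wrong = [f for f in flows if matches(unparenthesized, f)]

print(len(selected), len(wrong))
```

On the command line, such an expression would be passed as the value of --filter-expr, quoted so that the shell does not interpret `&` and the parentheses.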
Flow records can be skipped with the skip_in, count_in, skip_out and count_out parameters (--skip-in, --count-in, --skip-out, --count-out). They specify how many flow records are skipped (skip_in) and then read (count_in) from the input, and how many are skipped (skip_out) and then written (count_out) after filtering.
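The order in which the four parameters apply can be sketched with `itertools.islice`. The `select` function below is a hypothetical helper written for this illustration, not the tool's code; it assumes that an unset count means "no limit".

```python
from itertools import islice


def select(flows, predicate, skip_in=0, count_in=None, skip_out=0, count_out=None):
    """Apply input skip/count, then the filter, then output skip/count."""
    stop_in = None if count_in is None else skip_in + count_in
    read = islice(flows, skip_in, stop_in)       # skip_in, then count_in
    kept = filter(predicate, read)               # filtering step
    stop_out = None if count_out is None else skip_out + count_out
    return islice(kept, skip_out, stop_out)      # skip_out, then count_out


# Integers stand in for flow records; the filter keeps even numbers.
result = list(select(range(20), lambda n: n % 2 == 0,
                     skip_in=2, count_in=10, skip_out=1, count_out=3))
print(result)  # [4, 6, 8]
```

Here records 2 to 11 are read, the even ones (2, 4, 6, 8, 10) pass the filter, the first of those is skipped and the next three are written.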
Example: (calculates a logarithmically binned histogram of flow length from flow records in the sorted directory)
flow_models.hist -i binary -x length -b 12 sorted
Fitting of mixture models does not have to be performed on complete flow records. Instead, it can be performed on histograms, calculated by binning flow records into buckets according to the selected parameter (e.g. flow length or size). Histogram files can also be easily published as they are many orders of magnitude smaller and, unlike flow records, do not contain private information such as IP addresses.
The tool takes flow records in any supported format as input and outputs a histogram file in CSV format. The user should specify the parameter to be binned (flow length, size, duration or rate) and any additional columns to be summed in the histogram (by default packets and octets are counted; rate and duration can be added). The user can also specify a power-of-two exponent that defines the starting point for logarithmic binning. Logarithmic binning significantly reduces the size of histogram files without noticeably affecting the quality of the fitting process.
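The idea of logarithmic binning with a power-of-two starting point can be sketched as follows. This is one possible scheme chosen for illustration, and the tool's exact bin edges may differ: values below 2**bin_exp keep unit-width bins, and above that point the bin width doubles with every power of two, so each octave holds 2**bin_exp bins. The function names and the row keys (`flows`, `packets`, `octets`) are hypothetical.

```python
from collections import defaultdict


def log_bin_lower_edge(value, bin_exp):
    """Return the lower edge of the bin holding a non-negative integer."""
    if value < 2 ** bin_exp:
        return value                      # unit-width bins below the start
    # Drop low-order bits so that only bin_exp bits follow the leading one.
    shift = value.bit_length() - 1 - bin_exp
    return (value >> shift) << shift


def histogram(flows, bin_exp):
    """Bin (packets, octets) pairs by flow length and sum the columns."""
    hist = defaultdict(lambda: {"flows": 0, "packets": 0, "octets": 0})
    for packets, octets in flows:
        row = hist[log_bin_lower_edge(packets, bin_exp)]
        row["flows"] += 1
        row["packets"] += packets
        row["octets"] += octets
    return dict(sorted(hist.items()))


sample = [(1, 60), (3, 200), (9, 1000), (8, 900), (15, 4000), (19, 9000)]
hist = histogram(sample, bin_exp=2)
for edge, row in hist.items():
    print(edge, row)
```

With bin_exp=2, lengths 8 and 9 share the bin starting at 8, while 15 falls into the bin starting at 14 and 19 into the bin starting at 16. The number of bins grows with the logarithm of the largest value rather than with the value itself.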
Two implementations of the tool are available: hist and hist_np. The former is a pure Python implementation that takes advantage of Python's arbitrary-precision integers to perform more accurate calculations. The latter uses the NumPy package to perform binning, which can utilize SIMD instructions and multiple threads and is many orders of magnitude faster, but it requires more memory and can introduce rounding errors because it operates on doubles, which have limited precision.
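The precision difference between the two approaches can be seen with the standard library alone. The sketch below is not taken from either implementation; it only contrasts summing Python integers with summing the same values as doubles, using made-up counter values just above 2**53, the point beyond which a double can no longer represent every integer.

```python
# Hypothetical per-bin octet counts; the first one is deliberately huge.
octets = [2 ** 53, 1, 1]

# Python integers have arbitrary precision, so the sum is exact.
exact = sum(octets)

# A double has a 53-bit significand, so each added 1 is rounded away.
approx = 0.0
for value in octets:
    approx += float(value)

print(exact)                 # 9007199254740994
print(int(approx))           # 9007199254740992
print(exact - int(approx))   # 2
```

Errors of this kind only appear once the summed values exceed 2**53, so whether they matter depends on the size of the counters being accumulated.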