merge

Merges flows which were split across multiple records due to active timeout.

usage: python3 -m flow_models.merge [-h] [-i {csv_flow,pipe,nfcapd,binary}]
                                    [-o {csv_flow,binary,append,extend,none}]
                                    [-O OUTPUT] [--skip-in SKIP_IN]
                                    [--count-in COUNT_IN]
                                    [--skip-out SKIP_OUT]
                                    [--count-out COUNT_OUT]
                                    [--filter-expr FILTER_EXPR]
                                    [-I INACTIVE_TIMEOUT] [-A ACTIVE_TIMEOUT]
                                    in_files [in_files ...]

Positional Arguments

in_files: input files or directories

Named Arguments

-i, --in-format

Possible choices: csv_flow, pipe, nfcapd, binary

format of input files

Default: 'nfcapd'

-o, --out-format

Possible choices: csv_flow, binary, append, extend, none

format of output

Default: 'csv_flow'

-O, --output

file or directory for output

Default: '-'

--skip-in

number of flows to skip at the beginning of input

Default: 0

--count-in

limit for number of flows to read from input

--skip-out

number of flows to skip after filtering

Default: 0

--count-out

limit for number of flows to output after filtering

--filter-expr

expression of filter

-I, --inactive-timeout

inactive timeout in seconds

Default: 15.0

-A, --active-timeout

active timeout in seconds

Default: 300.0

This tool can be used to merge flow records which were split during the collection into multiple records due to active timeout.

To merge flow records correctly, specify the same active and inactive timeout values that were used when the records were collected.

To filter flow records, specify a filter expression. Filter expressions use Python syntax, but the bitwise operators (&, |, ~) must be used instead of the logical ones (and, or, not). The following fields are available:

af, prot, inif, outif, sa0, sa1, sa2, sa3, da0, da1, da2, da3, sp, dp, first, first_ms, last, last_ms, packets, octets, aggs
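The sketch below illustrates why bitwise operators are needed. It assumes the expression is evaluated once over whole columns of field values rather than per record. The `Column` class and the sample data are hypothetical stand-ins written for this example; they are not part of flow_models.

```python
class Column:
    """Hypothetical stand-in for a vector holding one field of many records."""

    def __init__(self, values):
        self.values = list(values)

    def __eq__(self, other):
        return Column(v == other for v in self.values)

    def __gt__(self, other):
        return Column(v > other for v in self.values)

    # Bitwise operators can be overloaded to work element-wise;
    # the keywords `and`, `or`, `not` cannot be overloaded in Python.
    def __and__(self, other):
        return Column(a and b for a, b in zip(self.values, other.values))

    def __or__(self, other):
        return Column(a or b for a, b in zip(self.values, other.values))

    def __invert__(self):
        return Column(not v for v in self.values)


# Three made-up flow records, stored column by column.
fields = {
    "prot": Column([6, 17, 6]),
    "dp": Column([443, 53, 80]),
    "packets": Column([10, 1, 3]),
}

# Each comparison is parenthesised because & and | bind tighter than ==.
expr_a = "(prot == 17) | ((prot == 6) & (dp == 443))"
expr_b = "~(prot == 17) & (packets > 5)"

mask_a = eval(expr_a, {"__builtins__": {}}, fields).values
mask_b = eval(expr_b, {"__builtins__": {}}, fields).values
```

Here `mask_a` selects the first two records and `mask_b` only the first one. Writing `prot == 6 and dp == 443` instead would ask Python for a single truth value of a whole column, which is not what a per-record filter needs.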

Flow records can be skipped and limited with the skip_in, count_in, skip_out and count_out parameters. On input, skip_in records are skipped and then at most count_in records are read. After filtering, skip_out records are skipped and then at most count_out records are written.
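The order in which these four parameters apply can be sketched with `itertools.islice`. The `select` function and its `predicate` argument are hypothetical names used only for this illustration; the sketch mirrors the description above, not the tool's internal code.

```python
from itertools import islice


def select(records, predicate, skip_in=0, count_in=None, skip_out=0, count_out=None):
    """Apply skip/count on input, then the filter, then skip/count on output."""
    # Input stage: drop skip_in records, then read at most count_in records.
    stop_in = None if count_in is None else skip_in + count_in
    read = islice(records, skip_in, stop_in)

    # Filtering stage: keep only records accepted by the predicate.
    kept = filter(predicate, read)

    # Output stage: drop skip_out records, then emit at most count_out records.
    stop_out = None if count_out is None else skip_out + count_out
    return list(islice(kept, skip_out, stop_out))


# Integers stand in for flow records; the filter keeps even values.
result = select(range(20), lambda r: r % 2 == 0,
                skip_in=3, count_in=10, skip_out=1, count_out=3)
```

With these values, records 3 to 12 are read, the even ones (4, 6, 8, 10, 12) pass the filter, the first of those is skipped, and the next three are written, so `result` is `[6, 8, 10]`.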

Example (merges flows from the cleaned directory and writes the output to the merged directory):

python3 -m flow_models.merge -i nfcapd -o binary -I 15 -A 300 -O merged cleaned

In all hardware and many software exporters, long-lasting flows may be split by the active timeout and reported as multiple flow records. Such records have to be found and merged back in order to obtain accurate flow length, size and duration values. The merge tool available in our framework can be used for that purpose. Additionally, it filters out erroneously split records. The tool processes all flow records sequentially and performs all calculations using only integers to ensure precision and reproducibility. This is possible thanks to Python's arbitrary-precision integers.

The tool takes flow records in any supported format as input and outputs merged flow records in binary or CSV format. Each merged flow record contains an aggs field, which tells how many flow records were merged back into that particular aggregate flow record. To ensure that the merge is correct, the user should specify both the active and the inactive timeout used in the collection process when calling the command.
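The idea of merging can be sketched as follows. This is a deliberately simplified illustration, not the tool's actual algorithm: it assumes records arrive sorted by start time, joins two records with the same key when the gap between them is shorter than the inactive timeout, and leaves out the active timeout checks and the filtering of erroneously split records. `Record`, `to_ms` and `merge_records` are hypothetical names, and treating first_ms and last_ms as the millisecond part of the timestamp is an assumption.

```python
from typing import NamedTuple


class Record(NamedTuple):
    key: tuple     # hypothetical flow key, e.g. (prot, sa, da, sp, dp)
    start_ms: int  # start time in integer milliseconds
    end_ms: int    # end time in integer milliseconds
    packets: int
    octets: int
    aggs: int = 1  # number of original records in this aggregate


def to_ms(seconds, millis):
    """Combine seconds and a millisecond part (assumed layout) into an int."""
    return seconds * 1000 + millis


def merge_records(records, inactive_timeout_ms):
    """Merge same-key records separated by less than the inactive timeout."""
    merged = []
    open_flows = {}  # key -> aggregate that may still be extended
    for rec in records:  # assumed sorted by start_ms
        prev = open_flows.get(rec.key)
        if prev is not None and rec.start_ms - prev.end_ms < inactive_timeout_ms:
            # Continuation of the same flow: extend the aggregate.
            open_flows[rec.key] = prev._replace(
                end_ms=max(prev.end_ms, rec.end_ms),
                packets=prev.packets + rec.packets,
                octets=prev.octets + rec.octets,
                aggs=prev.aggs + rec.aggs,
            )
        else:
            # Gap too long (or first record for this key): start a new flow.
            if prev is not None:
                merged.append(prev)
            open_flows[rec.key] = rec
    merged.extend(open_flows.values())
    return merged


key = (6, "10.0.0.1", "10.0.0.2", 40000, 443)
records = [
    Record(key, to_ms(0, 0), to_ms(300, 0), 100, 150000),
    Record(key, to_ms(300, 1), to_ms(600, 0), 80, 120000),  # 1 ms after the first
    Record(key, to_ms(700, 0), to_ms(710, 0), 5, 500),      # 100 s later
]
flows = merge_records(records, inactive_timeout_ms=15 * 1000)
```

With a 15 s inactive timeout, the first two records are joined into one aggregate with aggs equal to 2, while the third record starts a separate flow. All timestamps and counters stay integers throughout, so the results do not depend on floating-point rounding.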