Efficient and correct time-based bandwidth monitoring

Benno <b.ohnsorg@xxxxxxxxxx> · Wed, 11 Oct 2023 18:17:42 +0200

Hi there,

I want to monitor bandwidth/throughput (on a NAT-ing IPv4 router) 
correctly in a sliding window of n minutes. Having read the wiki and the 
docs, some uncertainties remain.
A named counter could be a first approach:

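table inet filter {

  counter accept_https {}

  # a rule has to live in a chain; the forward hook here is an assumption
  chain forward {
    type filter hook forward priority filter; policy accept;
    tcp dport 443 counter name accept_https accept comment "accept https"
  }
}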

The current state can be queried with nft list ruleset | grep counter. 
Such a counter gathers statistics from the moment the ruleset is loaded 
until eternity. A delta analysis for a 15min window could be done in a 
later stage with a little math in a scraping tool.
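
A minimal sketch of that delta math in Python, assuming the scraper samples the counter at regular intervals with nft -j list counters and keeps the samples in memory. The names read_counter_bytes and WindowedCounter are made up for illustration, and the parser assumes the JSON layout described in libnftables-json(5). Only the parsing and the arithmetic are shown, not the call to nft itself.

```python
import json
from collections import deque


def read_counter_bytes(nft_json_text, table, name):
    """Return the byte total of one named counter from `nft -j list counters` output.

    Assumes entries of the form
    {"counter": {"table": ..., "name": ..., "packets": ..., "bytes": ...}}.
    """
    for item in json.loads(nft_json_text)["nftables"]:
        counter = item.get("counter")
        if counter and counter["table"] == table and counter["name"] == name:
            return counter["bytes"]
    raise KeyError(f"counter {name!r} not found in table {table!r}")


class WindowedCounter:
    """Turn samples of an ever-growing byte counter into a sliding-window delta."""

    def __init__(self, window_seconds=15 * 60):
        self.window = window_seconds
        self.samples = deque()  # (timestamp, cumulative_bytes), oldest first

    def add_sample(self, timestamp, total_bytes):
        self.samples.append((timestamp, total_bytes))
        # Drop old samples, but keep the newest one that is at or before the
        # start of the window: it serves as the baseline for the subtraction.
        cutoff = timestamp - self.window
        while len(self.samples) >= 2 and self.samples[1][0] <= cutoff:
            self.samples.popleft()

    def bytes_in_window(self):
        """Bytes counted between the baseline sample and the latest sample."""
        if len(self.samples) < 2:
            return 0
        return self.samples[-1][1] - self.samples[0][1]
```

With one sample per minute the result is precise to the minute. Between samples the window boundary is only as exact as the sampling interval.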
A set would serve a similar purpose. This example is already more 
sophisticated, as it distinguishes (internal) IPv4 addresses:
define private_net = 192.168.2.0/24

table inet nftmon {
        set ip4counters {
                type ipv4_addr
                size 65535
                flags dynamic
                counter
        }

        chain forward {
                type filter hook postrouting priority filter + 1; policy accept;
                ip saddr $private_net add @ip4counters { ip saddr }
                ip daddr $private_net add @ip4counters { ip daddr }
        }
}

Querying it with:

nft list set inet nftmon ip4counters

is more straightforward, as it lists only the relevant metrics.
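
A scraper could turn that listing into per-address numbers roughly as follows. This is a sketch in Python under the assumption that each element is printed as an address followed by "counter packets N bytes M", possibly with timeout/expires fields in between. The sample text is hand-written rather than captured from a router, and parse_set_counters is a hypothetical helper name.

```python
import re

# One set element: an IPv4 address, optional fields such as timeout/expires,
# then the element's counter. Commas and braces delimit elements.
ELEMENT_RE = re.compile(
    r"(\d{1,3}(?:\.\d{1,3}){3})\b[^,}]*?counter packets (\d+) bytes (\d+)"
)


def parse_set_counters(listing):
    """Map each IPv4 address in the listing to a (packets, bytes) tuple."""
    return {
        address: (int(packets), int(bytes_))
        for address, packets, bytes_ in ELEMENT_RE.findall(listing)
    }


# Hand-written sample input (assumed output format of `nft list set`).
SAMPLE = """\
table inet nftmon {
    set ip4counters {
        type ipv4_addr
        size 65535
        flags dynamic
        counter
        elements = { 192.168.2.10 counter packets 12 bytes 840,
                     192.168.2.20 counter packets 3 bytes 180 }
    }
}
"""

if __name__ == "__main__":
    for address, (packets, byte_count) in parse_set_counters(SAMPLE).items():
        print(address, packets, byte_count)
```

Parsing the JSON output of nft -j would likely be more robust than a regular expression; the text form is used here only because it matches the command shown above.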

I could further enhance this by adding the timeout flag to the set and a 
timeout of 15min to the add part of the rule filling the set:
ip saddr 192.168.2.0/24 add @ip4counters { ip saddr timeout 15m }

1. The first approach with a named counter and diff logic in a later 
stage (scraping script, piece of code) moves load from nftables to 
somewhere else. Is this recommended over the set variant with 
timeouts? Will a counter overflow from time to time and break the 
subtraction? (Uptime is multiple months with considerable traffic.)
2. For the 2nd approach I assume that a single packet matching the rule 
ends up as an entry with a 15m timeout in the set. Thus no entry in the 
set is older than 15min, so the metrics from this set only span a 15min 
interval. Is this correct? Asked from a different point of view: when 
does garbage collection take place and clear the timed-out entries from 
the set?
3. Can the pure counter approach not be improved with a garbage 
collection configuration? Running every 15min, this would create 24x4 
15min intervals. Would scraping in between garbage collection runs mean 
missing bandwidth/packets?

4. Is there a third, more efficient/cheaper approach to define a rule or 
rules that yield bandwidth/throughput metrics grouped by IP (or port or 
whatever the rule is made of) so that only the last n minutes are taken 
into consideration? (Precise to the minute.)
5. Querying conntrack would be a later stage if bandwidth monitoring 
shows unusual activity. A counter or the set approach requires fewer 
resources (CPU, memory). Is this correct?
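
To make the overflow concern from question 1 concrete, here is a small Python sketch of a subtraction that tolerates a counter going backwards. It assumes 64-bit counters (an assumption, not something confirmed here) and uses a simple heuristic to tell a wraparound from a reset, for example after reloading the ruleset. Both function names are hypothetical.

```python
def counter_delta(previous, current, bits=64):
    """Bytes counted between two samples of a cumulative counter.

    bits=64 is an assumed counter width, not a verified fact.
    """
    if current >= previous:
        return current - previous
    if previous >= 2 ** (bits - 1):
        # The old value was in the upper half of the range, so a wraparound
        # is plausible: subtract modulo 2**bits.
        return (current - previous) % (2 ** bits)
    # Otherwise assume the counter was reset to zero in between and count
    # only what has accumulated since the reset.
    return current


def years_until_wrap(bytes_per_second, bits=64):
    """Rough time for a counter starting at zero to reach 2**bits at a constant rate."""
    seconds = 2 ** bits / bytes_per_second
    return seconds / (365 * 24 * 3600)


if __name__ == "__main__":
    print(counter_delta(100, 250))
    print(years_until_wrap(1.25e9))  # 1.25e9 bytes/s is roughly 10 Gbit/s
```

If the 64-bit assumption holds, a wraparound within a few months of uptime looks unlikely, and a reset would be the more common reason for a negative difference.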

Thanks in advance,

Benno