Thanks for the fast reply.
On 11.10.23 at 23:52, Kerin Millar wrote:
table inet filter {
    counter accept_https {}
    chain input {
        type filter hook input priority filter;
        tcp dport 443 counter name accept_https accept comment "accept https"
    }
}
Current state is to be queried with nft list ruleset | grep counter.
nft list counters inet would make more sense (--json is also supported).
I hadn't tried this firsthand, but it works. A valuable shortcut.
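For the scraper I will probably consume the JSON output directly. A
minimal sketch of the extraction, assuming jq is available and that
the counter objects carry the usual table/name/packets/bytes fields:

nft -j list counters inet |
    jq -r '.nftables[] | select(.counter) | .counter |
           "\(.table) \(.name) \(.packets) \(.bytes)"'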
define private_net = 192.168.2.0/24
table inet nftmon {
set ip4counters {
type ipv4_addr
size 65535
flags dynamic
counter
}
chain forward {
type filter hook postrouting priority filter + 1;
policy accept;
ip saddr $private_net add @ip4counters { ip saddr }
ip daddr $private_net add @ip4counters { ip daddr }
}
}
In the absence of a timeout, it is probable that the set will become full.
Absolutely; I simply postulated the timeout as part of the question.
Nevertheless, even with a timeout, many IPs to route or fine-grained
metrics mean that 65535 entries last only until 65534 plus one.
(Distinguishing 16 protocols leaves room for only 4096 entries per
protocol.)
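If it comes to that, I would rather widen the set key than juggle one
set per protocol. A rough, untested sketch (the set name
ip4proto_counters is made up, and it assumes the nft/kernel in use
accepts dynamic updates of concatenated keys):

set ip4proto_counters {
    type ipv4_addr . inet_proto
    size 1048576
    flags dynamic
    counter
}
# in chain forward, analogous to the existing rules:
ip saddr $private_net add @ip4proto_counters { ip saddr . ip protocol }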
1. The first approach, with a named counter and diff logic in a later
stage (scraping script, piece of code), moves load from nftables to
somewhere else. Is this recommended in comparison to the
timeout-flagged set variant? Will a counter overflow and break the
subtraction from time to time? (Uptime is multiple months with
sufficient traffic.)
The manual indicates that they are signed 64-bit integers, which is quite generous. However, I am uncertain as to how they wrap around. My guess would be that they go as far as 9223372036854775807 ((1 << 63) - 1) before wrapping around to 0, because wrapping around to a negative number would be confusing. I would appreciate a confirmation from a Netfilter developer, one way or the other. Once the wrapping behaviour is confirmed, detecting such should be straightforward.
I could live with those, as I do with Nagios. But the approach with
sets and a timeout makes this very rare.
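For completeness, the diff step in my scraper would guard against a
wrap roughly like this (untested sketch; it leans on your guess that
the counter wraps from (1 << 63) - 1 back to 0):

#!/bin/bash
# prev and cur: packet (or byte) values of the same counter taken
# from two consecutive scrapes.
prev=$1
cur=$2
max=9223372036854775807   # (1 << 63) - 1, per the signed 64-bit width
if (( cur >= prev )); then
    delta=$(( cur - prev ))
else
    # assume exactly one wrap happened between the two scrapes
    delta=$(( (max - prev) + cur + 1 ))
fi
echo "$delta"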
2. For the second approach I assume that a single packet matching the
rule will end up in the set with a 15m timeout. Thus no entry in the
set is older than 15 minutes, so the metrics from this set only span a
15-minute interval. Is this correct? Asked from a different point of
view: when will garbage collection take place, clearing the timed-out
values from the set?
Yes, this is correct. Once an element has timed out, it shall be unceremoniously removed. Further, the timeout value may also be defined by the set itself. A potential issue with this approach is that your data collector ends up racing with the exact time at which a given element is created and/or the exact time at which it is removed.
A small uncertainty remains, per the Shannon sampling theorem. A
15-minute timeout scraped every 3-4 minutes, plus/minus a few packets,
is sufficiently precise.
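To keep the rules short I would then move the timeout into the set
definition itself, as you mention, something like this (untested):

set ip4counters {
    type ipv4_addr
    size 65535
    flags dynamic,timeout
    timeout 15m
    counter
}
# the add rules stay unchanged; new elements should pick up the
# 15m default timeout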
3. Can the pure counter approach not be improved with a
garbage-collection configuration? Running it every 15 minutes would
create 24x4 15-minute intervals per day. Would scraping in between
garbage-collection runs mean missing bandwidth/packets?
I'm not sure that I understand what a garbage collection configuration would entail. Do you mean for your collector to reset the counters after collecting their values? If so, I would not expect for anything to be missed, provided that the reset command is issued via the same invocation of nft(8) that instructs it to print the counters.
nft -j 'list counters inet; reset counters inet'
Something like that, or a fixed interval like 24h, or a fixed point in
time like 02:00 (much larger than the scraping interval, yet small
enough to circumvent overflows or larger gaps).
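Concretely, I am thinking of a cron entry along these lines
(hypothetical paths and log file), reusing your one-shot
list-and-reset so nothing is lost between reading and resetting:

# /etc/cron.d/nft-counters (sketch): reset daily at 02:00 and keep
# the values that were just read for the collector
0 2 * * * root /usr/sbin/nft -j 'list counters inet; reset counters inet' >> /var/log/nft-counters.json

The frequent scrape in between would then just run
nft -j list counters inet without resetting anything.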
Having said that, I may have uncovered a bug in the course of trying this. I shall explore the matter further.
Thanks in advance for fixing it.