Re: [PATCH nf,v3] netfilter: nf_tables: coalesce multiple notifications into one skbuff

Phil Sutter <phil@xxxxxx> · Thu, 3 Sep 2020 10:47:14 +0200

Hi Pablo,

On Wed, Sep 02, 2020 at 06:54:42PM +0200, Pablo Neira Ayuso wrote:
> On Wed, Sep 02, 2020 at 06:39:34PM +0200, Phil Sutter wrote:
> > On Wed, Sep 02, 2020 at 06:37:43PM +0200, Pablo Neira Ayuso wrote:
> > > On x86_64, each notification results in one skbuff allocation which
> > > consumes at least 768 bytes due to the skbuff overhead.
> > > 
> > > This patch coalesces several notifications into one single skbuff, so
> > > each notification consumes at least ~211 bytes, that ~3.5 times less
> > > memory consumption. As a result, this is reducing the chances to exhaust
> > > the netlink socket receive buffer.
> > > 
> > > Rule of thumb is that each notification batch only contains netlink
> > > messages whose report flag is the same, nfnetlink_send() requires this
> > > to do appropriately delivery to userspace, either via unicast (echo
> > > mode) or multicast (monitor mode).
> > > 
> > > The skbuff control buffer is used to annotate the report flag for later
> > > handling at the new coalescing routine.
> > > 
> > > The batch skbuff notification size is NLMSG_GOODSIZE, using a larger
> > > skbuff would allow for more socket receiver buffer savings (to amortize
> > > the cost of the skbuff even more), however, going over that size might
> > > break userspace applications, so let's be conservative and stick to
> > > NLMSG_GOODSIZE.
> > > 
> > > Reported-by: Phil Sutter <phil@xxxxxx>
> > > Signed-off-by: Pablo Neira Ayuso <pablo@xxxxxxxxxxxxx>
> > 
> > Acked-by: Phil Sutter <phil@xxxxxx>
> 
> Thanks, I'll place this into nf.git

Thanks!

> BTW, I assume this mitigates the problem that Eric reported? Is it
> not so easy to trigger the problem after this patch?

Eric plans to push zones individually into the kernel from firewalld so
the problem shouldn't occur anymore unless one uses a ridiculously large
zone.

> I forgot to say, probably it would be good to monitor
> /proc/net/netlink to catch how busy the socket receive buffer is
> getting with your firewalld ruleset.

The socket doesn't live long enough to monitor it this way, but I tested
at which point things start failing again:

In firewalld, I see startup errors when having more than eight zones
configured. This is not too much, but given that we're talking about a
restrictive environment and the above change is planned anyway, it's not
a real problem.

The simple reproducer script I pasted earlier fails if the number of
rules exceeds 382. The error message is:

| netlink: Error: Could not process rule: Message too long

So I assume it is simply exhausting netlink send buffer space.

BTW: Outside of lxc, my script still succeeds for 100k rules and 1M
rules triggers OOM killer. :)

Cheers, Phil