Hi Pablo,

On Wed, Sep 02, 2020 at 06:54:42PM +0200, Pablo Neira Ayuso wrote:
> On Wed, Sep 02, 2020 at 06:39:34PM +0200, Phil Sutter wrote:
> > On Wed, Sep 02, 2020 at 06:37:43PM +0200, Pablo Neira Ayuso wrote:
> > > On x86_64, each notification results in one skbuff allocation which
> > > consumes at least 768 bytes due to the skbuff overhead.
> > >
> > > This patch coalesces several notifications into one single skbuff, so
> > > each notification consumes at least ~211 bytes, that ~3.5 times less
> > > memory consumption. As a result, this is reducing the chances to exhaust
> > > the netlink socket receive buffer.
> > >
> > > Rule of thumb is that each notification batch only contains netlink
> > > messages whose report flag is the same, nfnetlink_send() requires this
> > > to do appropriately delivery to userspace, either via unicast (echo
> > > mode) or multicast (monitor mode).
> > >
> > > The skbuff control buffer is used to annotate the report flag for later
> > > handling at the new coalescing routine.
> > >
> > > The batch skbuff notification size is NLMSG_GOODSIZE, using a larger
> > > skbuff would allow for more socket receiver buffer savings (to amortize
> > > the cost of the skbuff even more), however, going over that size might
> > > break userspace applications, so let's be conservative and stick to
> > > NLMSG_GOODSIZE.
> > >
> > > Reported-by: Phil Sutter <phil@xxxxxx>
> > > Signed-off-by: Pablo Neira Ayuso <pablo@xxxxxxxxxxxxx>
> >
> > Acked-by: Phil Sutter <phil@xxxxxx>
>
> Thanks, I'll place this into nf.git

Thanks!

> BTW, I assume this mitigates the problem that Eric reported? Is it
> not so easy to trigger the problem after this patch?

Eric plans to push zones individually into the kernel from firewalld so
the problem shouldn't occur anymore unless one uses a ridiculously large
zone.

> I forgot to say, probably it would be good to monitor
> /proc/net/netlink to catch how busy the socket receive buffer is
> getting with your firewalld ruleset.

The socket doesn't live long enough to monitor it this way, but I tested
at which point things start failing again:

In firewalld, I see startup errors when having more than eight zones
configured. This is not too much, but given that we're talking about a
restrictive environment and the above change is planned anyway, it's not
a real problem.

The simple reproducer script I pasted earlier fails if the number of
rules exceeds 382. The error message is:

| netlink: Error: Could not process rule: Message too long

So I assume it is simply exhausting netlink send buffer space.

BTW: Outside of lxc, my script still succeeds for 100k rules and 1M
rules triggers OOM killer. :)

Cheers, Phil
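
For sockets that do live long enough, the /proc/net/netlink monitoring
suggested in the quoted text could be sketched roughly as below. This is
only an illustration, not something from the thread: the function names
and the sample table are hypothetical, and the code assumes the file
starts with a header row that names an "Rmem" column holding the bytes
currently queued in each socket's receive buffer as a decimal number.

```python
# Hypothetical helper names; the column layout is an assumption and is
# taken from the header row rather than hard-coded positions.

SAMPLE = """\
sk               Eth Pid        Groups   Rmem     Wmem     Dump  Locks    Drops    Inode
0000000000000000 12  1234       00000000 212992   0        0     2        3        4242
0000000000000000 12  5678       00000000 768      0        0     2        0        4243
"""


def parse_netlink_table(text):
    """Turn /proc/net/netlink-style text into a list of dicts keyed by column name."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return []
    header = lines[0].split()
    rows = []
    for ln in lines[1:]:
        fields = ln.split()
        if len(fields) != len(header):
            continue  # skip lines that do not match the header layout
        rows.append(dict(zip(header, fields)))
    return rows


def busiest_receivers(rows, top=5):
    """Return the rows with the largest Rmem value, largest first."""
    return sorted(rows, key=lambda r: int(r["Rmem"]), reverse=True)[:top]


def read_proc_net_netlink(path="/proc/net/netlink"):
    """Read the live table; only meaningful on a Linux host."""
    with open(path) as fh:
        return parse_netlink_table(fh.read())


def format_row(row):
    """One-line summary of a socket's receive-buffer usage and drop count."""
    return "pid={Pid} proto={Eth} rmem={Rmem} drops={Drops}".format(**row)
```

Calling busiest_receivers(read_proc_net_netlink()) in a loop with a short
sleep would show whether Rmem approaches the socket's receive buffer
limit. A socket that exists only for the duration of a single ruleset
load is likely to be missed by this kind of polling.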