Hi Phil, On Mon, Aug 24, 2020 at 03:11:04PM +0200, Phil Sutter wrote: [...] > On Sun, Aug 23, 2020 at 02:04:34PM +0200, Pablo Neira Ayuso wrote: > > Hi Phil, > > > > On Sat, Aug 22, 2020 at 01:06:15AM +0200, Phil Sutter wrote: > > > Hi, > > > > > > Starting firewalld with two active zones in an lxc container provokes a > > > situation in which nfnetlink_rcv_msg() loops indefinitely, because > > > nc->call_rcu() (nf_tables_getgen() in this case) returns -EAGAIN every > > > time. > > > > > > I identified netlink_attachskb() as the originator for the above error > > > code. The conditional leading to it looks like this: > > > > > > | if ((atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf || > > > | test_bit(NETLINK_S_CONGESTED, &nlk->state))) { > > > | [...] > > > | if (!*timeo) { > > > > > > *timeo is zero, so this seems to be a non-blocking socket. Both > > > NETLINK_S_CONGESTED bit is set and sk->sk_rmem_alloc exceeds > > > sk->sk_rcvbuf. > > > > > > From user space side, firewalld seems to simply call sendto() and the > > > call never returns. > > > > > > How to solve that? I tried to find other code which does the same, but I > > > haven't found one that does any looping. Should nfnetlink_rcv_msg() > > > maybe just return -EAGAIN to the caller if it comes from call_rcu > > > backend? > > > > It's a bug in the netlink frontend, which erroneously reports -EAGAIN > > to the nfnetlink when the socket buffer is full, see: > > > > https://patchwork.ozlabs.org/project/netfilter-devel/patch/20200823115536.16631-1-pablo@xxxxxxxxxxxxx/ > > Obviously this avoids the lockup. As correctly assumed by Florian, > firewalld startup fails instead. (The daemon keeps running, but an error > message is printed indicating that initial ruleset setup failed.) Thanks for confirming, I'll apply this patch to nf.git. > [...] > > > The value of sk_rcvbuf is 425984, BTW. sk_rmem_alloc is 426240. In user > > > space, I see a call to setsockopt(SO_RCVBUF) with value 4194304. No idea > > > if this is related and how. > > > > Next problem is to track why socket buffer is getting full with > > GET_GENID. > > > > firewalld heavily uses NLM_F_ECHO, there I can see how it can easily > > reach the default socket buffer size, but with GET_GENID I'm not sure > > yet, probably the problem is elsewhere but it manifests in GET_GENID > > because it's the first thing that is done when sending a batch (maybe > > there are unread messages in the socket buffer, you might check > > /proc/net/netlink to see if the socket buffer keeps growing as > > firewalld moves on). > > Yes, it happens only for echo mode. With your fix in place, I also see > what firewalld is trying to do: The JSON input leading to the error is > huge (~72k characters). I suspect that GET_GENID just happens to be the > last straw. Or my debugging was faulty somehow and netlink_attachskb() > really got called via a different code-path. > > > Is this easy to reproduce? Or does this happens after some time of > > firewalld execution? > > The necessary lxd setup aside, it's pretty trivial: launch an instance > of images:centos/8/amd64, install firewalld therein, add two zone files > and start firewalld. It happens immediately, so two active zones already > make firewalld generate enough rules to exceed the buffer space. > > On Sun, Aug 23, 2020 at 01:55:36PM +0200, Pablo Neira Ayuso wrote: > > Frontend callback reports EAGAIN to nfnetlink to retry a command, this > > is used to signal that module autoloading is required. Unfortunately, > > nlmsg_unicast() reports EAGAIN in case the receiver socket buffer gets > > full, so it enters a busy-loop. > > > > This patch updates nfnetlink_unicast() to turn EAGAIN into ENOBUFS and > > to use nlmsg_unicast(). Remove the flags field in nfnetlink_unicast() > > since this is always MSG_DONTWAIT in the existing code which is exactly > > what nlmsg_unicast() passes to netlink_unicast() as parameter. > > > > Fixes: 96518518cc41 ("netfilter: add nftables") > > Reported-by: Phil Sutter <phil@xxxxxx> > > Signed-off-by: Pablo Neira Ayuso <pablo@xxxxxxxxxxxxx> > > This indeed "fixes" the problem. Or rather, exposes the actual problem > in echo-related code, namely the tendency to exhaust socket buffers. > > So the problem we're facing is that while user space still waits for > sendmsg() to complete, receive buffer fills up. Is it possible to buffer > the data in kernel somewhere else so user space has a chance to call > recvmsg()? There several possibilities, just a few that can be explored: * I made a quick patch to batch several netlink messages coming as reply to the NLM_F_ECHO request into one single skbuff. If you look at the _notify() functions in the kernel, this is currently taking one single skbuff for one message which adds a bit of overhead (the skbuff truesize is used for the socket buffer accounting). I'm measuring here on x86_64 that each command takes 768 bytes. With a quick patch I'm batching several reply netlink messages into one single skbuff, now each commands takes ~120 bytes (well, size depends on how many expressions you use actually). This means this can handle ~3550 commands vs. the existing ~555 commands (assuming very small sk_rmem_alloc 426240 as the one you're reporting). Even if this does not fix your problem (you refer to 72k chars, not sure how many commands this is), this is probably good to have anyway, this decreasing memory consumption by 85%. This will also make event reporting (monitor mode) more reliable through netlink (it's the same codepath). Note that userspace requires no changes to support batching mode: libmnl's mnl_cb_run() keeps iterating over the buffer that was received until all netlink messages are consumed. The quick patch is incomplete, I just want to prove the saving in terms of memory. I'll give it another spin and submit this for review. * Probably recover the cookie idea: firewalld specifies a cookie that identifies the rule from userspace, so there is no need for NLM_F_ECHO. Eric mentioned he would like to delete rules by cookie, this should be possible by looking up for the handle from the cookie to delete rules. The cookie is stored in NFTA_RULE_USERDATA so this is only meaningful to userspace. No kernel changes are required (this is supported for a bit of time already). Note: The cookie could be duplicated. Since the cookie is allocated by userspace, it's up to userspace to ensure uniqueness. In case cookies are duplicated, if you delete a rule by cookie then Rule deletion by cookie requires dumping the whole ruleset though (slow). However, I think firewalld keeps a rule cache, so it uses the rule handle to delete the rule instead (no need to dump the cache then). Therefore, the cookie is only used to fetch the rule handle. With the rule cookie in place, firewalld can send the batch, then make a NLM_F_DUMP request after sending the batch to retrieve the handle from the cookie instead of using NLM_F_ECHO. The batch send + NLM_F_DUMP barely consumes 16KBytes per recvmsg() call. The send + dump is not atomic though. Because NLM_F_ECHO is only used to get back the rule handle, right?