Re: nfnetlink: Busy-loop in nfnetlink_rcv_msg()

Eric Garver <eric@xxxxxxxxxxx> · Wed, 26 Aug 2020 14:54:23 -0400

On Wed, Aug 26, 2020 at 05:32:19PM +0200, Pablo Neira Ayuso wrote:
> Hi Phil,
> 
> On Mon, Aug 24, 2020 at 03:11:04PM +0200, Phil Sutter wrote:
> [...]
> > On Sun, Aug 23, 2020 at 02:04:34PM +0200, Pablo Neira Ayuso wrote:
> > > Hi Phil,
> > > 
> > > On Sat, Aug 22, 2020 at 01:06:15AM +0200, Phil Sutter wrote:
> > > > Hi,
> > > > 
> > > > Starting firewalld with two active zones in an lxc container provokes a
> > > > situation in which nfnetlink_rcv_msg() loops indefinitely, because
> > > > nc->call_rcu() (nf_tables_getgen() in this case) returns -EAGAIN every
> > > > time.
> > > > 
> > > > I identified netlink_attachskb() as the originator for the above error
> > > > code. The conditional leading to it looks like this:
> > > > 
> > > > | if ((atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf ||
> > > > |      test_bit(NETLINK_S_CONGESTED, &nlk->state))) {
> > > > |         [...]
> > > > |         if (!*timeo) {
> > > > 
> > > > *timeo is zero, so this seems to be a non-blocking socket. Both
> > > > NETLINK_S_CONGESTED bit is set and sk->sk_rmem_alloc exceeds
> > > > sk->sk_rcvbuf.
> > > > 
> > > > From user space side, firewalld seems to simply call sendto() and the
> > > > call never returns.
> > > > 
> > > > How to solve that? I tried to find other code which does the same, but I
> > > > haven't found one that does any looping. Should nfnetlink_rcv_msg()
> > > > maybe just return -EAGAIN to the caller if it comes from call_rcu
> > > > backend?
> > > 
> > > It's a bug in the netlink frontend, which erroneously reports -EAGAIN
> > > to the nfnetlink when the socket buffer is full, see:
> > > 
> > > https://patchwork.ozlabs.org/project/netfilter-devel/patch/20200823115536.16631-1-pablo@xxxxxxxxxxxxx/
> > 
> > Obviously this avoids the lockup. As correctly assumed by Florian,
> > firewalld startup fails instead. (The daemon keeps running, but an error
> > message is printed indicating that initial ruleset setup failed.)
> 
> Thanks for confirming, I'll apply this patch to nf.git.
> 
> > [...]
> > > > The value of sk_rcvbuf is 425984, BTW. sk_rmem_alloc is 426240. In user
> > > > space, I see a call to setsockopt(SO_RCVBUF) with value 4194304. No idea
> > > > if this is related and how.
> > > 
> > > Next problem is to track why socket buffer is getting full with
> > > GET_GENID.
> > > 
> > > firewalld heavily uses NLM_F_ECHO, there I can see how it can easily
> > > reach the default socket buffer size, but with GET_GENID I'm not sure
> > > yet, probably the problem is elsewhere but it manifests in GET_GENID
> > > because it's the first thing that is done when sending a batch (maybe
> > > there are unread messages in the socket buffer, you might check
> > > /proc/net/netlink to see if the socket buffer keeps growing as
> > > firewalld moves on).
> > 
> > Yes, it happens only for echo mode. With your fix in place, I also see
> > what firewalld is trying to do: The JSON input leading to the error is
> > huge (~72k characters). I suspect that GET_GENID just happens to be the
> > last straw. Or my debugging was faulty somehow and netlink_attachskb()
> > really got called via a different code-path.
> > 
> > > Is this easy to reproduce? Or does this happens after some time of
> > > firewalld execution?
> > 
> > The necessary lxd setup aside, it's pretty trivial: launch an instance
> > of images:centos/8/amd64, install firewalld therein, add two zone files
> > and start firewalld. It happens immediately, so two active zones already
> > make firewalld generate enough rules to exceed the buffer space.
> > 
> > On Sun, Aug 23, 2020 at 01:55:36PM +0200, Pablo Neira Ayuso wrote:
> > > Frontend callback reports EAGAIN to nfnetlink to retry a command, this
> > > is used to signal that module autoloading is required. Unfortunately,
> > > nlmsg_unicast() reports EAGAIN in case the receiver socket buffer gets
> > > full, so it enters a busy-loop.
> > > 
> > > This patch updates nfnetlink_unicast() to turn EAGAIN into ENOBUFS and
> > > to use nlmsg_unicast(). Remove the flags field in nfnetlink_unicast()
> > > since this is always MSG_DONTWAIT in the existing code which is exactly
> > > what nlmsg_unicast() passes to netlink_unicast() as parameter.
> > > 
> > > Fixes: 96518518cc41 ("netfilter: add nftables")
> > > Reported-by: Phil Sutter <phil@xxxxxx>
> > > Signed-off-by: Pablo Neira Ayuso <pablo@xxxxxxxxxxxxx>
> > 
> > This indeed "fixes" the problem. Or rather, exposes the actual problem
> > in echo-related code, namely the tendency to exhaust socket buffers.
> > 
> > So the problem we're facing is that while user space still waits for
> > sendmsg() to complete, receive buffer fills up. Is it possible to buffer
> > the data in kernel somewhere else so user space has a chance to call
> > recvmsg()?
> 
> There several possibilities, just a few that can be explored:
> 
> * I made a quick patch to batch several netlink messages coming as
>   reply to the NLM_F_ECHO request into one single skbuff. If you look
>   at the _notify() functions in the kernel, this is currently taking
>   one single skbuff for one message which adds a bit of overhead (the
>   skbuff truesize is used for the socket buffer accounting). I'm
>   measuring here on x86_64 that each command takes 768 bytes. With a
>   quick patch I'm batching several reply netlink messages into one
>   single skbuff, now each commands takes ~120 bytes (well, size
>   depends on how many expressions you use actually). This means this
>   can handle ~3550 commands vs. the existing ~555 commands (assuming
>   very small sk_rmem_alloc 426240 as the one you're reporting).
> 
>   Even if this does not fix your problem (you refer to 72k chars, not
>   sure how many commands this is), this is probably good to have
>   anyway, this decreasing memory consumption by 85%. This will also
>   make event reporting (monitor mode) more reliable through netlink
>   (it's the same codepath).
> 
>   Note that userspace requires no changes to support batching mode:
>   libmnl's mnl_cb_run() keeps iterating over the buffer that was
>   received until all netlink messages are consumed.
> 
>   The quick patch is incomplete, I just want to prove the saving in
>   terms of memory. I'll give it another spin and submit this for
>   review.
> 
> * Probably recover the cookie idea: firewalld specifies a cookie
>   that identifies the rule from userspace, so there is no need for
>   NLM_F_ECHO. Eric mentioned he would like to delete rules by cookie,
>   this should be possible by looking up for the handle from the
>   cookie to delete rules. The cookie is stored in NFTA_RULE_USERDATA
>   so this is only meaningful to userspace. No kernel changes are
>   required (this is supported for a bit of time already).

Good that it doesn't require a new kernel.

>   Note: The cookie could be duplicated. Since the cookie is allocated
>   by userspace, it's up to userspace to ensure uniqueness. In case
>   cookies are duplicated, if you delete a rule by cookie then

I think this is advantageous. It means a user/program can group rules
with a given cookie value.

>   Rule deletion by cookie requires dumping the whole ruleset though
>   (slow). However, I think firewalld keeps a rule cache, so it uses
>   the rule handle to delete the rule instead (no need to dump the
>   cache then). Therefore, the cookie is only used to fetch the rule
>   handle.
> 
>   With the rule cookie in place, firewalld can send the batch, then
>   make a NLM_F_DUMP request after sending the batch to retrieve the
>   handle from the cookie instead of using NLM_F_ECHO. The batch send +
>   NLM_F_DUMP barely consumes 16KBytes per recvmsg() call. The send +
>   dump is not atomic though.

NLM_F_DUMP has the downside that a single rule update means fetching the
whole rule set.

firewalld has a large batch when it starts up. After the initial startup
rule updates are generally small. I don't think NLM_F_DUMP is good for
firewalld's use case. The simpler solution would be for firewalld to
step the initial large batch, i.e. break it up into smaller batches.

>   Because NLM_F_ECHO is only used to get back the rule handle, right?

Right.