Re: nfnetlink: Busy-loop in nfnetlink_rcv_msg()

Pablo Neira Ayuso <pablo@xxxxxxxxxxxxx> · Sun, 23 Aug 2020 14:04:34 +0200

Hi Phil,

On Sat, Aug 22, 2020 at 01:06:15AM +0200, Phil Sutter wrote:
> Hi,
> 
> Starting firewalld with two active zones in an lxc container provokes a
> situation in which nfnetlink_rcv_msg() loops indefinitely, because
> nc->call_rcu() (nf_tables_getgen() in this case) returns -EAGAIN every
> time.
> 
> I identified netlink_attachskb() as the originator for the above error
> code. The conditional leading to it looks like this:
> 
> | if ((atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf ||
> |      test_bit(NETLINK_S_CONGESTED, &nlk->state))) {
> |         [...]
> |         if (!*timeo) {
> 
> *timeo is zero, so this seems to be a non-blocking socket. Both
> NETLINK_S_CONGESTED bit is set and sk->sk_rmem_alloc exceeds
> sk->sk_rcvbuf.
> 
> From user space side, firewalld seems to simply call sendto() and the
> call never returns.
> 
> How to solve that? I tried to find other code which does the same, but I
> haven't found one that does any looping. Should nfnetlink_rcv_msg()
> maybe just return -EAGAIN to the caller if it comes from call_rcu
> backend?

It's a bug in the netlink frontend, which erroneously reports -EAGAIN
to nfnetlink when the socket buffer is full, see:

https://patchwork.ozlabs.org/project/netfilter-devel/patch/20200823115536.16631-1-pablo@xxxxxxxxxxxxx/
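
To illustrate why a stray -EAGAIN turns into a busy-loop, here is a
small user-space model. It is not the kernel code: every model_* name
is hypothetical, and the -EAGAIN to -ENOBUFS translation is an assumed
remedy for illustration only. The linked patch is the reference for the
actual change.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for a unicast to a non-blocking socket whose
 * receive buffer is full: it fails with -EAGAIN on every call. */
static int model_unicast_full(void)
{
	return -EAGAIN;
}

/* Assumed remedy: report a full buffer as -ENOBUFS so the dispatcher
 * cannot mistake it for a "please replay this message" request. */
static int model_unicast_translated(void)
{
	int err = model_unicast_full();

	return err == -EAGAIN ? -ENOBUFS : err;
}

/* Model of a dispatcher that replays the callback while it returns
 * -EAGAIN. max_calls is a safety cap that exists only in this model;
 * *calls receives the number of callback invocations. */
static int model_dispatch(int (*cb)(void), int max_calls, int *calls)
{
	int err;

	assert(cb && calls && max_calls > 0);
	*calls = 0;
	do {
		err = cb();
		(*calls)++;
	} while (err == -EAGAIN && *calls < max_calls);

	return err;
}
```

With model_unicast_full() the dispatcher only stops at the cap, which
corresponds to the endless loop reported above. With
model_unicast_translated() it returns after a single call and the error
can be passed back to the sender.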

> This happening only in an lxc container may be due to some setsockopt()
> calls not being allowed. In particular, setsockopt(SO_RCVBUFFORCE)
> returns EPERM.

SO_RCVBUFFORCE fails with EPERM if CAP_NET_ADMIN is not available.

> The value of sk_rcvbuf is 425984, BTW. sk_rmem_alloc is 426240. In user
> space, I see a call to setsockopt(SO_RCVBUF) with value 4194304. No idea
> if this is related and how.

The next problem is to track down why the socket buffer is getting full
with GET_GENID.

firewalld uses NLM_F_ECHO heavily, and I can see how that could easily
fill the default socket buffer. With GET_GENID I'm not sure yet.
Probably the problem is elsewhere and it only manifests in GET_GENID
because that is the first thing done when sending a batch. Maybe there
are unread messages in the socket buffer: you might check
/proc/net/netlink to see whether the socket buffer keeps growing as
firewalld moves on.
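
A rough sketch for watching this programmatically, assuming the column
layout "sk Eth Pid Groups Rmem Wmem Dump Locks Drops Inode" (check the
header line on the kernel in question). The struct and function names
are hypothetical; 12 is the value of NETLINK_NETFILTER in
<linux/netlink.h>.

```c
#include <assert.h>
#include <stdio.h>

#define NETLINK_NETFILTER_PROTO 12	/* NETLINK_NETFILTER */

struct nl_sock_stat {
	int proto;		/* "Eth" column: netlink protocol */
	unsigned int portid;	/* "Pid" column */
	int rmem;		/* "Rmem" column: bytes queued for reading */
	unsigned int drops;	/* "Drops" column */
};

/* Parse one line of /proc/net/netlink. Returns 0 on success, -1 if
 * the line does not match (for example the header line). */
static int parse_netlink_line(const char *line, struct nl_sock_stat *st)
{
	unsigned int groups;
	int wmem, dump, locks;
	int n;

	assert(line && st);
	/* %*s skips the socket pointer; Inode is not needed here. */
	n = sscanf(line, "%*s %d %u %x %d %d %d %d %u",
		   &st->proto, &st->portid, &groups, &st->rmem,
		   &wmem, &dump, &locks, &st->drops);
	return n == 8 ? 0 : -1;
}

/* Largest Rmem among NETLINK_NETFILTER sockets, or -1 if none found. */
static int max_netfilter_rmem(FILE *fp)
{
	char line[256];
	struct nl_sock_stat st;
	int max = -1;

	assert(fp);
	while (fgets(line, sizeof(line), fp)) {
		if (parse_netlink_line(line, &st) < 0)
			continue;
		if (st.proto == NETLINK_NETFILTER_PROTO && st.rmem > max)
			max = st.rmem;
	}
	return max;
}
```

Calling max_netfilter_rmem() on a freshly opened /proc/net/netlink at
intervals while firewalld starts would show whether Rmem keeps climbing
toward sk_rcvbuf.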

Is this easy to reproduce? Or does this happen only after firewalld has
been running for some time?