Re: 4.19: Traced deadlock during xfrm_user module load

Thomas Jarosch <thomas.jarosch@xxxxxxxxxxxxx> · Thu, 27 Jun 2019 17:46:29 +0200

Hi Florian,

You wrote on Tue, Jun 25, 2019 at 06:53:44PM +0200:
> Thanks for this detailed analysis.
> In this specific case I think this is enough:
> 
> diff --git a/net/netfilter/nfnetlink.c b/net/netfilter/nfnetlink.c
> index 92077d459109..61ba92415480 100644
> --- a/net/netfilter/nfnetlink.c
> +++ b/net/netfilter/nfnetlink.c
> @@ -578,7 +578,8 @@ static int nfnetlink_bind(struct net *net, int group)
>         ss = nfnetlink_get_subsys(type << 8);
>         rcu_read_unlock();
>         if (!ss)
> -               request_module("nfnetlink-subsys-%d", type);
> +               request_module_nowait("nfnetlink-subsys-%d", type);
>         return 0;
>  }
>  #endif

thanks for the patch! We finally found an easy way to reproduce the deadlock,
the following commands instantly trigger the problem on our machines:

    rmmod nf_conntrack_netlink
    rmmod xfrm_user
    conntrack -e NEW -E & modprobe -v xfrm_user

Note: the "-e" filter is needed to trigger the problematic
code path in the kernel.

We were worried that using "_nowait" would introduce other race conditions,
since the requested service might not be available by the time it is required.

On the other hand, if we understand correctly, it seems that after
"nfnetlink_bind()", the caller will listen on the socket for messages
regardless whether the needed modules are loaded, loading or unloaded.
To verify this we added a three second sleep during the initialisation of
nf_conntrack_netlink. The events started to appear after
the delayed init was completed.

If this is the case, then using "_nowait" should suffice as a fix
for the problem. Could you please confirm these assumptions
and give us some piece of mind?

Best regards,
Juliana Rodrigueiro and Thomas Jarosch