Thomas Jarosch <thomas.jarosch@xxxxxxxxxxxxx> wrote: > You wrote on Tue, Jun 25, 2019 at 06:53:44PM +0200: > > Thanks for this detailed analysis. > > In this specific case I think this is enough: > > > > diff --git a/net/netfilter/nfnetlink.c b/net/netfilter/nfnetlink.c > > index 92077d459109..61ba92415480 100644 > > --- a/net/netfilter/nfnetlink.c > > +++ b/net/netfilter/nfnetlink.c > > @@ -578,7 +578,8 @@ static int nfnetlink_bind(struct net *net, int group) > > ss = nfnetlink_get_subsys(type << 8); > > rcu_read_unlock(); > > if (!ss) > > - request_module("nfnetlink-subsys-%d", type); > > + request_module_nowait("nfnetlink-subsys-%d", type); > > return 0; > > } > > #endif > > thanks for the patch! We finally found an easy way to reproduce the deadlock, > the following commands instantly trigger the problem on our machines: > > rmmod nf_conntrack_netlink > rmmod xfrm_user > conntrack -e NEW -E & modprobe -v xfrm_user > > Note: the "-e" filter is needed to trigger the problematic > code path in the kernel. > > We were worried that using "_nowait" would introduce other race conditions, > since the requested service might not be available by the time it is required. Then this code would be buggy too, there is no guarantee that a request_module() succeeds. > "nfnetlink_bind()", the caller will listen on the socket for messages > regardless whether the needed modules are loaded, loading or unloaded. > To verify this we added a three second sleep during the initialisation of > nf_conntrack_netlink. The events started to appear after > the delayed init was completed. > > If this is the case, then using "_nowait" should suffice as a fix > for the problem. Could you please confirm these assumptions > and give us some piece of mind? Yes, _nowait is safe here (and needed, as you find out). I'm away for a few hours but I plan to submit this patch officially soon.