Thomas Jarosch <thomas.jarosch@xxxxxxxxxxxxx> wrote:
> we're in the process of upgrading to kernel 4.19 and hit
> a very rare lockup on boot during "xfrm_user" module load.
> The tested kernel was 4.19.55.
>
> When the strongswan IPsec service starts, it loads the xfrm_user module.
> -> modprobe hangs forever.
>
> Also network services like ssh or apache stop responding,
> ICMP ping still works.
>
> By chance we had magic sysRq enabled and were able to get some meaningful stack
> traces. We've rebuilt the kernel with LOCKDEP + DEBUG_INFO + DEBUG_INFO_REDUCED,
> but so far failed to reproduce the issue even when hammering the suspected
> deadlock case. Though it's just hammering it for a few hours yet.
>
> Preliminary analysis:
>
> "modprobe xfrm_user":
>     xfrm_user_init()
>         register_pernet_subsys()
>             -> grab pernet_ops_rwsem
>                 ..
>                 netlink_table_grab()
>                     calls schedule() as "nl_table_users" is non-zero
>
>
> conntrack netlink related program "info_iponline" does this in parallel:
>     netlink_bind()
>         netlink_lock_table() -> increases "nl_table_users"
>             nfnetlink_bind()
>             # does not unlock the table as it's locked by netlink_bind()
>                 __request_module()
>                     call_usermodehelper_exec()
>
>
> "modprobe nf_conntrack_netlink" runs and inits nf_conntrack_netlink:
>     ctnetlink_init()
>         register_pernet_subsys()
>             -> blocks on "pernet_ops_rwsem" thanks to xfrm_user module
>                 -> schedule()
>                     -> deadlock forever
>

Thanks for this detailed analysis.
In this specific case I think this is enough:

diff --git a/net/netfilter/nfnetlink.c b/net/netfilter/nfnetlink.c
index 92077d459109..61ba92415480 100644
--- a/net/netfilter/nfnetlink.c
+++ b/net/netfilter/nfnetlink.c
@@ -578,7 +578,8 @@ static int nfnetlink_bind(struct net *net, int group)
 	ss = nfnetlink_get_subsys(type << 8);
 	rcu_read_unlock();
 	if (!ss)
-		request_module("nfnetlink-subsys-%d", type);
+		request_module_nowait("nfnetlink-subsys-%d", type);
 	return 0;
 }
 #endif