On Wed, 2024-09-04 at 10:23 -0400, Chuck Lever wrote:
> On Mon, Sep 02, 2024 at 11:57:55AM +1000, NeilBrown wrote:
> > On Sun, 01 Sep 2024, syzbot wrote:
> > > syzbot has found a reproducer for the following issue on:
> >
> > I had a poke around using the provided disk image and kernel for
> > exploring.
> >
> > I think the problem is demonstrated by this stack :
> >
> > [<0>] rpc_wait_bit_killable+0x1b/0x160
> > [<0>] __rpc_execute+0x723/0x1460
> > [<0>] rpc_execute+0x1ec/0x3f0
> > [<0>] rpc_run_task+0x562/0x6c0
> > [<0>] rpc_call_sync+0x197/0x2e0
> > [<0>] rpcb_register+0x36b/0x670
> > [<0>] svc_unregister+0x208/0x730
> > [<0>] svc_bind+0x1bb/0x1e0
> > [<0>] nfsd_create_serv+0x3f0/0x760
> > [<0>] nfsd_nl_listener_set_doit+0x135/0x1a90
> > [<0>] genl_rcv_msg+0xb16/0xec0
> > [<0>] netlink_rcv_skb+0x1e5/0x430
> >
> > No rpcbind is running on this host so that "svc_unregister" takes a
> > long time. Maybe not forever but if a few of these get queued up all
> > blocking some other thread, then maybe that pushed it over the limit.
> >
> > The fact that rpcbind is not running might not be relevant as the test
> > messes up the network. "ping 127.0.0.1" stops working.
> >
> > So this bug comes down to "we try to contact rpcbind while holding a
> > mutex and if that gets no response and no error, then we can hold the
> > mutex for a long time".
> >
> > Are we surprised? Do we want to fix this? Any suggestions how?
>
> In the past, we've tried to address "hanging upcall" issues where
> the kernel part of an administrative command needs a user space
> service that isn't working or present. (eg mount needing a running
> gssd)
>
> If NFSD is using the kernel RPC client for the upcall, then maybe
> adding the RPC_TASK_SOFTCONN flag might turn the hang into an
> immediate failure.
>
> IMO this should be addressed.
>

Looking at rpcb_register_call, it looks like we already set SOFTCONN if
is_set is true. We probably did that assuming that we only call
svc_unregister on shutdown.
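
As a rough illustration of the flag choice under discussion, here is a
user-space model, not the kernel source. The function names and the
MODEL_TASK_SOFTCONN value are hypothetical placeholders (the real flag
lives in include/linux/sunrpc/sched.h), and the model assumes that
is_set == false corresponds to the unregister path:

```c
#include <assert.h>
#include <stdbool.h>

/* Placeholder bit, NOT the kernel's actual RPC_TASK_SOFTCONN value. */
#define MODEL_TASK_SOFTCONN 0x1

/* Behaviour as described in this thread: SOFTCONN only when is_set. */
int model_rpcb_flags_current(bool is_set)
{
	return is_set ? MODEL_TASK_SOFTCONN : 0;
}

/* Variant being asked about: request SOFTCONN regardless of is_set. */
int model_rpcb_flags_always(bool is_set)
{
	(void)is_set; /* deliberately ignored in this variant */
	return MODEL_TASK_SOFTCONN;
}

/* The two policies agree when registering and differ when unregistering. */
void model_check(void)
{
	assert(model_rpcb_flags_current(true) == model_rpcb_flags_always(true));
	assert(model_rpcb_flags_current(false) != model_rpcb_flags_always(false));
}
```

In this model the only call that changes behaviour is the unregister
case, which is the one sitting in the stack trace quoted above.
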
svc_rpcb_setup does this though:

	/* Remove any stale portmap registrations */
	svc_unregister(serv, net);
	return 0;

What would be the risk in just setting SOFTCONN unconditionally?
-- 
Jeff Layton <jlayton@xxxxxxxxxx>