Re: [PATCH] sunrpc: add a rpc_clnt shutdown control in debugfs

On Wed, 2025-03-12 at 22:31 +0000, Trond Myklebust wrote:
> On Wed, 2025-03-12 at 10:37 -0400, Jeff Layton wrote:
> > On Wed, 2025-03-12 at 09:52 -0400, Benjamin Coddington wrote:
> > > On 12 Mar 2025, at 9:36, Jeff Layton wrote:
> > > 
> > > > There have been confirmed reports where a container with an NFS
> > > > mount inside it dies abruptly, along with all of its processes,
> > > > but the NFS client sticks around and keeps trying to send RPCs
> > > > after the networking is gone.
> > > > 
> > > > We have a reproducer where if we SIGKILL a container with an NFS
> > > > mount, the RPC clients will stick around indefinitely. The
> > > > orchestrator does a MNT_DETACH unmount on the NFS mount, and then
> > > > tears down the networking while there are still RPCs in flight.
> > > > 
> > > > Recently new controls were added[1] that allow shutting down an
> > > > NFS mount. That doesn't help here since the mount namespace is
> > > > detached from any tasks at this point.
> > > 
> > > That's interesting - seems like the orchestrator could just
> > > reorder its request to shutdown before detaching the mount
> > > namespace.  Not an objection, just wondering why the MNT_DETACH
> > > must come first.
> > > 
> > 
> > The reproducer we have is to systemd-nspawn a container, mount up
> > an NFS mount inside it, start some I/O on it with fio, and then
> > kill -9 the systemd running inside the container. There isn't much
> > the orchestrator (root-level systemd) can do at that point other
> > than clean up what's left.
> > 
> > I'm still working on a way to reliably detect when this has happened.
> > For now, we just have to notice that some clients aren't dying.
> > 
> > > > Transplant shutdown_client() to the sunrpc module, and give it
> > > > a more distinct name. Add a new debugfs sunrpc/rpc_clnt/*/shutdown
> > > > knob that allows the same functionality as the one in
> > > > /sys/fs/nfs, but at the rpc_clnt level.
> > > > 
> > > > [1]: commit d9615d166c7e ("NFS: add sysfs shutdown knob").
> > > > 
> > > > Signed-off-by: Jeff Layton <jlayton@xxxxxxxxxx>
> > > 
> > > I have a TODO to patch Documentation/ for this knob, mostly to
> > > write warnings, because there are some potential "gotchas" here -
> > > for example, you can have shared RPC clients, and shutting down
> > > one of those can cause problems for a different mount (this is
> > > true today with the /sys/fs/nfs/[bdi]/shutdown knob).  Shutting
> > > down arbitrary clients will definitely break things in weird ways;
> > > it's not a safe place to explore.
> > > 
> > 
> > Yes, you really do need to know what you're doing. 0200 permissions
> > are essential for this file, IOW. Thanks for the R-b!
> 
> Sorry, but NACK! We should not be adding control mechanisms to debugfs.
> 

Ok. Would adding sunrpc controls under sysfs be more acceptable? I do
agree that this is a potential footgun, however. It would be nicer to
clean this situation up automagically.

> One thing that might work in situations like this is perhaps to make
> use of the fact that we are monitoring whether or not rpc_pipefs is
> mounted. So if the mount is containerised, and the orchestrator
> unmounts everything, including rpc_pipefs, we might take that as a hint
> that we should treat any future connection errors as being fatal.
> 

rpc_pipefs isn't being mounted at all in the container I'm using. I
think that's not going to be a reliable test for this.

> Otherwise, we'd have to be able to monitor the root task, and check
> if it is still alive in order to figure out whether our containerised
> world has collapsed.
> 

If by the root task, you mean the initial task in the container, then
that method seems a little sketchy too. How would we determine that
from the RPC layer?

To be clear: the situation here is that we have a container with a veth
device that is communicating with the outside world. Once all of the
processes in the container exit, the veth device in the container
disappears. The rpc_xprt holds a ref on the netns though, so that
sticks around trying to retransmit indefinitely.

I think what we really need is a lightweight reference on the netns.
Something where we can tell that there are no userland tasks that care
about it anymore, so we can be more aggressive about giving up on it.

There is a "passive" refcount inside struct net, but that's not quite
what we need as it won't keep the sunrpc_net in place.

What if, instead of holding a netns reference in the xprt, we have it
hold a reference on a new refcount_t that lives in sunrpc_net? Then we
add a pre_exit pernet_ops callback that does a shutdown_client() on all
of the rpc_clnts attached to the xprts in that netns. The pre_exit can
then just block until the sunrpc_net refcount goes to 0.

I think that would allow everything to be cleaned up properly?
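
Something like this completely untested sketch, maybe. (The xprt_count
and xprt_wq fields, the get/put helpers, and shutdown_all_clients() are
made-up names, just to illustrate the shape of the idea.)

#include <linux/refcount.h>
#include <linux/wait.h>
#include <net/netns/generic.h>

struct sunrpc_net {
        /* ...existing fields... */
        refcount_t xprt_count;          /* one ref per rpc_xprt in this netns */
        wait_queue_head_t xprt_wq;      /* woken when xprt_count drops to zero */
};

static __net_init int sunrpc_init_net(struct net *net)
{
        struct sunrpc_net *sn = net_generic(net, sunrpc_net_id);

        /* ...existing setup... */
        refcount_set(&sn->xprt_count, 1);       /* "netns is alive" reference */
        init_waitqueue_head(&sn->xprt_wq);
        return 0;
}

/* xprt creation/teardown would take/put these instead of get_net()/put_net() */
static void sunrpc_net_get_xprt(struct sunrpc_net *sn)
{
        refcount_inc(&sn->xprt_count);
}

static void sunrpc_net_put_xprt(struct sunrpc_net *sn)
{
        if (refcount_dec_and_test(&sn->xprt_count))
                wake_up(&sn->xprt_wq);
}

static void sunrpc_pernet_pre_exit(struct net *net)
{
        struct sunrpc_net *sn = net_generic(net, sunrpc_net_id);

        /* shut down every rpc_clnt attached to an xprt in this netns */
        shutdown_all_clients(net);

        /* drop the initial ref and wait for the last xprt to go away */
        sunrpc_net_put_xprt(sn);
        wait_event(sn->xprt_wq, refcount_read(&sn->xprt_count) == 0);
}

static struct pernet_operations sunrpc_net_ops = {
        .init           = sunrpc_init_net,
        .exit           = sunrpc_exit_net,
        .pre_exit       = sunrpc_pernet_pre_exit,
        .id             = &sunrpc_net_id,
        .size           = sizeof(struct sunrpc_net),
};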
-- 
Jeff Layton <jlayton@xxxxxxxxxx>




