Re: [PATCH RFC 0/9] nfs/sunrpc: stop holding netns references in client-side NFS and RPC objects

Jeff Layton <jlayton@xxxxxxxxxx> · Tue, 18 Mar 2025 07:30:13 -0400

On Mon, 2025-03-17 at 22:11 +0000, Trond Myklebust wrote:
> On Mon, 2025-03-17 at 17:57 -0400, Jeff Layton wrote:
> > On Mon, 2025-03-17 at 21:35 +0000, Trond Myklebust wrote:
> > > On Mon, 2025-03-17 at 16:59 -0400, Jeff Layton wrote:
> > > > We have a long-standing problem with containers that have NFS
> > > > mounts
> > > > in
> > > > them. Best practice is to unmount gracefully, of course, but
> > > > sometimes
> > > > containers just spontaneously die (e.g. SIGSEGV in the init task
> > > > in
> > > > the
> > > > container). When that happens the orchestrator will see that all
> > > > of
> > > > the
> > > > tasks are dead, and will detach the mount namespace and kill off
> > > > the
> > > > network connection.
> > > > 
> > > > If there are RPCs in flight at the time, the rpc_clnt will try to
> > > > retransmit them indefinitely, but there is no hope of them ever
> > > > contacting the server since nothing in userland can reach the
> > > > netns
> > > > at that point to fix anything.
> > > > 
> > > > This patchset takes the approach of changing various nfs client
> > > > and
> > > > sunrpc objects to not hold a netns reference. Instead, when a
> > > > nfs_net
> > > > or
> > > > sunrpc_net is exiting, all nfs_server, nfs_client and rpc_clnt
> > > > objects
> > > > associated with it are shut down, and the pre_exit functions
> > > > block
> > > > until they are gone.
> > > > 
> > > > With this approach, when the last userland task in the container
> > > > exits,
> > > > the NFS and RPC clients get cleaned up automatically. As a bonus,
> > > > this
> > > > fixes another bug with the gssproxy RPC client that causes net
> > > > namespace
> > > > leaks in any container where it runs (details in the patch
> > > > descriptions).
> > > > 
> > > 
> > > So with this approach, what happens if the NFS mount was created in
> > > a
> > > container, but got bind mounted somewhere else?
> > > 
> > 
> > The lifetime of these objects are tied to the net namespace. If it
> > gets
> > bind-mounted into a different mount namespace, while the tasks are
> > setns()'ed into the correct net namespace, then I expect the mount
> > would end up shut down at that point and be unusable, just like if
> > you
> > echo 1 into the shutdown file in sysfs.
> > 
> > Hopefully no one is doing anything that silly. You wouldn't be able
> > to
> > upcall, for one thing, since there wouldn't be any more userland
> > processes attached to the netns.
> > 
> > I'll test that scenario and get back to you though. I do want to make
> > sure that that's not going to lead to a crash or anything.
> 
> I agree with you that it's not a sane scenario, and that there is no
> need to try to make it work. However the user space tools are there to
> allow it to happen, so we need to ensure that the kernel won't panic or
> cause any new exotic hangs.

Unfortunately, this does create a hang.

Bind-mounting it will cause the superblock's refcount to increase,
which keeps the nfs_server struct active. That holds a reference to the
nfs_client, which prevents everything from coming down properly in
pre_exit.

I'll have to think about how we can solve that. Let me know if you have
ideas.
-- 
Jeff Layton <jlayton@xxxxxxxxxx>