On Mon, 2025-03-17 at 22:11 +0000, Trond Myklebust wrote: > On Mon, 2025-03-17 at 17:57 -0400, Jeff Layton wrote: > > On Mon, 2025-03-17 at 21:35 +0000, Trond Myklebust wrote: > > > On Mon, 2025-03-17 at 16:59 -0400, Jeff Layton wrote: > > > > We have a long-standing problem with containers that have NFS > > > > mounts > > > > in > > > > them. Best practice is to unmount gracefully, of course, but > > > > sometimes > > > > containers just spontaneously die (e.g. SIGSEGV in the init task > > > > in > > > > the > > > > container). When that happens the orchestrator will see that all > > > > of > > > > the > > > > tasks are dead, and will detach the mount namespace and kill off > > > > the > > > > network connection. > > > > > > > > If there are RPCs in flight at the time, the rpc_clnt will try to > > > > retransmit them indefinitely, but there is no hope of them ever > > > > contacting the server since nothing in userland can reach the > > > > netns > > > > at that point to fix anything. > > > > > > > > This patchset takes the approach of changing various nfs client > > > > and > > > > sunrpc objects to not hold a netns reference. Instead, when a > > > > nfs_net > > > > or > > > > sunrpc_net is exiting, all nfs_server, nfs_client and rpc_clnt > > > > objects > > > > associated with it are shut down, and the pre_exit functions > > > > block > > > > until they are gone. > > > > > > > > With this approach, when the last userland task in the container > > > > exits, > > > > the NFS and RPC clients get cleaned up automatically. As a bonus, > > > > this > > > > fixes another bug with the gssproxy RPC client that causes net > > > > namespace > > > > leaks in any container where it runs (details in the patch > > > > descriptions). > > > > > > > > > > So with this approach, what happens if the NFS mount was created in > > > a > > > container, but got bind mounted somewhere else? > > > > > > > The lifetime of these objects are tied to the net namespace. If it > > gets > > bind-mounted into a different mount namespace, while the tasks are > > setns()'ed into the correct net namespace, then I expect the mount > > would end up shut down at that point and be unusable, just like if > > you > > echo 1 into the shutdown file in sysfs. > > > > Hopefully no one is doing anything that silly. You wouldn't be able > > to > > upcall, for one thing, since there wouldn't be any more userland > > processes attached to the netns. > > > > I'll test that scenario and get back to you though. I do want to make > > sure that that's not going to lead to a crash or anything. > > I agree with you that it's not a sane scenario, and that there is no > need to try to make it work. However the user space tools are there to > allow it to happen, so we need to ensure that the kernel won't panic or > cause any new exotic hangs. Unfortunately, this does create a hang. Bind-mounting it will cause the superblock's refcount to increase, which keeps the nfs_server struct active. That holds a reference to the nfs_client, which prevents everything from coming down properly in pre_exit. I'll have to think about how we can solve that. Let me know if you have ideas. -- Jeff Layton <jlayton@xxxxxxxxxx>