On Mon, 2024-09-09 at 16:36 +0000, Oleksandr Tymoshenko wrote:
> > > nfs41_init_clientid does not signal a failure condition from
> > > nfs4_proc_exchange_id and nfs4_proc_create_session to a client
> > > which may lead to mount syscall indefinitely blocked in the
> > > following stack
> > NACK. This will break all sorts of recovery scenarios, because it
> > doesn't distinguish between an initial 'mount' and a server reboot
> > recovery situation.
> > Even in the case where we are in the initial mount, it also doesn't
> > distinguish between transient errors such as NFS4ERR_DELAY or
> > reboot errors such as NFS4ERR_STALE_CLIENTID, etc.
> >
> > Exactly what is the scenario that is causing your hang? Let's try
> > to address that with a more targeted fix.
>
> The scenario is as follows: there are several NFS servers and several
> production machines with multiple NFS mounts. This is a containerized
> multi-tennant workflow so every tennant gets its own NFS mount to
> access their data. At some point nfs41_init_clientid fails in the
> initial mount.nfs call and all subsequent mount.nfs calls just hang in
> nfs_wait_client_init_complete until the original one, where
> nfs4_proc_exchange_id has failed, is killed.
>
> The cause of the nfs41_init_clientid failure in the production case
> is a timeout.
> The following error message is observed in logs:
> NFS: state manager: lease expired failed on NFSv4 server <ip> with
> error 110
>

How about something like the following fix then?

8<-----------------------------------------------