>> nfs41_init_clientid does not signal a failure condition from
>> nfs4_proc_exchange_id and nfs4_proc_create_session to a client which
>> may lead to mount syscall indefinitely blocked in the following stack
>
> NACK. This will break all sorts of recovery scenarios, because it
> doesn't distinguish between an initial 'mount' and a server reboot
> recovery situation.
> Even in the case where we are in the initial mount, it also doesn't
> distinguish between transient errors such as NFS4ERR_DELAY or reboot
> errors such as NFS4ERR_STALE_CLIENTID, etc.
>
> Exactly what is the scenario that is causing your hang? Let's try to
> address that with a more targeted fix.

(Re-sending with the correct subject; the previous mistake was caused by a
failure in my tooling.)

The scenario is as follows: there are several NFS servers and several
production machines with multiple NFS mounts. This is a containerized
multi-tenant workflow, so every tenant gets its own NFS mount to access
its data.

At some point nfs41_init_clientid fails in the initial mount.nfs call, and
all subsequent mount.nfs calls then hang in nfs_wait_client_init_complete
until the original call, the one in which nfs4_proc_exchange_id failed, is
killed.

In the production case, the cause of the nfs41_init_clientid failure is a
timeout. The following error message is observed in the logs:

  NFS: state manager: lease expired failed on NFSv4 server <ip> with error 110