Re: [PATCH 6.1.y] net: tls: handle backlogging of crypto requests

Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> · Mon, 9 Sep 2024 17:56:12 +0000

On Mon, 2024-09-09 at 16:36 +0000, Oleksandr Tymoshenko wrote:
> > > nfs41_init_clientid does not signal a failure condition from
> > > nfs4_proc_exchange_id and nfs4_proc_create_session to a client which
> > > may lead to mount syscall indefinitely blocked in the following stack
> 
> > NACK. This will break all sorts of recovery scenarios, because it
> > doesn't distinguish between an initial 'mount' and a server reboot
> > recovery situation.
> > Even in the case where we are in the initial mount, it also doesn't
> > distinguish between transient errors such as NFS4ERR_DELAY or reboot
> > errors such as NFS4ERR_STALE_CLIENTID, etc.
> 
> > Exactly what is the scenario that is causing your hang? Let's try to
> > address that with a more targeted fix.
> 
> The scenario is as follows: there are several NFS servers and several
> production machines with multiple NFS mounts. This is a containerized
> multi-tenant workflow so every tenant gets its own NFS mount to access
> their data. At some point nfs41_init_clientid fails in the initial
> mount.nfs call and all subsequent mount.nfs calls just hang in
> nfs_wait_client_init_complete until the original one, where
> nfs4_proc_exchange_id has failed, is killed.
> 
> The cause of the nfs41_init_clientid failure in the production case is a
> timeout. The following error message is observed in logs:
>   NFS: state manager: lease expired failed on NFSv4 server <ip> with error 110
> 
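
For reference, error 110 in that log line is ETIMEDOUT on common Linux
architectures such as x86. The distinction discussed above (transient
protocol errors, reboot errors, and a plain timeout) can be sketched in
userspace as below. This is only an illustration, not kernel code: the
enum and classify_init_error() are hypothetical names, and the two
NFS4ERR_* values are the protocol status numbers from the NFSv4 RFCs.

```c
#include <errno.h>

/* NFSv4 protocol status values (from the NFSv4 RFCs). */
#define NFS4ERR_DELAY           10008
#define NFS4ERR_STALE_CLIENTID  10022

/* Hypothetical categories, named here purely for illustration. */
enum init_err_class {
	ERR_TRANSIENT,	/* retry later, e.g. NFS4ERR_DELAY */
	ERR_REBOOT,	/* server reboot recovery, e.g. NFS4ERR_STALE_CLIENTID */
	ERR_TIMEOUT,	/* transport timed out, e.g. ETIMEDOUT */
	ERR_OTHER	/* anything not classified above */
};

/*
 * Classify a status code. Both sign conventions are accepted, because
 * kernel code usually passes errors around as negative values.
 */
static enum init_err_class classify_init_error(int status)
{
	if (status < 0)
		status = -status;

	switch (status) {
	case NFS4ERR_DELAY:
		return ERR_TRANSIENT;
	case NFS4ERR_STALE_CLIENTID:
		return ERR_REBOOT;
	case ETIMEDOUT:
		return ERR_TIMEOUT;
	default:
		return ERR_OTHER;
	}
}
```

With this sketch, classify_init_error(-ETIMEDOUT) lands in ERR_TIMEOUT,
which is the only category the reported hang falls into.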

How about something like the following fix then?
8<-----------------------------------------------