> On Jun 22, 2018, at 6:31 PM, Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote:
>
> On Fri, 2018-06-22 at 17:49 -0400, Chuck Lever wrote:
>> Hi Bruce-
>>
>>
>>> On Jun 22, 2018, at 1:54 PM, J. Bruce Fields <bfields@xxxxxxxxxxxx> wrote:
>>>
>>> On Thu, Jun 21, 2018 at 04:35:33PM +0000, Manjunath Patil wrote:
>>>> Presently nfserr_jukebox is being returned by nfsd for create_session
>>>> request if server is unable to allocate a session slot. This may be
>>>> treated as NFS4ERR_DELAY by the clients and which may continue to re-try
>>>> create_session in loop leading NFSv4.1+ mounts in hung state. nfsd
>>>> should return nfserr_nospc in this case as per rfc5661(section-18.36.4
>>>> subpoint 4. Session creation).
>>>
>>> I don't think the spec actually gives us an error that we can use to say
>>> a CREATE_SESSION failed permanently for lack of resources.
>>
>> The current situation is that the server replies NFS4ERR_DELAY,
>> and the client retries indefinitely. The goal is to let the
>> client choose whether it wants to try the CREATE_SESSION again,
>> try a different NFS version, or fail the mount request.
>>
>> Bill and I both looked at this section of RFC 5661. It seems to
>> us that the use of NFS4ERR_NOSPC is appropriate and unambiguous
>> in this situation, and it is an allowed status for the
>> CREATE_SESSION operation. NFS4ERR_DELAY OTOH is not helpful.
>
> There are a range of errors which we may need to handle by destroying
> the session, and then creating a new one (mainly the ones where the
> client and server slot handling get out of sync). That's why returning
> NFS4ERR_NOSPC in response to CREATE_SESSION is unhelpful, and is why
> the only sane response by the client will be to treat it as a temporary
> error.
> IOW: these patches will not be acceptable, even with a rewrite, as they
> are based on a flawed assumption.

Fair enough. We're not attached to any particular solution/fix.
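To make the goal quoted above concrete (the client, not the server, decides when to stop retrying CREATE_SESSION), here is a rough user-space sketch. It is not the kernel client's code and not a proposed patch. The names establish_session, create_session_fn, fake_server and the session_result values are hypothetical, made up for illustration. Only the status values (NFS4ERR_DELAY = 10008, NFS4ERR_NOSPC = 28) come from RFC 5661.

```c
#include <stddef.h>

/* Status values as defined in RFC 5661. */
enum { NFS4_OK = 0, NFS4ERR_NOSPC = 28, NFS4ERR_DELAY = 10008 };

/* Hypothetical outcome handed back to the caller (e.g. mount.nfs). */
enum session_result { SESSION_OK, SESSION_GAVE_UP, SESSION_FAILED };

/* Hypothetical hook: sends one CREATE_SESSION and returns its status. */
typedef int (*create_session_fn)(void *server);

/*
 * Retry CREATE_SESSION while the server answers NFS4ERR_DELAY, but at
 * most max_tries times. After that the decision goes back to the caller,
 * which may retry, try another NFS version, or fail the mount.
 */
static enum session_result
establish_session(create_session_fn send, void *server, unsigned int max_tries)
{
	unsigned int i;

	for (i = 0; i < max_tries; i++) {
		int status = send(server);

		if (status == NFS4_OK)
			return SESSION_OK;
		if (status != NFS4ERR_DELAY)
			return SESSION_FAILED;	/* any other error: no retry here */
		/* A real client would back off (sleep) before the next try. */
	}
	return SESSION_GAVE_UP;
}

/* Stub server for illustration: replies DELAY until the counter hits zero. */
static int fake_server(void *server)
{
	unsigned int *delays_left = server;

	if (*delays_left == 0)
		return NFS4_OK;
	(*delays_left)--;
	return NFS4ERR_DELAY;
}
```

With the stub, a server that sends two DELAY replies and then succeeds yields SESSION_OK within five tries, while a server that stays busy yields SESSION_GAVE_UP after the budget is spent, so the caller is never held off indefinitely.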

So let's take "recovery of an active mount" out of the picture
for a moment.

The narrow problem is behavioral: during initial contact with an
unfamiliar server, the server can hold off a client indefinitely
by sending NFS4ERR_DELAY for example until another client unmounts.
We want to find a way to allow clients to make progress when a
server is short of resources.

It appears that the mount(2) system call does not return as long
as the server is still returning NFS4ERR_DELAY. Possibly user
space is never given an opportunity to stop retrying, and thus
mount.nfs gets stuck.

It appears that DELAY is OK for EXCHANGE_ID too. So if a server
decides to return DELAY to EXCHANGE_ID, I wonder if our client's
trunking detection would be hamstrung by one bad server...

--
Chuck Lever
chucklever@xxxxxxxxx

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html