Re: [PATCH 2/2] nfsd: return ENOSPC if unable to allocate a session slot

Chuck Lever <chucklever@xxxxxxxxx> · Sat, 23 Jun 2018 15:00:00 -0400

> On Jun 22, 2018, at 6:31 PM, Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote:
> 
> On Fri, 2018-06-22 at 17:49 -0400, Chuck Lever wrote:
>> Hi Bruce-
>> 
>> 
>>> On Jun 22, 2018, at 1:54 PM, J. Bruce Fields <bfields@xxxxxxxxxxxx> wrote:
>>> 
>>> On Thu, Jun 21, 2018 at 04:35:33PM +0000, Manjunath Patil wrote:
>>>> Presently nfserr_jukebox is being returned by nfsd for create_session
>>>> request if server is unable to allocate a session slot. This may be
>>>> treated as NFS4ERR_DELAY by the clients and which may continue to re-try
>>>> create_session in loop leading NFSv4.1+ mounts in hung state. nfsd
>>>> should return nfserr_nospc in this case as per rfc5661(section-18.36.4
>>>> subpoint 4. Session creation).
>>> 
>>> I don't think the spec actually gives us an error that we can use to say
>>> a CREATE_SESSION failed permanently for lack of resources.
>> 
>> The current situation is that the server replies NFS4ERR_DELAY,
>> and the client retries indefinitely. The goal is to let the
>> client choose whether it wants to try the CREATE_SESSION again,
>> try a different NFS version, or fail the mount request.
>> 
>> Bill and I both looked at this section of RFC 5661. It seems to
>> us that the use of NFS4ERR_NOSPC is appropriate and unambiguous
>> in this situation, and it is an allowed status for the
>> CREATE_SESSION operation. NFS4ERR_DELAY OTOH is not helpful.
> 
> There are a range of errors which we may need to handle by destroying
> the session, and then creating a new one (mainly the ones where the
> client and server slot handling get out of sync). That's why returning
> NFS4ERR_NOSPC in response to CREATE_SESSION is unhelpful, and is why
> the only sane response by the client will be to treat it as a temporary
> error.
>
> IOW: these patches will not be acceptable, even with a rewrite, as they
> are based on a flawed assumption.

Fair enough. We're not attached to any particular solution/fix.

So let's take "recovery of an active mount" out of the picture
for a moment.

The narrow problem is behavioral: during initial contact with an
unfamiliar server, the server can hold off a client indefinitely
by sending NFS4ERR_DELAY, for example until another client unmounts.
We want to find a way to allow clients to make progress when a
server is short of resources.

It appears that the mount(2) system call does not return as long
as the server is still returning NFS4ERR_DELAY. Possibly user
space is never given an opportunity to stop retrying, and thus
mount.nfs gets stuck.
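
To make the idea concrete, here is a rough user-space sketch of a bounded
retry that hands control back to the caller instead of looping forever.
It is only an illustration under stated assumptions: create_session_bounded(),
its callbacks, and the fake_server helper are hypothetical names, not
existing kernel or nfs-utils code. Only the NFS4ERR_DELAY value comes
from RFC 5661.

```c
#include <errno.h>
#include <stddef.h>

/* Status values as defined in RFC 5661. */
enum { NFS4_OK = 0, NFS4ERR_DELAY = 10008 };

/* Hypothetical callback: one CREATE_SESSION attempt, returns the NFSv4 status. */
typedef int (*create_session_fn)(void *ctx);
/* Hypothetical callback: wait for the given number of milliseconds. */
typedef void (*wait_fn)(void *ctx, unsigned int ms);

/*
 * Retry while the server answers NFS4ERR_DELAY, doubling the wait up to
 * max_delay_ms, but stop after max_attempts so the caller can decide
 * whether to retry, try another NFS version, or fail the mount.
 * Returns 0 on success, -EAGAIN when the retry budget is exhausted,
 * and -EIO for any other status.
 */
static int create_session_bounded(create_session_fn attempt, wait_fn wait,
                                  void *ctx, unsigned int max_attempts,
                                  unsigned int delay_ms,
                                  unsigned int max_delay_ms)
{
    for (unsigned int i = 0; i < max_attempts; i++) {
        int status = attempt(ctx);

        if (status == NFS4_OK)
            return 0;
        if (status != NFS4ERR_DELAY)
            return -EIO;
        if (i + 1 < max_attempts) {
            /* Back off before the next attempt, capping the delay. */
            wait(ctx, delay_ms);
            delay_ms = delay_ms * 2 > max_delay_ms ? max_delay_ms
                                                   : delay_ms * 2;
        }
    }
    return -EAGAIN;
}

/* Stand-in for a server that answers NFS4ERR_DELAY a fixed number of times. */
struct fake_server {
    unsigned int delays_left;
    unsigned int calls;
    unsigned int waited_ms;
};

static int fake_attempt(void *ctx)
{
    struct fake_server *s = ctx;

    s->calls++;
    if (s->delays_left > 0) {
        s->delays_left--;
        return NFS4ERR_DELAY;
    }
    return NFS4_OK;
}

/* Records the requested wait instead of actually sleeping. */
static void fake_wait(void *ctx, unsigned int ms)
{
    struct fake_server *s = ctx;

    s->waited_ms += ms;
}
```

With a fake server that never stops returning NFS4ERR_DELAY, a call such
as create_session_bounded(fake_attempt, fake_wait, &srv, 5, 100, 1000)
gives up after five attempts and returns -EAGAIN, which is the point at
which something like mount.nfs could choose what to do next.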

It appears that DELAY is OK for EXCHANGE_ID too. So if a server
decides to return DELAY to EXCHANGE_ID, I wonder if our client's
trunking detection would be hamstrung by one bad server...

--
Chuck Lever
chucklever@xxxxxxxxx
