Re: [PATCH 2/6] nfsd: return hard failure for OP_SETCLIENTID when there are too many clients.

Chuck Lever <chuck.lever@xxxxxxxxxx> · Wed, 23 Oct 2024 09:42:39 -0400

On Wed, Oct 23, 2024 at 01:37:02PM +1100, NeilBrown wrote:
> If there are more non-courteous clients than the calculated limit, we
> should fail the request rather than report a soft failure and
> encouraging the client to retry indefinitely.

Discussion:

This change has the potential to cause behavior regressions. I'm not
sure how clients will behave (eg, what error is reported to
administrators) if EXCH_ID / SETCLIENTID returns SERVERFAULT.

I can't find a more suitable status code than SERVERFAULT, however.

There is also the question of whether CREATE_SESSION, which also
might fail when server resources are over-extended, could return a
similar hard failure. (CREATE_SESSION has other spec-mandated
restrictions around using NFS4ERR_DELAY, however).

> The only hard failure allowed for EXCHANGE_ID that doesn't clearly have
> some other meaning is NFS4ERR_SERVERFAULT.  So use that, but explain why
> in a comment at each place that it is returned.
> 
> If there are courteous clients which push us over the limit, then expedite
> their removal.
> 
> This is not known to have caused a problem is production use, but

The current DELAY behavior is known to trigger an (interruptible)
infinite loop when a small-memory server can't create a new session.
Thus I believe the infinite loop behavior is a real issue that has
been observed and reported.

> testing of lots of clients reports repeated NFS4ERR_DELAY responses
> which doesn't seem helpful.

No argument from me. NFSD takes the current approach for exactly the
reason you mention above: there isn't a good choice of status code
to return in this case.

Nit: the description might better explain how this change is related
to or required by on-demand thread allocation. It seems a little
orthogonal to me right now. NBD.

> Also remove an outdated comment - we do use a slab cache.
> 
> Reviewed-by: Jeff Layton <jlayton@xxxxxxxxxx>
> Signed-off-by: NeilBrown <neilb@xxxxxxx>
> ---
>  fs/nfsd/nfs4state.c | 30 ++++++++++++++++++++----------
>  1 file changed, 20 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> index 56b261608af4..ca6b5b52f77d 100644
> --- a/fs/nfsd/nfs4state.c
> +++ b/fs/nfsd/nfs4state.c
> @@ -2212,21 +2212,20 @@ STALE_CLIENTID(clientid_t *clid, struct nfsd_net *nn)
>  	return 1;
>  }
>  
> -/* 
> - * XXX Should we use a slab cache ?
> - * This type of memory management is somewhat inefficient, but we use it
> - * anyway since SETCLIENTID is not a common operation.
> - */
>  static struct nfs4_client *alloc_client(struct xdr_netobj name,
>  				struct nfsd_net *nn)
>  {
>  	struct nfs4_client *clp;
>  	int i;
>  
> -	if (atomic_read(&nn->nfs4_client_count) >= nn->nfs4_max_clients) {
> +	if (atomic_read(&nn->nfs4_client_count) -
> +	    atomic_read(&nn->nfsd_courtesy_clients) >= nn->nfs4_max_clients)
> +		return ERR_PTR(-EUSERS);
> +
> +	if (atomic_read(&nn->nfs4_client_count) >= nn->nfs4_max_clients &&
> +	    atomic_read(&nn->nfsd_courtesy_clients) > 0)
>  		mod_delayed_work(laundry_wq, &nn->laundromat_work, 0);
> -		return NULL;
> -	}
> +
>  	clp = kmem_cache_zalloc(client_slab, GFP_KERNEL);
>  	if (clp == NULL)
>  		return NULL;
> @@ -3121,8 +3120,8 @@ static struct nfs4_client *create_client(struct xdr_netobj name,
>  	struct dentry *dentries[ARRAY_SIZE(client_files)];
>  
>  	clp = alloc_client(name, nn);
> -	if (clp == NULL)
> -		return NULL;
> +	if (IS_ERR_OR_NULL(clp))
> +		return clp;
>  
>  	ret = copy_cred(&clp->cl_cred, &rqstp->rq_cred);
>  	if (ret) {
> @@ -3504,6 +3503,11 @@ nfsd4_exchange_id(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
>  	new = create_client(exid->clname, rqstp, &verf);
>  	if (new == NULL)
>  		return nfserr_jukebox;
> +	if (IS_ERR(new))
> +		/* Protocol has no specific error for "client limit reached".
> +		 * NFS4ERR_RESOURCE is not permitted for EXCHANGE_ID
> +		 */
> +		return nfserr_serverfault;
>  	status = copy_impl_id(new, exid);
>  	if (status)
>  		goto out_nolock;
> @@ -4422,6 +4426,12 @@ nfsd4_setclientid(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
>  	new = create_client(clname, rqstp, &clverifier);
>  	if (new == NULL)
>  		return nfserr_jukebox;
> +	if (IS_ERR(new))
> +		/* Protocol has no specific error for "client limit reached".
> +		 * NFS4ERR_RESOURCE, while allowed for SETCLIENTID, implies
> +		 * that a smaller COMPOUND might be successful.
> +		 */
> +		return nfserr_serverfault;
>  	spin_lock(&nn->client_lock);
>  	conf = find_confirmed_client_by_name(&clname, nn);
>  	if (conf && client_has_state(conf)) {
> -- 
> 2.46.0
> 

-- 
Chuck Lever