Re: [PATCH 5/6] nfsd: add support for freeing unused session-DRC slots

"NeilBrown" <neilb@xxxxxxx> · Tue, 28 Jan 2025 09:57:06 +1100

On Tue, 28 Jan 2025, Chuck Lever wrote:
> On 1/26/25 11:08 PM, NeilBrown wrote:
> > On Sun, 19 Jan 2025, Chuck Lever wrote:
> >> On 12/11/24 4:47 PM, NeilBrown wrote:
> >>> Reducing the number of slots in the session slot table requires
> >>> confirmation from the client.  This patch adds reduce_session_slots()
> >>> which starts the process of getting confirmation, but never calls it.
> >>> That will come in a later patch.
> >>>
> >>> Before we can free a slot we need to confirm that the client won't try
> >>> to use it again.  This involves returning a lower cr_maxrequests in a
> >>> SEQUENCE reply and then seeing a ca_maxrequests on the same slot which
> >>> is not larger than we limit we are trying to impose.  So for each slot
> >>> we need to remember that we have sent a reduced cr_maxrequests.
> >>>
> >>> To achieve this we introduce a concept of request "generations".  Each
> >>> time we decide to reduce cr_maxrequests we increment the generation
> >>> number, and record this when we return the lower cr_maxrequests to the
> >>> client.  When a slot with the current generation reports a low
> >>> ca_maxrequests, we commit to that level and free extra slots.
> >>>
> >>> We use an 16 bit generation number (64 seems wasteful) and if it cycles
> >>> we iterate all slots and reset the generation number to avoid false matches.
> >>>
> >>> When we free a slot we store the seqid in the slot pointer so that it can
> >>> be restored when we reactivate the slot.  The RFC can be read as
> >>> suggesting that the slot number could restart from one after a slot is
> >>> retired and reactivated, but also suggests that retiring slots is not
> >>> required.  So when we reactive a slot we accept with the next seqid in
> >>> sequence, or 1.
> >>>
> >>> When decoding sa_highest_slotid into maxslots we need to add 1 - this
> >>> matches how it is encoded for the reply.
> >>>
> >>> se_dead is moved in struct nfsd4_session to remove a hole.
> >>>
> >>> Reviewed-by: Jeff Layton <jlayton@xxxxxxxxxx>
> >>> Signed-off-by: NeilBrown <neilb@xxxxxxx>
> >>> ---
> >>>    fs/nfsd/nfs4state.c | 94 ++++++++++++++++++++++++++++++++++++++++-----
> >>>    fs/nfsd/nfs4xdr.c   |  5 ++-
> >>>    fs/nfsd/state.h     |  6 ++-
> >>>    fs/nfsd/xdr4.h      |  2 -
> >>>    4 files changed, 92 insertions(+), 15 deletions(-)
> >>>
> >>> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> >>> index fd9473d487f3..a2d1f97b8a0e 100644
> >>> --- a/fs/nfsd/nfs4state.c
> >>> +++ b/fs/nfsd/nfs4state.c
> >>> @@ -1910,17 +1910,69 @@ gen_sessionid(struct nfsd4_session *ses)
> >>>    #define NFSD_MIN_HDR_SEQ_SZ  (24 + 12 + 44)
> >>>    
> >>>    static void
> >>> -free_session_slots(struct nfsd4_session *ses)
> >>> +free_session_slots(struct nfsd4_session *ses, int from)
> >>>    {
> >>>    	int i;
> >>>    
> >>> -	for (i = 0; i < ses->se_fchannel.maxreqs; i++) {
> >>> +	if (from >= ses->se_fchannel.maxreqs)
> >>> +		return;
> >>> +
> >>> +	for (i = from; i < ses->se_fchannel.maxreqs; i++) {
> >>>    		struct nfsd4_slot *slot = xa_load(&ses->se_slots, i);
> >>>    
> >>> -		xa_erase(&ses->se_slots, i);
> >>> +		/*
> >>> +		 * Save the seqid in case we reactivate this slot.
> >>> +		 * This will never require a memory allocation so GFP
> >>> +		 * flag is irrelevant
> >>> +		 */
> >>> +		xa_store(&ses->se_slots, i, xa_mk_value(slot->sl_seqid), 0);
> >>>    		free_svc_cred(&slot->sl_cred);
> >>>    		kfree(slot);
> >>>    	}
> >>> +	ses->se_fchannel.maxreqs = from;
> >>> +	if (ses->se_target_maxslots > from)
> >>> +		ses->se_target_maxslots = from;
> >>> +}
> >>> +
> >>> +/**
> >>> + * reduce_session_slots - reduce the target max-slots of a session if possible
> >>> + * @ses:  The session to affect
> >>> + * @dec:  how much to decrease the target by
> >>> + *
> >>> + * This interface can be used by a shrinker to reduce the target max-slots
> >>> + * for a session so that some slots can eventually be freed.
> >>> + * It uses spin_trylock() as it may be called in a context where another
> >>> + * spinlock is held that has a dependency on client_lock.  As shrinkers are
> >>> + * best-effort, skiping a session is client_lock is already held has no
> >>> + * great coast
> >>> + *
> >>> + * Return value:
> >>> + *   The number of slots that the target was reduced by.
> >>> + */
> >>> +static int __maybe_unused
> >>> +reduce_session_slots(struct nfsd4_session *ses, int dec)
> >>> +{
> >>> +	struct nfsd_net *nn = net_generic(ses->se_client->net,
> >>> +					  nfsd_net_id);
> >>> +	int ret = 0;
> >>> +
> >>> +	if (ses->se_target_maxslots <= 1)
> >>> +		return ret;
> >>> +	if (!spin_trylock(&nn->client_lock))
> >>> +		return ret;
> >>> +	ret = min(dec, ses->se_target_maxslots-1);
> >>> +	ses->se_target_maxslots -= ret;
> >>> +	ses->se_slot_gen += 1;
> >>> +	if (ses->se_slot_gen == 0) {
> >>> +		int i;
> >>> +		ses->se_slot_gen = 1;
> >>> +		for (i = 0; i < ses->se_fchannel.maxreqs; i++) {
> >>> +			struct nfsd4_slot *slot = xa_load(&ses->se_slots, i);
> >>> +			slot->sl_generation = 0;
> >>> +		}
> >>> +	}
> >>> +	spin_unlock(&nn->client_lock);
> >>> +	return ret;
> >>>    }
> >>>    
> >>>    /*
> >>> @@ -1968,6 +2020,7 @@ static struct nfsd4_session *alloc_session(struct nfsd4_channel_attrs *fattrs,
> >>>    	}
> >>>    	fattrs->maxreqs = i;
> >>>    	memcpy(&new->se_fchannel, fattrs, sizeof(struct nfsd4_channel_attrs));
> >>> +	new->se_target_maxslots = i;
> >>>    	new->se_cb_slot_avail = ~0U;
> >>>    	new->se_cb_highest_slot = min(battrs->maxreqs - 1,
> >>>    				      NFSD_BC_SLOT_TABLE_SIZE - 1);
> >>> @@ -2081,7 +2134,7 @@ static void nfsd4_del_conns(struct nfsd4_session *s)
> >>>    
> >>>    static void __free_session(struct nfsd4_session *ses)
> >>>    {
> >>> -	free_session_slots(ses);
> >>> +	free_session_slots(ses, 0);
> >>>    	xa_destroy(&ses->se_slots);
> >>>    	kfree(ses);
> >>>    }
> >>> @@ -2684,6 +2737,9 @@ static int client_info_show(struct seq_file *m, void *v)
> >>>    	seq_printf(m, "session slots:");
> >>>    	list_for_each_entry(ses, &clp->cl_sessions, se_perclnt)
> >>>    		seq_printf(m, " %u", ses->se_fchannel.maxreqs);
> >>> +	seq_printf(m, "\nsession target slots:");
> >>> +	list_for_each_entry(ses, &clp->cl_sessions, se_perclnt)
> >>> +		seq_printf(m, " %u", ses->se_target_maxslots);
> >>>    	spin_unlock(&clp->cl_lock);
> >>>    	seq_puts(m, "\n");
> >>>    
> >>> @@ -3674,10 +3730,10 @@ nfsd4_exchange_id_release(union nfsd4_op_u *u)
> >>>    	kfree(exid->server_impl_name);
> >>>    }
> >>>    
> >>> -static __be32 check_slot_seqid(u32 seqid, u32 slot_seqid, bool slot_inuse)
> >>> +static __be32 check_slot_seqid(u32 seqid, u32 slot_seqid, u8 flags)
> >>>    {
> >>>    	/* The slot is in use, and no response has been sent. */
> >>> -	if (slot_inuse) {
> >>> +	if (flags & NFSD4_SLOT_INUSE) {
> >>>    		if (seqid == slot_seqid)
> >>>    			return nfserr_jukebox;
> >>>    		else
> >>> @@ -3686,6 +3742,8 @@ static __be32 check_slot_seqid(u32 seqid, u32 slot_seqid, bool slot_inuse)
> >>>    	/* Note unsigned 32-bit arithmetic handles wraparound: */
> >>>    	if (likely(seqid == slot_seqid + 1))
> >>>    		return nfs_ok;
> >>> +	if ((flags & NFSD4_SLOT_REUSED) && seqid == 1)
> >>> +		return nfs_ok;
> >>>    	if (seqid == slot_seqid)
> >>>    		return nfserr_replay_cache;
> >>>    	return nfserr_seq_misordered;
> >>> @@ -4236,8 +4294,7 @@ nfsd4_sequence(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
> >>>    	dprintk("%s: slotid %d\n", __func__, seq->slotid);
> >>>    
> >>>    	trace_nfsd_slot_seqid_sequence(clp, seq, slot);
> >>> -	status = check_slot_seqid(seq->seqid, slot->sl_seqid,
> >>> -					slot->sl_flags & NFSD4_SLOT_INUSE);
> >>> +	status = check_slot_seqid(seq->seqid, slot->sl_seqid, slot->sl_flags);
> >>>    	if (status == nfserr_replay_cache) {
> >>>    		status = nfserr_seq_misordered;
> >>>    		if (!(slot->sl_flags & NFSD4_SLOT_INITIALIZED))
> >>> @@ -4262,6 +4319,12 @@ nfsd4_sequence(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
> >>>    	if (status)
> >>>    		goto out_put_session;
> >>>    
> >>> +	if (session->se_target_maxslots < session->se_fchannel.maxreqs &&
> >>> +	    slot->sl_generation == session->se_slot_gen &&
> >>> +	    seq->maxslots <= session->se_target_maxslots)
> >>> +		/* Client acknowledged our reduce maxreqs */
> >>> +		free_session_slots(session, session->se_target_maxslots);
> >>> +
> >>>    	buflen = (seq->cachethis) ?
> >>>    			session->se_fchannel.maxresp_cached :
> >>>    			session->se_fchannel.maxresp_sz;
> >>> @@ -4272,9 +4335,11 @@ nfsd4_sequence(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
> >>>    	svc_reserve(rqstp, buflen);
> >>>    
> >>>    	status = nfs_ok;
> >>> -	/* Success! bump slot seqid */
> >>> +	/* Success! accept new slot seqid */
> >>>    	slot->sl_seqid = seq->seqid;
> >>> +	slot->sl_flags &= ~NFSD4_SLOT_REUSED;
> >>>    	slot->sl_flags |= NFSD4_SLOT_INUSE;
> >>> +	slot->sl_generation = session->se_slot_gen;
> >>>    	if (seq->cachethis)
> >>>    		slot->sl_flags |= NFSD4_SLOT_CACHETHIS;
> >>>    	else
> >>> @@ -4291,9 +4356,11 @@ nfsd4_sequence(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
> >>>    	 * the client might use.
> >>>    	 */
> >>>    	if (seq->slotid == session->se_fchannel.maxreqs - 1 &&
> >>> +	    session->se_target_maxslots >= session->se_fchannel.maxreqs &&
> >>>    	    session->se_fchannel.maxreqs < NFSD_MAX_SLOTS_PER_SESSION) {
> >>>    		int s = session->se_fchannel.maxreqs;
> >>>    		int cnt = DIV_ROUND_UP(s, 5);
> >>> +		void *prev_slot;
> >>>    
> >>>    		do {
> >>>    			/*
> >>> @@ -4303,18 +4370,25 @@ nfsd4_sequence(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
> >>>    			 */
> >>>    			slot = kzalloc(slot_bytes(&session->se_fchannel),
> >>>    				       GFP_NOWAIT);
> >>> +			prev_slot = xa_load(&session->se_slots, s);
> >>> +			if (xa_is_value(prev_slot) && slot) {
> >>> +				slot->sl_seqid = xa_to_value(prev_slot);
> >>> +				slot->sl_flags |= NFSD4_SLOT_REUSED;
> >>> +			}
> >>>    			if (slot &&
> >>>    			    !xa_is_err(xa_store(&session->se_slots, s, slot,
> >>>    						GFP_NOWAIT))) {
> >>>    				s += 1;
> >>>    				session->se_fchannel.maxreqs = s;
> >>> +				session->se_target_maxslots = s;
> >>>    			} else {
> >>>    				kfree(slot);
> >>>    				slot = NULL;
> >>>    			}
> >>>    		} while (slot && --cnt > 0);
> >>>    	}
> >>> -	seq->maxslots = session->se_fchannel.maxreqs;
> >>> +	seq->maxslots = max(session->se_target_maxslots, seq->maxslots);
> >>> +	seq->target_maxslots = session->se_target_maxslots;
> >>>    
> >>>    out:
> >>>    	switch (clp->cl_cb_state) {
> >>> diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
> >>> index 53fac037611c..4dcb03cd9292 100644
> >>> --- a/fs/nfsd/nfs4xdr.c
> >>> +++ b/fs/nfsd/nfs4xdr.c
> >>> @@ -1884,7 +1884,8 @@ nfsd4_decode_sequence(struct nfsd4_compoundargs *argp,
> >>>    		return nfserr_bad_xdr;
> >>>    	seq->seqid = be32_to_cpup(p++);
> >>>    	seq->slotid = be32_to_cpup(p++);
> >>> -	seq->maxslots = be32_to_cpup(p++);
> >>> +	/* sa_highest_slotid counts from 0 but maxslots  counts from 1 ... */
> >>> +	seq->maxslots = be32_to_cpup(p++) + 1;
> >>>    	seq->cachethis = be32_to_cpup(p);
> >>>    
> >>>    	seq->status_flags = 0;
> >>> @@ -4968,7 +4969,7 @@ nfsd4_encode_sequence(struct nfsd4_compoundres *resp, __be32 nfserr,
> >>>    	if (nfserr != nfs_ok)
> >>>    		return nfserr;
> >>>    	/* sr_target_highest_slotid */
> >>> -	nfserr = nfsd4_encode_slotid4(xdr, seq->maxslots - 1);
> >>> +	nfserr = nfsd4_encode_slotid4(xdr, seq->target_maxslots - 1);
> >>>    	if (nfserr != nfs_ok)
> >>>    		return nfserr;
> >>>    	/* sr_status_flags */
> >>> diff --git a/fs/nfsd/state.h b/fs/nfsd/state.h
> >>> index aad547d3ad8b..4251ff3c5ad1 100644
> >>> --- a/fs/nfsd/state.h
> >>> +++ b/fs/nfsd/state.h
> >>> @@ -245,10 +245,12 @@ struct nfsd4_slot {
> >>>    	struct svc_cred sl_cred;
> >>>    	u32	sl_datalen;
> >>>    	u16	sl_opcnt;
> >>> +	u16	sl_generation;
> >>>    #define NFSD4_SLOT_INUSE	(1 << 0)
> >>>    #define NFSD4_SLOT_CACHETHIS	(1 << 1)
> >>>    #define NFSD4_SLOT_INITIALIZED	(1 << 2)
> >>>    #define NFSD4_SLOT_CACHED	(1 << 3)
> >>> +#define NFSD4_SLOT_REUSED	(1 << 4)
> >>>    	u8	sl_flags;
> >>>    	char	sl_data[];
> >>>    };
> >>> @@ -321,7 +323,6 @@ struct nfsd4_session {
> >>>    	u32			se_cb_slot_avail; /* bitmap of available slots */
> >>>    	u32			se_cb_highest_slot;	/* highest slot client wants */
> >>>    	u32			se_cb_prog;
> >>> -	bool			se_dead;
> >>>    	struct list_head	se_hash;	/* hash by sessionid */
> >>>    	struct list_head	se_perclnt;
> >>>    	struct nfs4_client	*se_client;
> >>> @@ -331,6 +332,9 @@ struct nfsd4_session {
> >>>    	struct list_head	se_conns;
> >>>    	u32			se_cb_seq_nr[NFSD_BC_SLOT_TABLE_SIZE];
> >>>    	struct xarray		se_slots;	/* forward channel slots */
> >>> +	u16			se_slot_gen;
> >>> +	bool			se_dead;
> >>> +	u32			se_target_maxslots;
> >>>    };
> >>>    
> >>>    /* formatted contents of nfs4_sessionid */
> >>> diff --git a/fs/nfsd/xdr4.h b/fs/nfsd/xdr4.h
> >>> index 382cc1389396..c26ba86dbdfd 100644
> >>> --- a/fs/nfsd/xdr4.h
> >>> +++ b/fs/nfsd/xdr4.h
> >>> @@ -576,9 +576,7 @@ struct nfsd4_sequence {
> >>>    	u32			slotid;			/* request/response */
> >>>    	u32			maxslots;		/* request/response */
> >>>    	u32			cachethis;		/* request */
> >>> -#if 0
> >>>    	u32			target_maxslots;	/* response */
> >>> -#endif /* not yet */
> >>>    	u32			status_flags;		/* response */
> >>>    };
> >>>    
> >>
> >> Hi Neil -
> >>
> >> I've found some misbehavior which I've bisected to this commit.
> > 
> > Hi Chuck,
> >   could you please confirm that it really was this commit that you
> >   bisected to?  Not the next one?
> 
> It's this one. I included the hunk that introduces the misbehavior
> below. It's when the server starts returning a different value for
> target_highest_slotid in the SEQUENCE result. The target_highest
> is 63 -- the number the server has in its slot table. The maxslots
> value is smaller.
> 
> In the working case, these two values never differ.
> 
> 
> >   Because this commit never reduces ->se_target_maxslots, so the
> >   patch which you say removed the symptom should be a no-op.
> 
> It's not the reducing of target_highest that's the problem. Rather
> it's that the target_highest and max in-use slot IDs are different
> values for a brief period after a reconnect.
> 
> That triggers the client to think that the server has reduced its
> slot table size, so the client shrinks its slot table. The server has
> not actually shrunken it, however, so it continues to expect the client
> to use the large slot sequence numbers for those slots.
> 
> When the client starts to use one of those slots again, it uses a
> sequence number of 1, and that fails.
> 
> 
> >   Even if it was the next commit I'm struggling to pin down the
> >   problem.  Here is my current analysis - partly to ensure I can present
> >   it clearly.
> > 
> >   The evidence suggests that the client has retired a slot that the
> >   server hasn't.  This happens when nfs41_set_server_slotid_locked()
> >   calls nfsd4_shrink_slot_table(), and nothing will happen if any slots
> >   before the new limit are still in use.  If the server reduces
> >   its idea of the target when the client isn't even using that many,
> >   the slots can be freed immediately that the client gets a reply
> >   indicating the new highest_slot number from the server.
> > 
> >   The server will not free these slots immediately but will wait to get a
> >   confirmation from the client that it has accepted the new limit.  But,
> >   importantly, the server will not increase the limit that it sends to
> >   the client until after it has has a chance to free the retired slots.
> >   If the server doesn't increase the limit, then the client won't try to
> >   use the retired slots...
> > 
> >   Do you still have the network trace which chows the error?  Would I be
> >   able to look at it?
> 
> Sending via WeTransfer.

Thanks.  I see the problem clearly now.  It isn't so much that 'high
slot' and 'target' are different, it is that they are both wrong.
'highslot' (which is the max the server will accept) is typically 0 or
4, and 'target' is 63.

The 'highest slot' has been copied from what the client said was the max
currently in use.  I don't know where the 'target' came from, maybe from
the previous reply sent.

This bug was introduced in 
   nfsd: allocate new session-based DRC slots on demand.
I should have move the seq->maxslots assignment after the "out:", not
before.

I'll send a patch.

> 
> But also, it's easy enough to reproduce.
> 
> Build your server with CONFIG_FAIL_SUNRPC set. Reboot into the new
> kernel.
> 
> Before each test, run this script on the server:
> 
> #!/usr/bin/bash
> 
> cd /sys/kernel/debug/fail_sunrpc/
> 
> echo Y > ignore-cache-wait
> echo Y > ignore-client-disconnect
> echo 24847 > interval
> echo 97 > times
> echo 100 > probability
> 
> exit 0
> 
> On the client, run fstests with an NFSv4.1 mount. It will usually hang
> within the first 15 tests.
> 

I might give that a try - thanks.

NeilBrown