On Mon, Mar 04, 2024 at 06:11:24PM -0500, Jeff Layton wrote:
> On Mon, 2024-03-04 at 17:54 -0500, Chuck Lever wrote:
> > On Tue, Mar 05, 2024 at 09:36:29AM +1100, NeilBrown wrote:
> > > On Tue, 05 Mar 2024, Chuck Lever wrote:
> > > > On Tue, Mar 05, 2024 at 08:45:45AM +1100, NeilBrown wrote:
> > > > > On Tue, 05 Mar 2024, Chuck Lever wrote:
> > > > > > On Mon, Mar 04, 2024 at 03:40:21PM +1100, NeilBrown wrote:
> > > > > > > move_to_close_lru() waits for sc_count to become zero while holding
> > > > > > > rp_mutex. This can deadlock if another thread holds a reference and is
> > > > > > > waiting for rp_mutex.
> > > > > > >
> > > > > > > By the time we get to move_to_close_lru() the openowner is unhashed and
> > > > > > > cannot be found any more. So code waiting for the mutex can safely
> > > > > > > retry the lookup if move_to_close_lru() has started.
> > > > > > >
> > > > > > > So change rp_mutex to an atomic_t with three states:
> > > > > > >
> > > > > > >   RP_UNLOCKED - state is still hashed, not locked for reply
> > > > > > >   RP_LOCKED   - state is still hashed, is locked for reply
> > > > > > >   RP_UNHASHED - state is not hashed, no code can get a lock.
> > > > > > >
> > > > > > > Use wait_var_event() to wait for either a lock, or for the owner to be
> > > > > > > unhashed. In the latter case, retry the lookup.
> > > > > > >
> > > > > > > Signed-off-by: NeilBrown <neilb@xxxxxxx>
> > > > > > > ---
> > > > > > >  fs/nfsd/nfs4state.c | 38 +++++++++++++++++++++++++++++++-------
> > > > > > >  fs/nfsd/state.h     |  2 +-
> > > > > > >  2 files changed, 32 insertions(+), 8 deletions(-)
> > > > > > >
> > > > > > > diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> > > > > > > index 690d0e697320..47e879d5d68a 100644
> > > > > > > --- a/fs/nfsd/nfs4state.c
> > > > > > > +++ b/fs/nfsd/nfs4state.c
> > > > > > > @@ -4430,21 +4430,32 @@ nfsd4_init_leases_net(struct nfsd_net *nn)
> > > > > > >  	atomic_set(&nn->nfsd_courtesy_clients, 0);
> > > > > > >  }
> > > > > > >
> > > > > > > +enum rp_lock {
> > > > > > > +	RP_UNLOCKED,
> > > > > > > +	RP_LOCKED,
> > > > > > > +	RP_UNHASHED,
> > > > > > > +};
> > > > > > > +
> > > > > > >  static void init_nfs4_replay(struct nfs4_replay *rp)
> > > > > > >  {
> > > > > > >  	rp->rp_status = nfserr_serverfault;
> > > > > > >  	rp->rp_buflen = 0;
> > > > > > >  	rp->rp_buf = rp->rp_ibuf;
> > > > > > > -	mutex_init(&rp->rp_mutex);
> > > > > > > +	atomic_set(&rp->rp_locked, RP_UNLOCKED);
> > > > > > >  }
> > > > > > >
> > > > > > > -static void nfsd4_cstate_assign_replay(struct nfsd4_compound_state *cstate,
> > > > > > > -		struct nfs4_stateowner *so)
> > > > > > > +static int nfsd4_cstate_assign_replay(struct nfsd4_compound_state *cstate,
> > > > > > > +		struct nfs4_stateowner *so)
> > > > > > >  {
> > > > > > >  	if (!nfsd4_has_session(cstate)) {
> > > > > > > -		mutex_lock(&so->so_replay.rp_mutex);
> > > > > > > +		wait_var_event(&so->so_replay.rp_locked,
> > > > > > > +			       atomic_cmpxchg(&so->so_replay.rp_locked,
> > > > > > > +					      RP_UNLOCKED, RP_LOCKED) != RP_LOCKED);
> > > > > >
> > > > > > What reliably prevents this from being a "wait forever" ?
> > > > >
> > > > > That same thing that reliably prevented the original mutex_lock from
> > > > > waiting forever. Note that this patch fixes a deadlock here.
So clearly, there /were/ situations where "waiting forever" was
possible with the mutex version of this code.

> > > > > It waits until rp_locked is set to RP_UNLOCKED, which is precisely when
> > > > > we previously called mutex_unlock. But it *also* aborts the wait if
> > > > > rp_locked is set to RP_UNHASHED - so it is now more reliable.
> > > > >
> > > > > Does that answer the question?
> > > >
> > > > Hm. I guess then we are no worse off with wait_var_event().
> > > >
> > > > I'm not as familiar with this logic as perhaps I should be. How long
> > > > does it take for the wake-up to occur, typically?
> > >
> > > wait_var_event() is paired with wake_up_var().
> > > The wake up happens when wake_up_var() is called, which in this code is
> > > always immediately after atomic_set() updates the variable.
> >
> > I'm trying to ascertain the actual wall-clock time that the nfsd thread
> > is sleeping, at most. Is this going to be a possible DoS vector? Can
> > it impact the ability for the server to shut down without hanging?
>
> Prior to this patch, there was a mutex in play here and we just released
> it to wake up the waiters. This is more or less doing the same thing, it
> just indicates the resulting state better.

Well, it adds a third state so that a recovery action can be taken on
wake-up in some cases. That avoids a deadlock, so this does count as a
bug fix.

> I doubt this will materially change how long the tasks are waiting.

It might not be a longer wait, but it still seems difficult to prove
that the wait_var_event() will /always/ be awoken somehow.

Applying for now.

--
Chuck Lever