Re: [PATCH 1/2 v2] NFSv4: Fix a livelock when CLOSE pre-emptively bumps state sequence

Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> · Tue, 22 Sep 2020 21:46:31 +0000

On Tue, 2020-09-22 at 17:15 -0400, Trond Myklebust wrote:
> On Tue, 2020-09-22 at 17:08 -0400, Benjamin Coddington wrote:
> > On 22 Sep 2020, at 15:48, Trond Myklebust wrote:
> > 
> > > On Tue, 2020-09-22 at 15:15 -0400, Benjamin Coddington wrote:
> > > > Since commit 0e0cb35b417f ("NFSv4: Handle NFS4ERR_OLD_STATEID
> > > > in
> > > > CLOSE/OPEN_DOWNGRADE") the following livelock may occur if a
> > > > CLOSE
> > > > races
> > > > with the update of the nfs_state:
> > > > 
> > > > Process 1           Process 2           Server
> > > > =========           =========           ========
> > > >  OPEN file
> > > >                     OPEN file
> > > >                                         Reply OPEN (1)
> > > >                                         Reply OPEN (2)
> > > >  Update state (1)
> > > >  CLOSE file (1)
> > > >                                         Reply OLD_STATEID (1)
> > > >  CLOSE file (2)
> > > >                                         Reply CLOSE (-1)
> > > >                     Update state (2)
> > > >                     wait for state change
> > > >  OPEN file
> > > >                     wake
> > > >  CLOSE file
> > > >  OPEN file
> > > >                     wake
> > > >  CLOSE file
> > > >  ...
> > > >                     ...
> > > > 
> > > > As long as the first process continues updating state, the
> > > > second
> > > > process
> > > > will fail to exit the loop in
> > > > nfs_set_open_stateid_locked().  This
> > > > livelock
> > > > has been observed in generic/168.
> > > > 
> > > > Fix this by detecting the case in
> > > > nfs_need_update_open_stateid()
> > > > and
> > > > then exit the loop if:
> > > >  - the state is NFS_OPEN_STATE, and
> > > >  - the stateid sequence is > 1, and
> > > >  - the stateid doesn't match the current open stateid
> > > > 
> > > > Fixes: 0e0cb35b417f ("NFSv4: Handle NFS4ERR_OLD_STATEID in
> > > > CLOSE/OPEN_DOWNGRADE")
> > > > Cc: stable@xxxxxxxxxxxxxxx # v5.4+
> > > > Signed-off-by: Benjamin Coddington <bcodding@xxxxxxxxxx>
> > > > ---
> > > >  fs/nfs/nfs4proc.c | 27 +++++++++++++++++----------
> > > >  1 file changed, 17 insertions(+), 10 deletions(-)
> > > > 
> > > > diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
> > > > index 6e95c85fe395..9db29a438540 100644
> > > > --- a/fs/nfs/nfs4proc.c
> > > > +++ b/fs/nfs/nfs4proc.c
> > > > @@ -1588,18 +1588,25 @@ static void
> > > > nfs_test_and_clear_all_open_stateid(struct nfs4_state *state)
> > > >  static bool nfs_need_update_open_stateid(struct nfs4_state
> > > > *state,
> > > >  		const nfs4_stateid *stateid)
> > > >  {
> > > > -	if (test_bit(NFS_OPEN_STATE, &state->flags) == 0 ||
> > > > -	    !nfs4_stateid_match_other(stateid, &state-
> > > > >open_stateid)) {
> > > > -		if (stateid->seqid == cpu_to_be32(1))
> > > > +	bool state_matches_other =
> > > > nfs4_stateid_match_other(stateid,
> > > > &state->open_stateid);
> > > > +	bool seqid_one = stateid->seqid == cpu_to_be32(1);
> > > > +
> > > > +	if (test_bit(NFS_OPEN_STATE, &state->flags)) {
> > > > +		/* The common case - we're updating to a new
> > > > sequence
> > > > number */
> > > > +		if (state_matches_other &&
> > > > nfs4_stateid_is_newer(stateid, &state->open_stateid)) {
> > > > +			nfs_state_log_out_of_order_open_stateid
> > > > (state,
> > > > stateid);
> > > > +			return true;
> > > > +		}
> > > > +	} else {
> > > > +		/* This is the first OPEN */
> > > > +		if (!state_matches_other && seqid_one) {
> > > 
> > > Why are we looking at the value of state_matches_other here? If
> > > the
> > > NFS_OPEN_STATE flags is not set, then the state->open_stateid is
> > > not
> > > initialised, so how does it make sense to look at a comparison
> > > with 
> > > the
> > > incoming stateid?
> > 
> > You're right - it doesn't make sense. I failed to clean it up out
> > of
> > my
> > decision matrix.  I'll send a v3 after it gets tested overnight.
> > 
> > Thanks for the look, and if you have another moment - what do you
> > think
> > about not bumping the seqid in the CLOSE/OPEN_DOWNGRADE path on 
> > OLD_STATEID
> > if the state's refcount > 1?  This would optimize away a lot of
> > recovery 
> > for
> > races like this, but I wonder if it would be possible to end up
> > with
> > two
> > OPEN_DOWNGRADEs dueling that would never recover.
> > 
> 
> It would lead to a return of the infinite loop in cases where the
> client misses a reply to an OPEN or CLOSE due to a soft timeout
> (which
> is an important use case for us).
> 

In principle, RFC5661 says that the client is supposed to process all
stateid operations in the order dictated by the seqid. The one problem
with that mandate is that open-by-filename doesn't allow us to know
whether or not a seqid bump is outstanding. This is why we have the 5
second timeout wait in nfs_set_open_stateid_locked().

We could do something similar to that for CLOSE by setting the seqid to
0, and then ensuring we process the CLOSE in the order the server ends
up processing it. Unfortunately, that would require us to replay any
OPEN calls that got shut down because we used seqid 0 (it would also
not work for NFSv4.0)... So perhaps the best solution would be to
instead allow CLOSE to wait for outstanding OPENs to complete, just
like we do in nfs_set_open_stateid_locked()? Thoughts?

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@xxxxxxxxxxxxxxx