Re: [PATCH 3/3] NFSv4.1: Detect and retry after OPEN and CLOSE/DOWNGRADE race

Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> · Tue, 17 Oct 2017 17:42:24 +0000

On Tue, 2017-10-17 at 13:33 -0400, Benjamin Coddington wrote:
> On 17 Oct 2017, at 11:49, Trond Myklebust wrote:
> 
> > On Tue, 2017-10-17 at 10:46 -0400, Benjamin Coddington wrote:
> > > If the client issues two simultaneous OPEN calls, and the
> > > response to
> > > the
> > > first OPEN call (sequence id 1) is delayed before updating the
> > > nfs4_state,
> > > then the client's nfs4_state can transition through a complete
> > > lifecycle of
> > > OPEN (state sequence id 2), and CLOSE (state sequence id
> > > 3).  When
> > > the
> > > first OPEN is finally processed, the nfs4_state is incorrectly
> > > transitioned
> > > back to NFS_OPEN_STATE for the first OPEN (sequence id
> > > 1).  Subsequent calls
> > > to LOCK or CLOSE will receive NFS4ERR_BAD_STATEID, and trigger
> > > state
> > > recovery.
> > > 
> > > Fix this by passing back the result of need_update_open_stateid()
> > > to
> > > the
> > > open() path, with the result to try again if the OPEN's stateid
> > > should not
> > > be updated.
> > > 
> > 
> > Why are we worried about the special case where the client actually
> > finds the closed stateid in its cache?
> 
> Because I am hitting that case very frequently in generic/089, and I
> hate
> how unnecessary state recovery slows everything down.  I'm also 

Why is it being hit. Is the client processing stuff out of order?

> seeing a
> problem where generic/089 never completes, and I was trying to get
> this
> behavior out of the way.
> 
> > In the more general case of your race, the stateid might not be
> > found
> > at all because the CLOSE completes and is processed on the client
> > before it can process the reply from the delayed OPEN. If so, we
> > really
> > have no way to detect that the file has actually been closed by the
> > server until we see the NFS4ERR_BAD_STATEID.
> 
> I mentioned this case in the cover letter.  It's possible that the
> client
> could retain a record of a closed stateid in order to retry an OPEN
> in that

That would require us to retain all stateids until there are no more
pending OPEN calls.

> case.  Another approach may be to detect 'holes' in the state id
> sequence
> and not call CLOSE until each id is processed.  I think there's an
> existing

We can't know we have a hole until we know the starting value of the
seqid, which is undefined according to RFC 5661 section 3.3.12.

> problem now where this race (without the CLOSE) can incorrectly
> update the
> state flags with the result of the delayed OPEN's mode.
> 
> > Note also that section 18.2.4 says that "the server SHOULD return
> > the
> > invalid special stateid (the "other" value is zero and the "seqid"
> > field is NFS4_UINT32_MAX, see Section 8.2.3)" which further ensures
> > that we should not be able to match the OPEN stateid once the CLOSE
> > is
> > processed on the client.
> 
> Ah, right, so need_update_open_stateid() should handle that
> case.  But that
> doesn't mean there's no way to match the OPEN statid after a
> CLOSE.  We
> could still keep a record of the closed state with a flag instead of
> an
> incremented sequence id.
> 
> So perhaps keeping track of holes in the sequence would be a better
> approach, but then the OPEN response handling needs to call CLOSE in
> case
> the OPEN fails.. that seems more complicated.
> 
> Ben
> 
-- 
Trond Myklebust
Linux NFS client maintainer, PrimaryData
trond.myklebust@xxxxxxxxxxxxxxx
��.n��������+%������w��{.n�����{��w���jg��������ݢj����G�������j:+v���w�m������w�������h�����٥