Hi Neil,

On Fri, 2019-02-22 at 16:02 +1100, NeilBrown wrote:
> On Fri, Feb 22 2019, Trond Myklebust wrote:
> 
> > Hi Neil
> > 
> > On Fri, 2019-02-22 at 11:58 +1100, NeilBrown wrote:
> > > Hi,
> > > I have a report of an NFSv4.1 state management problem in our 4.4
> > > kernel which appears to be caused by a race between OPEN and
> > > OPEN_DOWNGRADE, and I don't think the race is fixed in mainline.
> > > 
> > > The test program creates multiple threads which open a file O_RDWR
> > > and write to it, then opens it O_RDONLY and reads from it.
> > > A network trace which shows the problem suggests that, at a point
> > > in time where there are some O_RDONLY opens and one O_RDWR open, a
> > > new O_RDWR open is requested just as the existing O_RDWR open is
> > > being closed.
> > > 
> > > The close of the O_RDWR clears state->n_rdwr early, so
> > > can_open_cached() fails for the new O_RDWR open, and it needs to
> > > go to the server. The open for O_RDWR doesn't increment n_rdwr
> > > until after the open succeeds, so nfs4_close_prepare() sees
> > > 
> > >   n_rdwr == 0
> > >   n_rdonly > 0
> > >   NFS_O_RDWR_STATE and NFS_O_RDONLY_STATE set
> > > 
> > > which causes it to choose an OPEN_DOWNGRADE.
> > > 
> > > What we see is an OPEN/share-all and an OPEN_DOWNGRADE/share-read
> > > request sent one after the other without waiting for a reply.
> > > The OPEN is processed first, then the OPEN_DOWNGRADE, resulting in
> > > a state that only allows reads. Then a WRITE is attempted, which
> > > fails. This enters an infinite loop with two steps:
> > >  - a WRITE gets NFS4ERR_OPENMODE
> > >  - a TEST_STATEID succeeds
> > > 
> > > Once an OPEN/share-all request has been sent, it isn't really
> > > correct to send an OPEN_DOWNGRADE/share-read request. However, the
> > > fact that the OPEN has been sent isn't visible to
> > > nfs4_close_prepare().
> > > 
> > > There is an asymmetry between open and close w.r.t. updating the
> > > n_[mode] counter and setting the NFS_O_[mode]_STATE bits.
> > > 
> > > For close, the counter is decremented, then the server is told,
> > > then the state bits are cleared.
> > > For open, the counter and the state bits are both updated only
> > > after the server has replied.
> > > 
> > > I understand that this is probably because the OPEN could fail,
> > > and incrementing a counter before we are sure of success seems
> > > unwise. But doing so would allow us to avoid the incorrect
> > > OPEN_DOWNGRADE.
> > > 
> > > Any suggestions on what a good solution would look like? Does it
> > > ever make sense for an OPEN request to be concurrent with a CLOSE
> > > or OPEN_DOWNGRADE? Maybe they should be serialized with each other
> > > (maybe not as fully serialized as in NFSv4.0, but more than they
> > > currently are in NFSv4.1).
> 
> Hi Trond,
> thanks for the reply.
> 
> > If the CLOSE/OPEN_DOWNGRADE is processed by the server _after_ the
> > OPEN, then the stateid presented by the client will have an
> > incorrect seqid, and so the server should reject the operation with
> > NFS4ERR_OLD_STATEID.
> 
> According to RFC 5661, section 18.18.3, the DESCRIPTION of
> OPEN_DOWNGRADE says:
> 
>    The seqid argument is not used in NFSv4.1, MAY be any value, and
>    MUST be ignored by the server.
> 
> So even though the stateid passed carries a seqid (3) that is less
> than the seqids already returned for some OPEN operations (4 and 5),
> the server MUST still process the OPEN_DOWNGRADE, not give an error.

That refers to the explicit "seqid" parameter in struct
OPEN_DOWNGRADE4args. I'm referring to the seqid that constitutes the
first 4 bytes of the stateid itself, i.e. the mechanism described
here:

  https://tools.ietf.org/html/rfc5661#section-8.2.2

Note that the Linux client doesn't ever set the stateid seqid to zero
on CLOSE or OPEN_DOWNGRADE, precisely because we rely on the
NFS4ERR_OLD_STATEID mechanism to detect races with OPEN.
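For reference, the wire format of a stateid (stateid4 in RFC 5661;
comments mine) is:

   struct stateid4 {
           uint32_t        seqid;      /* incremented by the server
                                          each time the open state
                                          changes */
           opaque          other[12];  /* identifies the state itself */
   };

and the check that section 8.2.2 requires of the server amounts to
roughly the following. This is just a sketch of the rule, ignoring
seqid wraparound, and check_stateid_seqid() is a name I made up, not
code from any particular server:

   static nfsstat4 check_stateid_seqid(uint32_t arg_seqid,
                                       uint32_t cur_seqid)
   {
           if (arg_seqid == 0)        /* "use the current seqid" */
                   return NFS4_OK;
           if (arg_seqid < cur_seqid) /* request raced with a later
                                         OPEN et al. */
                   return NFS4ERR_OLD_STATEID;
           if (arg_seqid > cur_seqid) /* seqid from the future */
                   return NFS4ERR_BAD_STATEID;
           return NFS4_OK;
   }

Since we never send the zero seqid, an OPEN_DOWNGRADE that reaches the
server after the racing OPEN has bumped the seqid should fall into the
NFS4ERR_OLD_STATEID case.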
> > If, on the other hand, the CLOSE/OPEN_DOWNGRADE is processed before
> > the OPEN, then the client should either see a new stateid with a
> > new 'other' field (CLOSE) or it will see the same stateid but with
> > a seqid that does not match what it expects (OPEN_DOWNGRADE). In
> > the current mainline code, the client will then attempt to wait for
> > any "missing" seqids to get processed (see
> > nfs_set_open_stateid_locked()).
> 
> Hmm.. I don't have the waiting in my 4.4 code - I should probably
> make sure I understand that. However, the tcpdump trace shows that
> this scenario isn't what is happening, so it isn't immediately
> relevant.

True. The waiting mechanism was introduced by commit c9399f21c215
("NFSv4: Fix OPEN / CLOSE race") in v4.15. We should probably have
kicked that down the stable patch mechanism.
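The idea behind that commit is roughly the following. This is a much
simplified sketch only; seqid_gt() and wait_for_earlier_seqids() are
names I've made up for illustration, not the actual kernel helpers:

   /* Only accept a new open stateid if its seqid is the immediate
    * successor of the one we currently hold.  If seqids in between
    * are missing, assume their replies are still in flight and wait
    * (with a timeout) for them to be processed first, so that an OPEN
    * reply and a CLOSE/OPEN_DOWNGRADE reply arriving out of order are
    * still applied in seqid order.
    */
   static void update_open_stateid(struct nfs4_state *state,
                                   const nfs4_stateid *new)
   {
           while (seqid_gt(new->seqid, state->open_stateid.seqid + 1))
                   wait_for_earlier_seqids(state, new->seqid);

           if (seqid_gt(new->seqid, state->open_stateid.seqid))
                   state->open_stateid = *new;  /* newer: accept it */
           /* else: a reply that lost the race; don't regress */
   }

Cheers
  Trond
-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@xxxxxxxxxxxxxxx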