Hi Neil,

On Fri, 2019-02-22 at 16:02 +1100, NeilBrown wrote:
> On Fri, Feb 22 2019, Trond Myklebust wrote:
> 
> > Hi Neil
> > 
> > On Fri, 2019-02-22 at 11:58 +1100, NeilBrown wrote:
> > > Hi,
> > > I have a report of an NFSv4.1 state management problem in our 4.4
> > > kernel which appears to be caused by a race between OPEN and
> > > OPEN_DOWNGRADE, and I don't think the race is fixed in mainline.
> > > 
> > > The test program creates multiple threads which open a file O_RDWR
> > > and write to it, then opens it O_RDONLY and reads from it.
> > > A network trace which shows the problem suggests that, at a point
> > > in time where there are some O_RDONLY opens and one O_RDWR open, a
> > > new O_RDWR open is requested just as the existing O_RDWR open is
> > > being closed.
> > > 
> > > The close of the O_RDWR clears state->n_rdwr early, so
> > > can_open_cached() fails for the new O_RDWR open, and it needs to
> > > go to the server. The open for O_RDWR doesn't increment n_rdwr
> > > until after the open succeeds, so nfs4_close_prepare() sees
> > > 
> > >   n_rdwr == 0
> > >   n_rdonly > 0
> > >   NFS_O_RDWR_STATE and NFS_O_RDONLY_STATE set
> > > 
> > > which causes it to choose an OPEN_DOWNGRADE.
> > > 
> > > What we see is an OPEN/share-all and an OPEN_DOWNGRADE/share-read
> > > request sent one after the other without waiting for a reply.
> > > The OPEN is processed first, then the OPEN_DOWNGRADE, resulting in
> > > a state that only allows reads. Then a WRITE is attempted, which
> > > fails. This enters an infinite loop with two steps:
> > >  - a WRITE gets NFS4ERR_OPENMODE
> > >  - a TEST_STATEID succeeds
> > > 
> > > Once an OPEN/share-all request has been sent, it isn't really
> > > correct to send an OPEN_DOWNGRADE/share-read request. However, the
> > > fact that the OPEN has been sent isn't visible to
> > > nfs4_close_prepare().
> > > 
> > > There is an asymmetry between open and close w.r.t. updating the
> > > n_[mode] counter and setting the NFS_O_[mode]_STATE bits.
> > > 
> > > For close, the counter is decremented, then the server is told,
> > > then the state bits are cleared.
> > > For open, the counter and the state bits are both updated only
> > > after the server has replied.
> > > 
> > > I understand that this is probably because the OPEN could fail,
> > > and incrementing a counter before we are sure of success seems
> > > unwise. But doing so would allow us to avoid the incorrect
> > > OPEN_DOWNGRADE.
> > > 
> > > Any suggestions on what a good solution would look like? Does it
> > > ever make sense for an OPEN request to be concurrent with a CLOSE
> > > or OPEN_DOWNGRADE? Maybe they should be serialized with each other
> > > (maybe not as fully serialized as in NFSv4.0, but more than they
> > > currently are in NFSv4.1).
> 
> Hi Trond,
> thanks for the reply.
> 
> > If the CLOSE/OPEN_DOWNGRADE is processed by the server _after_ the
> > OPEN, then the stateid presented by the client will have an
> > incorrect seqid, and so the server should reject the operation with
> > NFS4ERR_OLD_STATEID.
> 
> According to RFC 5661, section 18.18.3, the DESCRIPTION of
> OPEN_DOWNGRADE says:
> 
>    The seqid argument is not used in NFSv4.1, MAY be any value, and
>    MUST be ignored by the server.
> 
> So even though the stateid passed carries a seqid (3) that is less
> than the seqids already returned for some OPEN operations (4 and 5),
> the server MUST still process the OPEN_DOWNGRADE, not give an error.

That refers to the explicit "seqid" parameter in struct
OPEN_DOWNGRADE4args. I'm referring to the seqid that constitutes the
first 4 bytes of the stateid itself, i.e. the mechanism described
here:

  https://tools.ietf.org/html/rfc5661#section-8.2.2

Note that the Linux client doesn't ever set the stateid seqid to zero
on CLOSE or OPEN_DOWNGRADE, precisely because we rely on the
NFS4ERR_OLD_STATEID mechanism to detect races with OPEN.
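For reference, the wire format of a stateid (stateid4 in RFC 5661;
comments mine) is:

   struct stateid4 {
           uint32_t        seqid;      /* incremented by the server
                                          each time the open state
                                          changes */
           opaque          other[12];  /* identifies the state itself */
   };

and the check that section 8.2.2 requires of the server amounts to
roughly the following. This is just a sketch of the rule, ignoring
seqid wraparound, and check_stateid_seqid() is a name I made up, not
code from any particular server:

   static nfsstat4 check_stateid_seqid(uint32_t arg_seqid,
                                       uint32_t cur_seqid)
   {
           if (arg_seqid == 0)        /* "use the current seqid" */
                   return NFS4_OK;
           if (arg_seqid < cur_seqid) /* request raced with a later
                                         OPEN et al. */
                   return NFS4ERR_OLD_STATEID;
           if (arg_seqid > cur_seqid) /* seqid from the future */
                   return NFS4ERR_BAD_STATEID;
           return NFS4_OK;
   }

Since we never send the zero seqid, an OPEN_DOWNGRADE that reaches the
server after the racing OPEN has bumped the seqid should fall into the
NFS4ERR_OLD_STATEID case.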
> > If, on the other hand, the CLOSE/OPEN_DOWNGRADE is processed before
> > the OPEN, then the client should either see a new stateid with a
> > new 'other' field (CLOSE) or it will see the same stateid but with
> > a seqid that does not match what it expects (OPEN_DOWNGRADE). In
> > the current mainline code, the client will then attempt to wait for
> > any "missing" seqids to get processed (see
> > nfs_set_open_stateid_locked()).
> 
> Hmm.. I don't have the waiting in my 4.4 code - I should probably
> make sure I understand that. However, the tcpdump trace shows that
> this scenario isn't what is happening, so it isn't immediately
> relevant.

True. The waiting mechanism was introduced by commit c9399f21c215
("NFSv4: Fix OPEN / CLOSE race") in v4.15. We should probably have
kicked that down the stable patch mechanism.
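The idea behind that commit is roughly the following. This is a much
simplified sketch only; seqid_gt() and wait_for_earlier_seqids() are
names I've made up for illustration, not the actual kernel helpers:

   /* Only accept a new open stateid if its seqid is the immediate
    * successor of the one we currently hold.  If seqids in between
    * are missing, assume their replies are still in flight and wait
    * (with a timeout) for them to be processed first, so that an OPEN
    * reply and a CLOSE/OPEN_DOWNGRADE reply arriving out of order are
    * still applied in seqid order.
    */
   static void update_open_stateid(struct nfs4_state *state,
                                   const nfs4_stateid *new)
   {
           while (seqid_gt(new->seqid, state->open_stateid.seqid + 1))
                   wait_for_earlier_seqids(state, new->seqid);

           if (seqid_gt(new->seqid, state->open_stateid.seqid))
                   state->open_stateid = *new;  /* newer: accept it */
           /* else: a reply that lost the race; don't regress */
   }

Cheers
  Trond
-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@xxxxxxxxxxxxxxx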