On Fri, Feb 22 2019, Trond Myklebust wrote:
> Hi Neil,
>
> On Fri, 2019-02-22 at 16:02 +1100, NeilBrown wrote:
>> On Fri, Feb 22 2019, Trond Myklebust wrote:
>>
>> > Hi Neil
>> >
>> > On Fri, 2019-02-22 at 11:58 +1100, NeilBrown wrote:
>> > > Hi,
>> > > I have a report of an NFSv4.1 state management problem in our 4.4
>> > > kernel which appears to be caused by a race between OPEN and
>> > > OPEN_DOWNGRADE, and I don't think the race is fixed in mainline.
>> > >
>> > > The test program creates multiple threads which open a file O_RDWR
>> > > and write to it, then open it O_RDONLY and read from it.
>> > > A network trace which shows the problem suggests that at a point in
>> > > time where there are some O_RDONLY opens and one O_RDWR open, a new
>> > > O_RDWR open is requested just as the O_RDWR open is being closed.
>> > >
>> > > The close of the O_RDWR open clears state->n_rdwr early, so
>> > > can_open_cached() fails for the new O_RDWR open, and it needs to go
>> > > to the server.
>> > > The open for O_RDWR doesn't increment n_rdwr until after the open
>> > > succeeds, so nfs4_close_prepare() sees
>> > >     n_rdwr == 0
>> > >     n_rdonly > 0
>> > >     NFS_O_RDWR_STATE and NFS_O_RDONLY_STATE set
>> > > which causes it to choose an OPEN_DOWNGRADE.
>> > >
>> > > What we see is an OPEN/share-all and an OPEN_DOWNGRADE/share-read
>> > > request sent one after the other without waiting for a reply.
>> > > The OPEN is processed first, then the OPEN_DOWNGRADE, resulting in
>> > > a state that only allows reads.  Then a WRITE is attempted, which
>> > > fails.  This enters an infinite loop with 2 steps:
>> > >  - a WRITE gets NFS4ERR_OPENMODE
>> > >  - a TEST_STATEID succeeds
>> > >
>> > > Once an OPEN/share-all request has been sent, it isn't really
>> > > correct to send an OPEN_DOWNGRADE/share-read request.  However the
>> > > fact that the OPEN has been sent isn't visible to
>> > > nfs4_close_prepare().
>> > >
>> > > There is an asymmetry between open and close w.r.t. updating the
>> > > n_[mode] counter and setting the NFS_O_mode_STATE bits.
>> > >
>> > > For close, the counter is decremented, then the server is told,
>> > > then the state bits are cleared.
>> > > For open, the counter and state bits are both updated after the
>> > > server has been asked.
>> > >
>> > > I understand that this is probably because the OPEN could fail, and
>> > > incrementing a counter before we are sure of success seems unwise.
>> > > But doing so would allow us to avoid the incorrect OPEN_DOWNGRADE.
>> > >
>> > > Any suggestions on what a good solution would look like?  Does it
>> > > ever make sense for an OPEN request to be concurrent with a CLOSE
>> > > or OPEN_DOWNGRADE?  Maybe they should be serialized with each other
>> > > (maybe not as fully serialized as in NFSv4.0, but more than they
>> > > currently are in NFSv4.1).
>>
>> Hi Trond,
>>  thanks for the reply.
>>
>> > If the CLOSE/OPEN_DOWNGRADE is processed by the server _after_ the
>> > OPEN, then the stateid presented by the client will have an incorrect
>> > seqid, and so the server should reject the operation with
>> > NFS4ERR_OLD_STATEID.
>>
>> According to RFC5661, section 18.18.3, the DESCRIPTION of
>> OPEN_DOWNGRADE says:
>>
>>    The seqid argument is not used in NFSv4.1, MAY be any value, and
>>    MUST be ignored by the server.
>>
>> So even though the stateid passed (3) is less than the stateids
>> already returned for some OPEN operations (4 and 5), the server MUST
>> still process the OPEN_DOWNGRADE, not give an error.
>
> That refers to the explicit "seqid" parameter in struct
> OPEN_DOWNGRADE4args.  I'm referring to the seqid that constitutes the
> first 4 bytes of the stateid.
>
> In other words, I'm referring to the mechanism described here:
> https://tools.ietf.org/html/rfc5661#section-8.2.2
>
> Note that the Linux client doesn't ever set the stateid seqid to zero
> on CLOSE or OPEN_DOWNGRADE precisely because we rely on the
> NFS4ERR_OLD_STATEID mechanism to detect races with OPEN.

Ahhh, I had missed that there were two seqids in OPEN_DOWNGRADE, one in
the stateid and one not.  Makes sense now.

With that perspective, it is pretty clear that the server is
misbehaving.  The OPEN_DOWNGRADE presents a seqid of 3, and after
having already replied to two OPEN requests with seqids of 4 and 5,
the server processes that OPEN_DOWNGRADE without error and returns a
seqid of 6.
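
To make the check concrete (for anyone else who, like me, missed the
distinction), here is a rough user-space sketch of the seqid handling
described in RFC 5661, section 8.2.2.  The stateid4 layout (a 4-byte
seqid followed by a 12-byte "other" field) comes from the RFC; the
function name and result codes are made up for illustration, and the
code ignores XDR byte order and seqid wrap-around.

#include <stdint.h>
#include <string.h>

/* Result codes for illustration only; the real NFS4ERR_* values are
 * defined by the protocol. */
enum seqid_check { SEQID_OK, SEQID_OLD_STATEID, SEQID_BAD_STATEID };

struct stateid4 {
	uint32_t seqid;      /* bumped by the server each time this state changes */
	char     other[12];  /* identifies the state itself; fixed for its lifetime */
};

/* The check a server is expected to make before acting on the stateid
 * in an OPEN_DOWNGRADE (or CLOSE, WRITE, ...). */
static enum seqid_check check_stateid(const struct stateid4 *cur,
				      const struct stateid4 *arg)
{
	if (memcmp(cur->other, arg->other, sizeof(cur->other)) != 0)
		return SEQID_BAD_STATEID;
	if (arg->seqid == 0)		/* zero means "don't check the seqid" */
		return SEQID_OK;
	if (arg->seqid < cur->seqid)	/* e.g. 3 arriving after 4 and 5 */
		return SEQID_OLD_STATEID;
	if (arg->seqid > cur->seqid)	/* a seqid the server never issued */
		return SEQID_BAD_STATEID;
	return SEQID_OK;
}

In the trace above, the server's current seqid for the state is 5 when
the OPEN_DOWNGRADE carrying seqid 3 arrives, so it should have returned
NFS4ERR_OLD_STATEID rather than applying the downgrade and replying
with seqid 6.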

Thanks a lot,
NeilBrown

>
>>
>> > If, on the other hand, the CLOSE/OPEN_DOWNGRADE is processed before
>> > the OPEN, then the client should either see a new stateid with a new
>> > 'other' field (CLOSE) or it will see the same stateid but with a
>> > seqid that does not match what it expects (OPEN_DOWNGRADE).  In the
>> > current mainline code, the client will then attempt to wait for any
>> > "missing" seqids to get processed (see nfs_set_open_stateid_locked()).
>>
>> Hmm.. I don't have the waiting in my 4.4 code - I should probably make
>> sure I understand that.  However the tcpdump trace shows that this
>> scenario isn't what is happening, so it isn't immediately relevant.
>>
>
> True. The waiting mechanism was introduced by commit c9399f21c215
> ("NFSv4: Fix OPEN / CLOSE race") in v4.15. We should probably have
> kicked that down the stable patch mechanism.
>
> Cheers
>   Trond
> --
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> trond.myklebust@xxxxxxxxxxxxxxx
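
PS: for my own notes, the "wait for missing seqids" idea mentioned
above, as I understand the description - this is a user-space sketch
with made-up names, not the code from commit c9399f21c215, and it
ignores locking and seqid wrap-around.  The idea is: if a reply
carries a stateid whose seqid is more than one ahead of the one we
currently hold, some other reply (OPEN, CLOSE or OPEN_DOWNGRADE) for
the same state is still in flight, so wait a bounded time for the gap
to be filled before applying the update, and drop it if it turns out
to be stale.

#include <stdbool.h>
#include <stdint.h>

struct open_state {
	uint32_t cur_seqid;	/* seqid of the open stateid we currently hold */
};

/* Apply a stateid update only once it is the next one expected.
 * wait_for_other_replies() stands in for sleeping until another reply
 * updates cur_seqid; it returns false on timeout. */
static bool update_open_stateid(struct open_state *st, uint32_t new_seqid,
				bool (*wait_for_other_replies)(void))
{
	while (new_seqid != st->cur_seqid + 1) {
		if (new_seqid <= st->cur_seqid)
			return false;		/* stale reply - ignore it */
		if (!wait_for_other_replies())
			break;			/* timed out - apply anyway */
	}
	st->cur_seqid = new_seqid;
	return true;
}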