On 14 Feb 2023, at 15:30, Trond Myklebust wrote:

> On Tue, 2023-02-14 at 14:39 -0500, Benjamin Coddington wrote:
>> We've observed that if the NFS client experiences a network partition
>> and the server revokes the client's state, the client may not
>> revalidate cached data for an open file during recovery. If the file
>> is extended by a second client during this network partition, the
>> first client will correctly update the file's size and attributes
>> during recovery, but another extending write will discard the second
>> client's data.
>
> I'm having trouble fully understanding your problem description. Is the
> issue that both clients are opening the same file with something like
> O_WRONLY|O_DSYNC|O_APPEND?

Yes.

> If so, what if the network partition happens during the write() system
> call of client 1, so that the page cache is updated but the flush of
> the write data ends up being delayed by the partition?
>
> In that case, client 2 doesn't know that client 1 has writes
> outstanding, so it may write its data to where the server thinks the
> eof offset is. However, once client 1 is able to recover its open
> state, it will still have dirty page cache data that is going to
> overwrite that same offset.

Ah, yes. :(  In this case we might be safe saying that close-to-open
consistency is preserved, though, if the second client has closed the
file. At least client 1 will have the data laid out in the file the way
it expected.

>> In the case where another client opened the file during the network
>> partition and the server revoked the first client's state, the
>> recovery can forego optimizations and instead attempt to avoid
>> corruption.
>>
>> It's a little tricky to solve this in a per-file way during recovery
>> without plumbing arguments or introducing new flags. This patch
>> side-steps the per-file complexity by simply checking if the client
>> is within a NOGRACE recovery window, and if so, invalidates data
>> during the open recovery.
>
> I don't see how this scenario can ever be made fully safe. If people
> care, then we should probably have the open recovery of client 1 fail
> altogether in this case (subject to some switch similar to the existing
> 'recover_lost_locks' kernel option).

It's quite a corner case. I actually don't have a broken setup yet. We
noticed this behavior appear in a regression test now that knfsd will
both: 1) give out a read delegation for a RW open, and 2) forget client
state if a delegation callback times out. Otherwise, without the
delegation, the server doesn't send the client back through recovery
and the appending open/write happens at the correct eof.

Maybe that server behavior will trigger some real problem reports, and I
should look into making the server a little nicer to the client for a
short partition.

Ben
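
For illustration, the idea described above boils down to something like
the rough, untested sketch below. This is not the patch as posted, just
the shape of the check: the helper name nfs4_reclaim_maybe_invalidate()
is made up here, and it simply leans on the existing
NFS4CLNT_RECLAIM_NOGRACE state bit and nfs_zap_caches() to force
revalidation of cached data.

/* Illustrative sketch only -- not the posted patch. */

#include <linux/nfs_fs.h>	/* NFS_SERVER(), nfs_zap_caches() */
#include "nfs4_fs.h"		/* NFS4CLNT_RECLAIM_NOGRACE */

/*
 * Hypothetical helper: during open recovery, if the client is inside a
 * NOGRACE recovery window (the server revoked our state and offers no
 * grace period), drop the inode's cached data so it is revalidated
 * against the server before any further reads or extending writes.
 */
static void nfs4_reclaim_maybe_invalidate(struct inode *inode)
{
	struct nfs_client *clp = NFS_SERVER(inode)->nfs_client;

	if (test_bit(NFS4CLNT_RECLAIM_NOGRACE, &clp->cl_state))
		nfs_zap_caches(inode);
}

A real patch would presumably hook this into the open recovery path
itself rather than stand alone, and might invalidate more selectively
than nfs_zap_caches() does, but the NOGRACE-window check is the point
being discussed.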