On Tue, Aug 27, 2019 at 12:56:07AM +0000, Trond Myklebust wrote:
> On Mon, 2019-08-26 at 20:48 -0400, bfields@xxxxxxxxxxxx wrote:
> > On Mon, Aug 26, 2019 at 09:02:31PM +0000, Trond Myklebust wrote:
> > > On Mon, 2019-08-26 at 16:51 -0400, J. Bruce Fields wrote:
> > > > On Mon, Aug 26, 2019 at 12:50:18PM -0400, Trond Myklebust wrote:
> > > > > Note that if multiple clients were writing to the same file,
> > > > > then we probably want to bump the boot verifier anyway, since
> > > > > only one COMMIT will see the error report (because the cached
> > > > > file is also shared).
> > > >
> > > > I'm confused by the "probably should". So that's future work?
> > > > I guess it'd mean some additional work to identify that case.
> > > > You can't really even distinguish clients in the NFSv3 case, but
> > > > I suppose you could use IP address or TCP connection as an
> > > > approximation.
> > >
> > > I'm suggesting we should do this too, but I haven't done so yet in
> > > these patches. I'd like to hear other opinions (particularly from
> > > you, Chuck and Jeff).
> >
> > Does this process actually converge, or do we end up with all the
> > clients retrying the writes and, again, only one of them getting the
> > error?
>
> The client that gets the error should stop retrying if the error is
> fatal.

Have clients historically been good about that? I just wonder whether
it's a concern that boot-verifier-bumping could magnify the impact of
clients that are overly persistent about retrying IO errors.

> > I wonder what the typical errors are, anyway.
>
> I would expect ENOSPC, and EIO to be the most common. The former if
> delayed allocation and/or snapshots result in writes failing after
> writing to the page cache. The latter if we hit a disk outage or other
> such problem.

Makes sense.

--b.