On Mon, 2019-08-26 at 21:13 -0400, bfields@xxxxxxxxxxxx wrote:
> On Tue, Aug 27, 2019 at 12:56:07AM +0000, Trond Myklebust wrote:
> > On Mon, 2019-08-26 at 20:48 -0400, bfields@xxxxxxxxxxxx wrote:
> > > On Mon, Aug 26, 2019 at 09:02:31PM +0000, Trond Myklebust wrote:
> > > > On Mon, 2019-08-26 at 16:51 -0400, J. Bruce Fields wrote:
> > > > > On Mon, Aug 26, 2019 at 12:50:18PM -0400, Trond Myklebust wrote:
> > > > > > Note that if multiple clients were writing to the same file,
> > > > > > then we probably want to bump the boot verifier anyway, since
> > > > > > only one COMMIT will see the error report (because the cached
> > > > > > file is also shared).
> > > > >
> > > > > I'm confused by the "probably should".  So that's future work?
> > > > > I guess it'd mean some additional work to identify that case.
> > > > > You can't really even distinguish clients in the NFSv3 case, but
> > > > > I suppose you could use IP address or TCP connection as an
> > > > > approximation.
> > > >
> > > > I'm suggesting we should do this too, but I haven't done so yet in
> > > > these patches. I'd like to hear other opinions (particularly from
> > > > you, Chuck and Jeff).
> > >
> > > Does this process actually converge, or do we end up with all the
> > > clients retrying the writes and, again, only one of them getting the
> > > error?
> >
> > The client that gets the error should stop retrying if the error is
> > fatal.
>
> Have clients historically been good about that?  I just wonder whether
> it's a concern that boot-verifier-bumping could magnify the impact of
> clients that are overly persistent about retrying IO errors.
>

Clients have always been required to handle I/O errors, yes, and this
isn't just a Linux server thing. All other servers that support
unstable writes impose the same requirement on the client to check the
return value of COMMIT and to handle any errors.

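To make the client-side requirement above concrete, here is a minimal
sketch in Python. It is not taken from the Linux client or from any
patch in this series; the names (CommitResult, PendingWrites,
handle_commit) and the choice of which errors count as fatal are
assumptions made purely for illustration.

```python
import errno
from dataclasses import dataclass

# Assumption: ENOSPC and EIO are treated as fatal here; a real client
# may classify errors differently.
FATAL_ERRORS = {errno.ENOSPC, errno.EIO}


@dataclass
class CommitResult:
    """Hypothetical view of a COMMIT reply."""
    status: int       # 0 on success, otherwise an errno-style code
    verifier: bytes   # write (boot) verifier returned by the server


@dataclass
class PendingWrites:
    """Hypothetical record of unstable writes awaiting COMMIT."""
    verifier: bytes   # verifier seen in the unstable WRITE replies
    ranges: list      # (offset, length) pairs not yet known to be stable


def handle_commit(pending: PendingWrites, result: CommitResult):
    """Decide what the client does next; returns (action, ranges)."""
    if result.status != 0:
        if result.status in FATAL_ERRORS:
            # Fatal: report the error to the application, stop retrying.
            return ("fail", [])
        # Non-fatal (by this sketch's policy): the COMMIT may be retried.
        return ("retry_commit", [])
    if result.verifier != pending.verifier:
        # Verifier changed: the server may have lost the unstable data,
        # so every uncommitted range has to be written again.
        return ("resend", list(pending.ranges))
    # Success with a matching verifier: the data is on stable storage.
    return ("done", [])


pending = PendingWrites(verifier=b"boot-1", ranges=[(0, 4096)])
print(handle_commit(pending, CommitResult(0, b"boot-1")))
print(handle_commit(pending, CommitResult(0, b"boot-2")))
print(handle_commit(pending, CommitResult(errno.ENOSPC, b"boot-1")))
```

In this sketch a changed verifier makes the client resend its
uncommitted ranges, while a fatal error ends the retry loop, which is
the behaviour that lets the resend cycle terminate.
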
> > > I wonder what the typical errors are, anyway.
> >
> > I would expect ENOSPC, and EIO to be the most common. The former if
> > delayed allocation and/or snapshots result in writes failing after
> > writing to the page cache. The latter if we hit a disk outage or
> > other such problem.
>
> Makes sense.
>
> --b.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@xxxxxxxxxxxxxxx