Re: [PATCH 0/3] Handling NFSv3 I/O errors in knfsd

Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> · Tue, 27 Aug 2019 01:28:15 +0000

On Mon, 2019-08-26 at 21:13 -0400, bfields@xxxxxxxxxxxx wrote:
> On Tue, Aug 27, 2019 at 12:56:07AM +0000, Trond Myklebust wrote:
> > On Mon, 2019-08-26 at 20:48 -0400, bfields@xxxxxxxxxxxx wrote:
> > > On Mon, Aug 26, 2019 at 09:02:31PM +0000, Trond Myklebust wrote:
> > > > On Mon, 2019-08-26 at 16:51 -0400, J. Bruce Fields wrote:
> > > > > On Mon, Aug 26, 2019 at 12:50:18PM -0400, Trond Myklebust
> > > > > wrote:
> > > > > > Note that if multiple clients were writing to the same
> > > > > > file,
> > > > > > then we probably want to bump the boot verifier anyway,
> > > > > > since
> > > > > > only one COMMIT will see the error report (because the
> > > > > > cached
> > > > > > file is also shared).
> > > > > 
> > > > > I'm confused by the "probably should".  So that's future
> > > > > work?
> > > > > I guess it'd mean some additional work to identify that case.
> > > > > You can't really even distinguish clients in the NFSv3 case,
> > > > > but
> > > > > I suppose you could use IP address or TCP connection as an
> > > > > approximation.
> > > > 
> > > > I'm suggesting we should do this too, but I haven't done so yet
> > > > in
> > > > these patches. I'd like to hear other opinions (particularly
> > > > from
> > > > you, Chuck and Jeff).
> > > 
> > > Does this process actually converge, or do we end up with all the
> > > clients retrying the writes and, again, only one of them getting
> > > the
> > > error?
> > 
> > The client that gets the error should stop retrying if the error is
> > fatal.
> 
> Have clients historically been good about that?  I just wonder
> whether
> it's a concern that boot-verifier-bumping could magnify the impact of
> clients that are overly persistent about retrying IO errors.
> 

Clients have always been required to handle I/O errors, yes, and this
isn't just a Linux server thing. All other servers that support
unstable writes impose the same requirement on the client to check the
return value of COMMIT and to handle any errors.

> > > I wonder what the typical errors are, anyway.
> > 
> > I would expect ENOSPC, and EIO to be the most common. The former if
> > delayed allocation and/or snapshots result in writes failing after
> > writing to the page cache. The latter if we hit a disk outage or
> > other
> > such problem.
> 
> Makes sense.
> 
> --b.
-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@xxxxxxxxxxxxxxx