Re: [PATCH 0/3] Handling NFSv3 I/O errors in knfsd

Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> · Tue, 27 Aug 2019 14:53:01 +0000

On Tue, 2019-08-27 at 09:59 -0400, Chuck Lever wrote:
> > On Aug 26, 2019, at 5:02 PM, Trond Myklebust <
> > trondmy@xxxxxxxxxxxxxxx> wrote:
> > 
> > On Mon, 2019-08-26 at 16:51 -0400, J. Bruce Fields wrote:
> > > On Mon, Aug 26, 2019 at 12:50:18PM -0400, Trond Myklebust wrote:
> > > > Recently, a number of changes went into the kernel to try to
> > > > ensure
> > > > that I/O errors (specifically write errors) are reported to the
> > > > application once and only once. The vehicle for ensuring the
> > > > errors
> > > > are reported is the struct file, which uses the 'f_wb_err'
> > > > field to
> > > > track which errors have been reported.
> > > > 
> > > > The problem is that errors are mainly intended to be reported
> > > > through
> > > > fsync(). If the client is doing synchronous writes, then all is
> > > > well,
> > > > but if it is doing unstable writes, then the errors may not be
> > > > reported until the client calls COMMIT. If the file cache has
> > > > thrown out the struct file, due to memory pressure, or just
> > > > because
> > > > the client took a long while between the last WRITE and the
> > > > COMMIT,
> > > > then the error report may be lost, and the client may just
> > > > think
> > > > its data is safely stored.
> > > 
> > > These were lost before the file caching patches as well,
> > > right?  Or
> > > is
> > > there some regression? 
> > 
> > Correct. This is not a regression, but an attempt to fix a problem
> > that
> > has existed for some time now.
> > 
> > > > Note that the problem is compounded by the fact that NFSv3 is
> > > > stateless,
> > > > so the server never knows that the client may have rebooted, so
> > > > there
> > > > can be no guarantee that a COMMIT will ever be sent.
> > > > 
> > > > The following patch set attempts to remedy the situation using
> > > > 2
> > > > strategies:
> > > > 
> > > > 1) If the inode is dirty, then avoid garbage collecting the
> > > > file
> > > >   from the file cache.
> > > > 2) If the file is closed, and we see that it would have
> > > > reported
> > > >   an error to COMMIT, then we bump the boot verifier in order
> > > > to
> > > >   ensure the client retransmits all its writes.
> > > 
> > > Sounds sensible to me.
> > > 
> > > > Note that if multiple clients were writing to the same file,
> > > > then
> > > > we probably want to bump the boot verifier anyway, since only
> > > > one
> > > > COMMIT will see the error report (because the cached file is
> > > > also
> > > > shared).
> > > 
> > > I'm confused by the "probably should".  So that's future work?  I
> > > guess
> > > it'd mean some additional work to identify that case.  You can't
> > > really
> > > even distinguish clients in the NFSv3 case, but I suppose you
> > > could
> > > use
> > > IP address or TCP connection as an approximation.
> > 
> > I'm suggesting we should do this too, but I haven't done so yet in
> > these patches. I'd like to hear other opinions (particularly from
> > you,
> > Chuck and Jeff).
> 
> The strategy of handling these errors more carefully seems good.
> Bumping the write/commit verifier so the client writes again to
> retrieve the latent error is clever!
> 
> It's not clear to me though that the NFSv3 protocol can deal with
> the multi-client write scenario, since it is stateless. We are now
> making it stateful in some sense by preserving error state on the
> server across NFS requests, without having any sense of an open
> file in the protocol itself.
> 
> Would an "approximation" without open state be good enough? I
> assume you are doing this to more fully support the FlexFiles
> layout type. Do you have any analysis or thought about this next
> step?

No, see above. This has nothing to do with the flex files layout type.
It is about ensuring the integrity of _all_ unstable writes. Any client
can hit writeback errors that only manifest themselves when the data is
flushed to disk.

> I also echo Bruce's concern about whether the client implementations
> are up to snuff. There could be long-standing bugs or their protocol
> implementation could be missing parts. This is more curiosity than
> an objection, but maybe noting which client implementations you've
> tested with would be good.

So what exactly is the concern? How do you and Bruce expect this might
make things worse?

As far as I can see, if a client doesn't handle errors in COMMIT, then
it is screwed either way, since we can already end up reporting such
errors today through the fsync() mechanism. If the client ignores the
error, it ends up silently corrupting the file. If it retries the
writes as unstable, then it loops until the error is cleared. The
current patchset does not affect either case for this single client:
the declaration of a reboot just ensures that silent corruption does
not occur, and if the client handles the error correctly once it
catches it, then great, otherwise no extra harm occurs.

If we want to fix the case of multiple writers (i.e. the case where
someone is using NLM locking to mediate between writers, or they are
using O_DIRECT and some external synchronisation mechanism) then we
need to do more (hence the proposal that we extend the current patches
to also declare a reboot on any failure of COMMIT).
In that case, the broken client, that loops forever, is going to cause
contention with all the other clients anyway, since it will never
complete its I/O (or it will silently corrupt the file if it doesn't
see the error). By declaring a reboot we don't make this case worse as
far as I can see: we still end up looping forever, but the file doesn't
get silently corrupted.
The one problem is that the looping forever client can cause other
clients to loop forever on their otherwise successful writes on other
files. That's bad, but again, that's due to client behaviour that is
toxic even today.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@xxxxxxxxxxxxxxx