Re: handling ERR_SERVERFAULT on RESTOREFH

Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> · Tue, 28 Apr 2020 23:56:16 +0000

On Tue, 2020-04-28 at 19:41 -0400, Trond Myklebust wrote:
> On Tue, 2020-04-28 at 19:02 -0400, Olga Kornievskaia wrote:
> > On Tue, Apr 28, 2020 at 5:32 PM Trond Myklebust <
> > trondmy@xxxxxxxxxxxxxxx> wrote:
> > > On Tue, 2020-04-28 at 16:40 -0400, Olga Kornievskaia wrote:
> > > > On Tue, Apr 28, 2020 at 2:47 PM Trond Myklebust <
> > > > trondmy@xxxxxxxxxxxxxxx> wrote:
> > > > > Hi Olga,
> > > > > 
> > > > > On Tue, 2020-04-28 at 14:14 -0400, Olga Kornievskaia wrote:
> > > > > > Hi folk,
> > > > > > 
> > > > > > Looking for guidance on what folks think. A client is
> > > > > > sending
> > > > > > a
> > > > > > LINK
> > > > > > operation to the server. This compound after the LINK has
> > > > > > RESTOREFH
> > > > > > and GETATTR. Server returns SERVER_FAULT to on RESTOREFH.
> > > > > > But
> > > > > > LINK is
> > > > > > done successfully. Client still fails the system call with
> > > > > > EIO.
> > > > > > We
> > > > > > have a hardline and "ln" saying hardlink failed.
> > > > > > 
> > > > > > Should the client not fail the system call in this case?
> > > > > > The
> > > > > > fact
> > > > > > that
> > > > > > we couldn't get up-to-date attributes don't seem like the
> > > > > > reason
> > > > > > to
> > > > > > fail the system call?
> > > > > > 
> > > > > > Thank you.
> > > > > 
> > > > > I don't really see this as worth fixing on the client. It is
> > > > > very
> > > > > clearly a server bug.
> > > > 
> > > > Why is that a server bug? A server can legitimately have an
> > > > issue
> > > > trying to execute an operation (RESTOREFH) and legitimately
> > > > returning
> > > > an error.
> > > 
> > > If it is happening consistently on the server, then it is a bug,
> > > and it
> > > gets reported by the client in the same way we always report
> > > NFS4ERR_SERVERFAULT, by converting to an EREMOTEIO.
> > 
> > Yes but the client doesn't retry so it can't assess if it's
> > consistently happening or not. It can be a transient error (or
> > ENOMEM)
> > that's later resolved.
> 
> If the server wants to signal a transient error, it should send
> NFS4ERR_DELAY.
> 
> > > > NFS client also ignores errors of the returning GETATTR after
> > > > the
> > > > RESTOREFH. So I'm not sure why we are then not ignoring errors
> > > > (or
> > > > some errors) of the RESTOREFH.
> > > 
> > > We do need to check the value of RESTOREFH in order to figure out
> > > if we
> > > can continue reading the XDR buffer to decode the file
> > > attributes.
> > > We
> > > want to read those file attributes because we do expect the
> > > change
> > > attribute, the ctime and the nlinks values to all change as a
> > > result of
> > > the operation.
> > 
> > I have nothing against decoding the error and using it in a
> > decision
> > to keep decoding. But the client doesn't have to propagate the
> > RESTOREFH error to the application?
> > 
> > In all other non-idempotent operations that have other operations
> > (ie
> > GETATTR) following them, the client ignores the errors. Btw I just
> > noticed that on OPEN compound, since we ignore decode error from
> > the
> > GETATTR, it would continue decoding LAYOUTGET...
> > 
> > CREATE has problem if the following GETFH will return EDELAY.
> > Client
> > doesn't deal with retrying a part of the compound. It retries the
> > whole compound. It leads to an error (since non-idempotent
> > operation
> > is retried). But I guess that's a 2nd issue (or a 3rd if we could
> > the
> > decoding layoutget)....
> > 
> > All this is under the umbrella of how to handle errors on
> > non-idempotent operations in a compound....
> 
> There is no point in trying to handle errors that make no sense. If
> the
> server has a bug, then let's expose it instead of trying to hide it
> in
> the sofa cushions.

It basically boils down to this:
 * I do not want to have to maintain a list of server bugs in the
   client.
 * If I make a client change, I don't want to have to consider whether
   or not it causes a regression against some server that only 10
   people are still using, and that never got a fix for a bug because
   "the clients handle it".

We should fix client bugs when it is clear that they are client bugs.
We should push back hard on server bugs because the client is already a
complex enough beast.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@xxxxxxxxxxxxxxx