Re: handling ERR_SERVERFAULT on RESTOREFH

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Tue, Apr 28, 2020 at 7:56 PM Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote:
>
> On Tue, 2020-04-28 at 19:41 -0400, Trond Myklebust wrote:
> > On Tue, 2020-04-28 at 19:02 -0400, Olga Kornievskaia wrote:
> > > On Tue, Apr 28, 2020 at 5:32 PM Trond Myklebust <
> > > trondmy@xxxxxxxxxxxxxxx> wrote:
> > > > On Tue, 2020-04-28 at 16:40 -0400, Olga Kornievskaia wrote:
> > > > > On Tue, Apr 28, 2020 at 2:47 PM Trond Myklebust <
> > > > > trondmy@xxxxxxxxxxxxxxx> wrote:
> > > > > > Hi Olga,
> > > > > >
> > > > > > On Tue, 2020-04-28 at 14:14 -0400, Olga Kornievskaia wrote:
> > > > > > > Hi folk,
> > > > > > >
> > > > > > > Looking for guidance on what folks think. A client is
> > > > > > > sending
> > > > > > > a
> > > > > > > LINK
> > > > > > > operation to the server. This compound after the LINK has
> > > > > > > RESTOREFH
> > > > > > > and GETATTR. Server returns SERVER_FAULT to on RESTOREFH.
> > > > > > > But
> > > > > > > LINK is
> > > > > > > done successfully. Client still fails the system call with
> > > > > > > EIO.
> > > > > > > We
> > > > > > > have a hardline and "ln" saying hardlink failed.
> > > > > > >
> > > > > > > Should the client not fail the system call in this case?
> > > > > > > The
> > > > > > > fact
> > > > > > > that
> > > > > > > we couldn't get up-to-date attributes don't seem like the
> > > > > > > reason
> > > > > > > to
> > > > > > > fail the system call?
> > > > > > >
> > > > > > > Thank you.
> > > > > >
> > > > > > I don't really see this as worth fixing on the client. It is
> > > > > > very
> > > > > > clearly a server bug.
> > > > >
> > > > > Why is that a server bug? A server can legitimately have an
> > > > > issue
> > > > > trying to execute an operation (RESTOREFH) and legitimately
> > > > > returning
> > > > > an error.
> > > >
> > > > If it is happening consistently on the server, then it is a bug,
> > > > and it
> > > > gets reported by the client in the same way we always report
> > > > NFS4ERR_SERVERFAULT, by converting to an EREMOTEIO.
> > >
> > > Yes but the client doesn't retry so it can't assess if it's
> > > consistently happening or not. It can be a transient error (or
> > > ENOMEM)
> > > that's later resolved.
> >
> > If the server wants to signal a transient error, it should send
> > NFS4ERR_DELAY.
> >
> > > > > NFS client also ignores errors of the returning GETATTR after
> > > > > the
> > > > > RESTOREFH. So I'm not sure why we are then not ignoring errors
> > > > > (or
> > > > > some errors) of the RESTOREFH.
> > > >
> > > > We do need to check the value of RESTOREFH in order to figure out
> > > > if we
> > > > can continue reading the XDR buffer to decode the file
> > > > attributes.
> > > > We
> > > > want to read those file attributes because we do expect the
> > > > change
> > > > attribute, the ctime and the nlinks values to all change as a
> > > > result of
> > > > the operation.
> > >
> > > I have nothing against decoding the error and using it in a
> > > decision
> > > to keep decoding. But the client doesn't have to propagate the
> > > RESTOREFH error to the application?
> > >
> > > In all other non-idempotent operations that have other operations
> > > (ie
> > > GETATTR) following them, the client ignores the errors. Btw I just
> > > noticed that on OPEN compound, since we ignore decode error from
> > > the
> > > GETATTR, it would continue decoding LAYOUTGET...
> > >
> > > CREATE has problem if the following GETFH will return EDELAY.
> > > Client
> > > doesn't deal with retrying a part of the compound. It retries the
> > > whole compound. It leads to an error (since non-idempotent
> > > operation
> > > is retried). But I guess that's a 2nd issue (or a 3rd if we could
> > > the
> > > decoding layoutget)....
> > >
> > > All this is under the umbrella of how to handle errors on
> > > non-idempotent operations in a compound....
> >
> > There is no point in trying to handle errors that make no sense. If
> > the
> > server has a bug, then let's expose it instead of trying to hide it
> > in
> > the sofa cushions.
>
> It basically boils down to this:
>  * I do not want to have to maintain a list of server bugs in the
>    client.

I also believe that client shouldn't be coded to a broken server. But
in some of those cases, the client is not spec compliant, how is that
a server bug? The case of SERVERFAULT of RESTOREFH I'm not sure what
to make of it. I think it's more of a spec failure to address. It
seems that server isn't allowed to fail after executing a
non-idempotent operation but that's a hard requirement. I still think
that client's best set of action is to ignore errors on RESTOREFH.

>  * If I make a client change, I don't want to have to consider whether
>    or not it causes a regression against some server that only 10
>    people are still using, and that never got a fix for a bug because
>    "the clients handle it".
>
> We should fix client bugs when it is clear that they are client bugs.
> We should push back hard on server bugs because the client is already a
> complex enough beast.
>
> --
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> trond.myklebust@xxxxxxxxxxxxxxx
>
>



[Index of Archives]     [Linux Filesystem Development]     [Linux USB Development]     [Linux Media Development]     [Video for Linux]     [Linux NILFS]     [Linux Audio Users]     [Yosemite Info]     [Linux SCSI]

  Powered by Linux