Re: handling ERR_SERVERFAULT on RESTOREFH

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Wed, 2020-04-29 at 10:50 -0400, Tom Talpey wrote:
> On 4/28/2020 10:06 PM, Olga Kornievskaia wrote:
> > On Tue, Apr 28, 2020 at 7:42 PM Trond Myklebust <
> > trondmy@xxxxxxxxxxxxxxx> wrote:
> > > On Tue, 2020-04-28 at 19:02 -0400, Olga Kornievskaia wrote:
> > > > On Tue, Apr 28, 2020 at 5:32 PM Trond Myklebust <
> > > > trondmy@xxxxxxxxxxxxxxx> wrote:
> > > > > On Tue, 2020-04-28 at 16:40 -0400, Olga Kornievskaia wrote:
> > > > > > On Tue, Apr 28, 2020 at 2:47 PM Trond Myklebust <
> > > > > > trondmy@xxxxxxxxxxxxxxx> wrote:
> > > > > > > Hi Olga,
> > > > > > > 
> > > > > > > On Tue, 2020-04-28 at 14:14 -0400, Olga Kornievskaia
> > > > > > > wrote:
> > > > > > > > Hi folk,
> > > > > > > > 
> > > > > > > > Looking for guidance on what folks think. A client is
> > > > > > > > sending
> > > > > > > > a
> > > > > > > > LINK
> > > > > > > > operation to the server. This compound after the LINK
> > > > > > > > has
> > > > > > > > RESTOREFH
> > > > > > > > and GETATTR. Server returns SERVER_FAULT to on
> > > > > > > > RESTOREFH. But
> > > > > > > > LINK is
> > > > > > > > done successfully. Client still fails the system call
> > > > > > > > with
> > > > > > > > EIO.
> > > > > > > > We
> > > > > > > > have a hardline and "ln" saying hardlink failed.
> > > > > > > > 
> > > > > > > > Should the client not fail the system call in this
> > > > > > > > case? The
> > > > > > > > fact
> > > > > > > > that
> > > > > > > > we couldn't get up-to-date attributes don't seem like
> > > > > > > > the
> > > > > > > > reason
> > > > > > > > to
> > > > > > > > fail the system call?
> > > > > > > > 
> > > > > > > > Thank you.
> > > > > > > 
> > > > > > > I don't really see this as worth fixing on the client. It
> > > > > > > is
> > > > > > > very
> > > > > > > clearly a server bug.
> > > > > > 
> > > > > > Why is that a server bug? A server can legitimately have an
> > > > > > issue
> > > > > > trying to execute an operation (RESTOREFH) and legitimately
> > > > > > returning
> > > > > > an error.
> > > > > 
> > > > > If it is happening consistently on the server, then it is a
> > > > > bug,
> > > > > and it
> > > > > gets reported by the client in the same way we always report
> > > > > NFS4ERR_SERVERFAULT, by converting to an EREMOTEIO.
> > > > 
> > > > Yes but the client doesn't retry so it can't assess if it's
> > > > consistently happening or not. It can be a transient error (or
> > > > ENOMEM)
> > > > that's later resolved.
> > > 
> > > If the server wants to signal a transient error, it should send
> > > NFS4ERR_DELAY.
> > 
> > ERR_DELAY not an allowed error for the RESTOREFH. But let's say,
> > the
> > server does return it, then client is not following the spec
> > because
> > if it'll get this error, it will retry the whole compound (causing
> > a
> > different error of redoing a non-idempotent operation). The spec
> > says
> > client is responsible for handling partially completed compound.
> > The
> > client should only retry the failed operations in a compound, I
> > don't
> > see that client does that.
> > 
> > > > > > NFS client also ignores errors of the returning GETATTR
> > > > > > after the
> > > > > > RESTOREFH. So I'm not sure why we are then not ignoring
> > > > > > errors
> > > > > > (or
> > > > > > some errors) of the RESTOREFH.
> > > > > 
> > > > > We do need to check the value of RESTOREFH in order to figure
> > > > > out
> > > > > if we
> > > > > can continue reading the XDR buffer to decode the file
> > > > > attributes.
> > > > > We
> > > > > want to read those file attributes because we do expect the
> > > > > change
> > > > > attribute, the ctime and the nlinks values to all change as a
> > > > > result of
> > > > > the operation.
> > > > 
> > > > I have nothing against decoding the error and using it in a
> > > > decision
> > > > to keep decoding. But the client doesn't have to propagate the
> > > > RESTOREFH error to the application?
> > > > 
> > > > In all other non-idempotent operations that have other
> > > > operations (ie
> > > > GETATTR) following them, the client ignores the errors. Btw I
> > > > just
> > > > noticed that on OPEN compound, since we ignore decode error
> > > > from the
> > > > GETATTR, it would continue decoding LAYOUTGET...
> > > > 
> > > > CREATE has problem if the following GETFH will return EDELAY.
> > > > Client
> > > > doesn't deal with retrying a part of the compound. It retries
> > > > the
> > > > whole compound. It leads to an error (since non-idempotent
> > > > operation
> > > > is retried). But I guess that's a 2nd issue (or a 3rd if we
> > > > could the
> > > > decoding layoutget)....
> > > > 
> > > > All this is under the umbrella of how to handle errors on
> > > > non-idempotent operations in a compound....
> > > 
> > > There is no point in trying to handle errors that make no sense.
> > > If the
> > > server has a bug, then let's expose it instead of trying to hide
> > > it in
> > > the sofa cushions.
> > 
> > EDELAY on GETFH is a reasonable error for the server to return.
> 
> I don't disagree that this is a broken server behavior. But from the
> protocol perspective, I want to make two observations.
> 
> 1) The post-operation attributes are not atomic, therefore an
> attribute
> failure does not imply the operation was unsuccessful.
> 
> 2) The application did not necessarily request the attributes, this
> was inserted by the client, right? So again, their success or failure
> is not actually relevant to the application.
> 

Understood, however if a server vendor doesn't want to take
responsibility for their product bug, then I don't want the Linux
client to have to own their problem in perpetuity.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@xxxxxxxxxxxxxxx






[Index of Archives]     [Linux Filesystem Development]     [Linux USB Development]     [Linux Media Development]     [Video for Linux]     [Linux NILFS]     [Linux Audio Users]     [Yosemite Info]     [Linux SCSI]

  Powered by Linux