Re: handling ERR_SERVERFAULT on RESTOREFH

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Tue, 2020-04-28 at 19:02 -0400, Olga Kornievskaia wrote:
> On Tue, Apr 28, 2020 at 5:32 PM Trond Myklebust <
> trondmy@xxxxxxxxxxxxxxx> wrote:
> > On Tue, 2020-04-28 at 16:40 -0400, Olga Kornievskaia wrote:
> > > On Tue, Apr 28, 2020 at 2:47 PM Trond Myklebust <
> > > trondmy@xxxxxxxxxxxxxxx> wrote:
> > > > Hi Olga,
> > > > 
> > > > On Tue, 2020-04-28 at 14:14 -0400, Olga Kornievskaia wrote:
> > > > > Hi folk,
> > > > > 
> > > > > Looking for guidance on what folks think. A client is sending
> > > > > a
> > > > > LINK
> > > > > operation to the server. This compound after the LINK has
> > > > > RESTOREFH
> > > > > and GETATTR. Server returns SERVER_FAULT to on RESTOREFH. But
> > > > > LINK is
> > > > > done successfully. Client still fails the system call with
> > > > > EIO.
> > > > > We
> > > > > have a hardline and "ln" saying hardlink failed.
> > > > > 
> > > > > Should the client not fail the system call in this case? The
> > > > > fact
> > > > > that
> > > > > we couldn't get up-to-date attributes don't seem like the
> > > > > reason
> > > > > to
> > > > > fail the system call?
> > > > > 
> > > > > Thank you.
> > > > 
> > > > I don't really see this as worth fixing on the client. It is
> > > > very
> > > > clearly a server bug.
> > > 
> > > Why is that a server bug? A server can legitimately have an issue
> > > trying to execute an operation (RESTOREFH) and legitimately
> > > returning
> > > an error.
> > 
> > If it is happening consistently on the server, then it is a bug,
> > and it
> > gets reported by the client in the same way we always report
> > NFS4ERR_SERVERFAULT, by converting to an EREMOTEIO.
> 
> Yes but the client doesn't retry so it can't assess if it's
> consistently happening or not. It can be a transient error (or
> ENOMEM)
> that's later resolved.

If the server wants to signal a transient error, it should send
NFS4ERR_DELAY.

> > > NFS client also ignores errors of the returning GETATTR after the
> > > RESTOREFH. So I'm not sure why we are then not ignoring errors
> > > (or
> > > some errors) of the RESTOREFH.
> > 
> > We do need to check the value of RESTOREFH in order to figure out
> > if we
> > can continue reading the XDR buffer to decode the file attributes.
> > We
> > want to read those file attributes because we do expect the change
> > attribute, the ctime and the nlinks values to all change as a
> > result of
> > the operation.
> 
> I have nothing against decoding the error and using it in a decision
> to keep decoding. But the client doesn't have to propagate the
> RESTOREFH error to the application?
> 
> In all other non-idempotent operations that have other operations (ie
> GETATTR) following them, the client ignores the errors. Btw I just
> noticed that on OPEN compound, since we ignore decode error from the
> GETATTR, it would continue decoding LAYOUTGET...
> 
> CREATE has problem if the following GETFH will return EDELAY. Client
> doesn't deal with retrying a part of the compound. It retries the
> whole compound. It leads to an error (since non-idempotent operation
> is retried). But I guess that's a 2nd issue (or a 3rd if we could the
> decoding layoutget)....
> 
> All this is under the umbrella of how to handle errors on
> non-idempotent operations in a compound....

There is no point in trying to handle errors that make no sense. If the
server has a bug, then let's expose it instead of trying to hide it in
the sofa cushions.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@xxxxxxxxxxxxxxx






[Index of Archives]     [Linux Filesystem Development]     [Linux USB Development]     [Linux Media Development]     [Video for Linux]     [Linux NILFS]     [Linux Audio Users]     [Yosemite Info]     [Linux SCSI]

  Powered by Linux