On 4/28/2020 10:06 PM, Olga Kornievskaia wrote:
On Tue, Apr 28, 2020 at 7:42 PM Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote:
On Tue, 2020-04-28 at 19:02 -0400, Olga Kornievskaia wrote:
On Tue, Apr 28, 2020 at 5:32 PM Trond Myklebust <
trondmy@xxxxxxxxxxxxxxx> wrote:
On Tue, 2020-04-28 at 16:40 -0400, Olga Kornievskaia wrote:
On Tue, Apr 28, 2020 at 2:47 PM Trond Myklebust <
trondmy@xxxxxxxxxxxxxxx> wrote:
Hi Olga,
On Tue, 2020-04-28 at 14:14 -0400, Olga Kornievskaia wrote:
Hi folk,
Looking for guidance on what folks think. A client is sending
a
LINK
operation to the server. This compound after the LINK has
RESTOREFH
and GETATTR. Server returns SERVER_FAULT to on RESTOREFH. But
LINK is
done successfully. Client still fails the system call with
EIO.
We
have a hardline and "ln" saying hardlink failed.
Should the client not fail the system call in this case? The
fact
that
we couldn't get up-to-date attributes don't seem like the
reason
to
fail the system call?
Thank you.
I don't really see this as worth fixing on the client. It is
very
clearly a server bug.
Why is that a server bug? A server can legitimately have an issue
trying to execute an operation (RESTOREFH) and legitimately
returning
an error.
If it is happening consistently on the server, then it is a bug,
and it
gets reported by the client in the same way we always report
NFS4ERR_SERVERFAULT, by converting to an EREMOTEIO.
Yes but the client doesn't retry so it can't assess if it's
consistently happening or not. It can be a transient error (or
ENOMEM)
that's later resolved.
If the server wants to signal a transient error, it should send
NFS4ERR_DELAY.
ERR_DELAY not an allowed error for the RESTOREFH. But let's say, the
server does return it, then client is not following the spec because
if it'll get this error, it will retry the whole compound (causing a
different error of redoing a non-idempotent operation). The spec says
client is responsible for handling partially completed compound. The
client should only retry the failed operations in a compound, I don't
see that client does that.
NFS client also ignores errors of the returning GETATTR after the
RESTOREFH. So I'm not sure why we are then not ignoring errors
(or
some errors) of the RESTOREFH.
We do need to check the value of RESTOREFH in order to figure out
if we
can continue reading the XDR buffer to decode the file attributes.
We
want to read those file attributes because we do expect the change
attribute, the ctime and the nlinks values to all change as a
result of
the operation.
I have nothing against decoding the error and using it in a decision
to keep decoding. But the client doesn't have to propagate the
RESTOREFH error to the application?
In all other non-idempotent operations that have other operations (ie
GETATTR) following them, the client ignores the errors. Btw I just
noticed that on OPEN compound, since we ignore decode error from the
GETATTR, it would continue decoding LAYOUTGET...
CREATE has problem if the following GETFH will return EDELAY. Client
doesn't deal with retrying a part of the compound. It retries the
whole compound. It leads to an error (since non-idempotent operation
is retried). But I guess that's a 2nd issue (or a 3rd if we could the
decoding layoutget)....
All this is under the umbrella of how to handle errors on
non-idempotent operations in a compound....
There is no point in trying to handle errors that make no sense. If the
server has a bug, then let's expose it instead of trying to hide it in
the sofa cushions.
EDELAY on GETFH is a reasonable error for the server to return.
I don't disagree that this is a broken server behavior. But from the
protocol perspective, I want to make two observations.
1) The post-operation attributes are not atomic, therefore an attribute
failure does not imply the operation was unsuccessful.
2) The application did not necessarily request the attributes, this
was inserted by the client, right? So again, their success or failure
is not actually relevant to the application.
Tom.