On Tue, 2020-04-28 at 19:02 -0400, Olga Kornievskaia wrote: > On Tue, Apr 28, 2020 at 5:32 PM Trond Myklebust < > trondmy@xxxxxxxxxxxxxxx> wrote: > > On Tue, 2020-04-28 at 16:40 -0400, Olga Kornievskaia wrote: > > > On Tue, Apr 28, 2020 at 2:47 PM Trond Myklebust < > > > trondmy@xxxxxxxxxxxxxxx> wrote: > > > > Hi Olga, > > > > > > > > On Tue, 2020-04-28 at 14:14 -0400, Olga Kornievskaia wrote: > > > > > Hi folk, > > > > > > > > > > Looking for guidance on what folks think. A client is sending > > > > > a > > > > > LINK > > > > > operation to the server. This compound after the LINK has > > > > > RESTOREFH > > > > > and GETATTR. Server returns SERVER_FAULT to on RESTOREFH. But > > > > > LINK is > > > > > done successfully. Client still fails the system call with > > > > > EIO. > > > > > We > > > > > have a hardline and "ln" saying hardlink failed. > > > > > > > > > > Should the client not fail the system call in this case? The > > > > > fact > > > > > that > > > > > we couldn't get up-to-date attributes don't seem like the > > > > > reason > > > > > to > > > > > fail the system call? > > > > > > > > > > Thank you. > > > > > > > > I don't really see this as worth fixing on the client. It is > > > > very > > > > clearly a server bug. > > > > > > Why is that a server bug? A server can legitimately have an issue > > > trying to execute an operation (RESTOREFH) and legitimately > > > returning > > > an error. > > > > If it is happening consistently on the server, then it is a bug, > > and it > > gets reported by the client in the same way we always report > > NFS4ERR_SERVERFAULT, by converting to an EREMOTEIO. > > Yes but the client doesn't retry so it can't assess if it's > consistently happening or not. It can be a transient error (or > ENOMEM) > that's later resolved. If the server wants to signal a transient error, it should send NFS4ERR_DELAY. > > > NFS client also ignores errors of the returning GETATTR after the > > > RESTOREFH. So I'm not sure why we are then not ignoring errors > > > (or > > > some errors) of the RESTOREFH. > > > > We do need to check the value of RESTOREFH in order to figure out > > if we > > can continue reading the XDR buffer to decode the file attributes. > > We > > want to read those file attributes because we do expect the change > > attribute, the ctime and the nlinks values to all change as a > > result of > > the operation. > > I have nothing against decoding the error and using it in a decision > to keep decoding. But the client doesn't have to propagate the > RESTOREFH error to the application? > > In all other non-idempotent operations that have other operations (ie > GETATTR) following them, the client ignores the errors. Btw I just > noticed that on OPEN compound, since we ignore decode error from the > GETATTR, it would continue decoding LAYOUTGET... > > CREATE has problem if the following GETFH will return EDELAY. Client > doesn't deal with retrying a part of the compound. It retries the > whole compound. It leads to an error (since non-idempotent operation > is retried). But I guess that's a 2nd issue (or a 3rd if we could the > decoding layoutget).... > > All this is under the umbrella of how to handle errors on > non-idempotent operations in a compound.... There is no point in trying to handle errors that make no sense. If the server has a bug, then let's expose it instead of trying to hide it in the sofa cushions. -- Trond Myklebust Linux NFS client maintainer, Hammerspace trond.myklebust@xxxxxxxxxxxxxxx