On Wed, 2020-04-29 at 10:50 -0400, Tom Talpey wrote: > On 4/28/2020 10:06 PM, Olga Kornievskaia wrote: > > On Tue, Apr 28, 2020 at 7:42 PM Trond Myklebust < > > trondmy@xxxxxxxxxxxxxxx> wrote: > > > On Tue, 2020-04-28 at 19:02 -0400, Olga Kornievskaia wrote: > > > > On Tue, Apr 28, 2020 at 5:32 PM Trond Myklebust < > > > > trondmy@xxxxxxxxxxxxxxx> wrote: > > > > > On Tue, 2020-04-28 at 16:40 -0400, Olga Kornievskaia wrote: > > > > > > On Tue, Apr 28, 2020 at 2:47 PM Trond Myklebust < > > > > > > trondmy@xxxxxxxxxxxxxxx> wrote: > > > > > > > Hi Olga, > > > > > > > > > > > > > > On Tue, 2020-04-28 at 14:14 -0400, Olga Kornievskaia > > > > > > > wrote: > > > > > > > > Hi folk, > > > > > > > > > > > > > > > > Looking for guidance on what folks think. A client is > > > > > > > > sending > > > > > > > > a > > > > > > > > LINK > > > > > > > > operation to the server. This compound after the LINK > > > > > > > > has > > > > > > > > RESTOREFH > > > > > > > > and GETATTR. Server returns SERVER_FAULT to on > > > > > > > > RESTOREFH. But > > > > > > > > LINK is > > > > > > > > done successfully. Client still fails the system call > > > > > > > > with > > > > > > > > EIO. > > > > > > > > We > > > > > > > > have a hardline and "ln" saying hardlink failed. > > > > > > > > > > > > > > > > Should the client not fail the system call in this > > > > > > > > case? The > > > > > > > > fact > > > > > > > > that > > > > > > > > we couldn't get up-to-date attributes don't seem like > > > > > > > > the > > > > > > > > reason > > > > > > > > to > > > > > > > > fail the system call? > > > > > > > > > > > > > > > > Thank you. > > > > > > > > > > > > > > I don't really see this as worth fixing on the client. It > > > > > > > is > > > > > > > very > > > > > > > clearly a server bug. > > > > > > > > > > > > Why is that a server bug? A server can legitimately have an > > > > > > issue > > > > > > trying to execute an operation (RESTOREFH) and legitimately > > > > > > returning > > > > > > an error. > > > > > > > > > > If it is happening consistently on the server, then it is a > > > > > bug, > > > > > and it > > > > > gets reported by the client in the same way we always report > > > > > NFS4ERR_SERVERFAULT, by converting to an EREMOTEIO. > > > > > > > > Yes but the client doesn't retry so it can't assess if it's > > > > consistently happening or not. It can be a transient error (or > > > > ENOMEM) > > > > that's later resolved. > > > > > > If the server wants to signal a transient error, it should send > > > NFS4ERR_DELAY. > > > > ERR_DELAY not an allowed error for the RESTOREFH. But let's say, > > the > > server does return it, then client is not following the spec > > because > > if it'll get this error, it will retry the whole compound (causing > > a > > different error of redoing a non-idempotent operation). The spec > > says > > client is responsible for handling partially completed compound. > > The > > client should only retry the failed operations in a compound, I > > don't > > see that client does that. > > > > > > > > NFS client also ignores errors of the returning GETATTR > > > > > > after the > > > > > > RESTOREFH. So I'm not sure why we are then not ignoring > > > > > > errors > > > > > > (or > > > > > > some errors) of the RESTOREFH. > > > > > > > > > > We do need to check the value of RESTOREFH in order to figure > > > > > out > > > > > if we > > > > > can continue reading the XDR buffer to decode the file > > > > > attributes. > > > > > We > > > > > want to read those file attributes because we do expect the > > > > > change > > > > > attribute, the ctime and the nlinks values to all change as a > > > > > result of > > > > > the operation. > > > > > > > > I have nothing against decoding the error and using it in a > > > > decision > > > > to keep decoding. But the client doesn't have to propagate the > > > > RESTOREFH error to the application? > > > > > > > > In all other non-idempotent operations that have other > > > > operations (ie > > > > GETATTR) following them, the client ignores the errors. Btw I > > > > just > > > > noticed that on OPEN compound, since we ignore decode error > > > > from the > > > > GETATTR, it would continue decoding LAYOUTGET... > > > > > > > > CREATE has problem if the following GETFH will return EDELAY. > > > > Client > > > > doesn't deal with retrying a part of the compound. It retries > > > > the > > > > whole compound. It leads to an error (since non-idempotent > > > > operation > > > > is retried). But I guess that's a 2nd issue (or a 3rd if we > > > > could the > > > > decoding layoutget).... > > > > > > > > All this is under the umbrella of how to handle errors on > > > > non-idempotent operations in a compound.... > > > > > > There is no point in trying to handle errors that make no sense. > > > If the > > > server has a bug, then let's expose it instead of trying to hide > > > it in > > > the sofa cushions. > > > > EDELAY on GETFH is a reasonable error for the server to return. > > I don't disagree that this is a broken server behavior. But from the > protocol perspective, I want to make two observations. > > 1) The post-operation attributes are not atomic, therefore an > attribute > failure does not imply the operation was unsuccessful. > > 2) The application did not necessarily request the attributes, this > was inserted by the client, right? So again, their success or failure > is not actually relevant to the application. > Understood, however if a server vendor doesn't want to take responsibility for their product bug, then I don't want the Linux client to have to own their problem in perpetuity. -- Trond Myklebust Linux NFS client maintainer, Hammerspace trond.myklebust@xxxxxxxxxxxxxxx