On Tue, Apr 28, 2020 at 7:56 PM Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote: > > On Tue, 2020-04-28 at 19:41 -0400, Trond Myklebust wrote: > > On Tue, 2020-04-28 at 19:02 -0400, Olga Kornievskaia wrote: > > > On Tue, Apr 28, 2020 at 5:32 PM Trond Myklebust < > > > trondmy@xxxxxxxxxxxxxxx> wrote: > > > > On Tue, 2020-04-28 at 16:40 -0400, Olga Kornievskaia wrote: > > > > > On Tue, Apr 28, 2020 at 2:47 PM Trond Myklebust < > > > > > trondmy@xxxxxxxxxxxxxxx> wrote: > > > > > > Hi Olga, > > > > > > > > > > > > On Tue, 2020-04-28 at 14:14 -0400, Olga Kornievskaia wrote: > > > > > > > Hi folk, > > > > > > > > > > > > > > Looking for guidance on what folks think. A client is > > > > > > > sending > > > > > > > a > > > > > > > LINK > > > > > > > operation to the server. This compound after the LINK has > > > > > > > RESTOREFH > > > > > > > and GETATTR. Server returns SERVER_FAULT to on RESTOREFH. > > > > > > > But > > > > > > > LINK is > > > > > > > done successfully. Client still fails the system call with > > > > > > > EIO. > > > > > > > We > > > > > > > have a hardline and "ln" saying hardlink failed. > > > > > > > > > > > > > > Should the client not fail the system call in this case? > > > > > > > The > > > > > > > fact > > > > > > > that > > > > > > > we couldn't get up-to-date attributes don't seem like the > > > > > > > reason > > > > > > > to > > > > > > > fail the system call? > > > > > > > > > > > > > > Thank you. > > > > > > > > > > > > I don't really see this as worth fixing on the client. It is > > > > > > very > > > > > > clearly a server bug. > > > > > > > > > > Why is that a server bug? A server can legitimately have an > > > > > issue > > > > > trying to execute an operation (RESTOREFH) and legitimately > > > > > returning > > > > > an error. > > > > > > > > If it is happening consistently on the server, then it is a bug, > > > > and it > > > > gets reported by the client in the same way we always report > > > > NFS4ERR_SERVERFAULT, by converting to an EREMOTEIO. > > > > > > Yes but the client doesn't retry so it can't assess if it's > > > consistently happening or not. It can be a transient error (or > > > ENOMEM) > > > that's later resolved. > > > > If the server wants to signal a transient error, it should send > > NFS4ERR_DELAY. > > > > > > > NFS client also ignores errors of the returning GETATTR after > > > > > the > > > > > RESTOREFH. So I'm not sure why we are then not ignoring errors > > > > > (or > > > > > some errors) of the RESTOREFH. > > > > > > > > We do need to check the value of RESTOREFH in order to figure out > > > > if we > > > > can continue reading the XDR buffer to decode the file > > > > attributes. > > > > We > > > > want to read those file attributes because we do expect the > > > > change > > > > attribute, the ctime and the nlinks values to all change as a > > > > result of > > > > the operation. > > > > > > I have nothing against decoding the error and using it in a > > > decision > > > to keep decoding. But the client doesn't have to propagate the > > > RESTOREFH error to the application? > > > > > > In all other non-idempotent operations that have other operations > > > (ie > > > GETATTR) following them, the client ignores the errors. Btw I just > > > noticed that on OPEN compound, since we ignore decode error from > > > the > > > GETATTR, it would continue decoding LAYOUTGET... > > > > > > CREATE has problem if the following GETFH will return EDELAY. > > > Client > > > doesn't deal with retrying a part of the compound. It retries the > > > whole compound. It leads to an error (since non-idempotent > > > operation > > > is retried). But I guess that's a 2nd issue (or a 3rd if we could > > > the > > > decoding layoutget).... > > > > > > All this is under the umbrella of how to handle errors on > > > non-idempotent operations in a compound.... > > > > There is no point in trying to handle errors that make no sense. If > > the > > server has a bug, then let's expose it instead of trying to hide it > > in > > the sofa cushions. > > It basically boils down to this: > * I do not want to have to maintain a list of server bugs in the > client. I also believe that client shouldn't be coded to a broken server. But in some of those cases, the client is not spec compliant, how is that a server bug? The case of SERVERFAULT of RESTOREFH I'm not sure what to make of it. I think it's more of a spec failure to address. It seems that server isn't allowed to fail after executing a non-idempotent operation but that's a hard requirement. I still think that client's best set of action is to ignore errors on RESTOREFH. > * If I make a client change, I don't want to have to consider whether > or not it causes a regression against some server that only 10 > people are still using, and that never got a fix for a bug because > "the clients handle it". > > We should fix client bugs when it is clear that they are client bugs. > We should push back hard on server bugs because the client is already a > complex enough beast. > > -- > Trond Myklebust > Linux NFS client maintainer, Hammerspace > trond.myklebust@xxxxxxxxxxxxxxx > >