On Mon, Nov 27, 2023 at 8:51 AM Chuck Lever III <chuck.lever@xxxxxxxxxx> wrote:
>
> > On Nov 27, 2023, at 11:36 AM, Christoph Hellwig <hch@xxxxxxxxxxxxx> wrote:
> >
> > On Mon, Nov 27, 2023 at 03:28:16PM +0000, Tao Lyu wrote:
> >>
> >> O_APPEND | O_DIRECT can be used to bypass the client cache for
> >> multiple threads writing data without caring about the order
> >> (e.g., logs).
> >>
> >> Yes, to support O_APPEND | O_DIRECT, NFS must first support APPEND.
> >> But the key point is that it looks like NFS already supports O_APPEND.
> >> I can successfully open a file with "O_RDWR|O_APPEND".
> >>
> >> My confusion is why NFS supports O_RDWR and O_APPEND individually
> >> but does not support this combination.
>
> O_DIRECT is supposed to not depend on any cached information,
> including the file size, which the client needs to know to
> form an NFS WRITE with the correct offset to ensure it is an
> appending write.
>
> File sizes are managed on the server, so the server needs to
> know that the client is requesting an appending write so it
> knows where to put the payload.
>
> > Well, it does support O_RDWR|O_APPEND, just not with O_DIRECT?
> >
> > Btw, I think an APPEND operation in NFS would be a very good idea, and
> > I'd love to work with interested parties in the IETF on it.

It is not easy to deal with w.r.t. RPC retries. I suppose an NFSv4.2
extension that either requires (or strongly recommends) persistent
sessions might work?
(Persistent sessions should pretty well guarantee an RPC is not redone
on the server.)

> You can write and submit a personal draft that describes it; it
> wouldn't need to be more than a few pages. The hard part of that
> would be accumulating use case descriptions.
>
> I think you could create a proof of concept by including a VERIFY
> operation in front of the WRITE to ensure the WRITE occurs only
> if the offset argument in the WRITE agrees with the file's size
> on the server.
> If the VERIFY fails, the client grabs the updated
> file size and tries again.

This is what the FreeBSD NFSv4 client does. Since compounds are not
atomic, it is not guaranteed to work, and you might get a lot of
"tries again" if multiple clients were doing appends on the same
file concurrently.
(The compound includes a GETATTR of the size before the VERIFY, so
trying again is pretty straightforward.)

rick

> > Note that
> > we (Damien to be specific) plan to add support to Linux to also report
> > the actual offset an O_APPEND write wrote to through io_uring as we
> > have various use cases for out of place write data stores for that.
> > It would be great to also support that programming model over NFS.
>
> --
> Chuck Lever
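
For anyone who wants to see the retry logic discussed above spelled out, here is a minimal Python sketch of the GETATTR + VERIFY + WRITE compound and the client-side "try again" loop. It is only a toy model under stated assumptions: ToyServer, compound() and append() are hypothetical names invented for illustration, the "server" is an in-memory buffer, and none of this is taken from the FreeBSD or Linux client code. NFS4ERR_NOT_SAME is the status VERIFY returns on a mismatch.

```python
NFS4_OK = "NFS4_OK"
NFS4ERR_NOT_SAME = "NFS4ERR_NOT_SAME"  # what VERIFY returns on a mismatch


class ToyServer:
    """Hypothetical in-memory stand-in for an NFS server holding one file."""

    def __init__(self, initial=b""):
        self.data = bytearray(initial)

    def compound(self, expected_size, payload):
        """Model GETATTR(size); VERIFY(size == expected_size); WRITE.

        Returns (status, size_seen_by_GETATTR). Processing stops at the
        first failing operation, as in a real compound.
        """
        size = len(self.data)              # GETATTR: current file size
        if size != expected_size:          # VERIFY: client's offset is stale
            return NFS4ERR_NOT_SAME, size
        # A real server does not execute a compound atomically, so another
        # client's WRITE could land here. This toy model ignores that race.
        self.data[expected_size:] = payload  # WRITE at the verified offset
        return NFS4_OK, size


def append(server, payload, cached_size=0, max_tries=5):
    """Emulate an appending write; returns (offset_written, tries_used)."""
    offset = cached_size
    for attempt in range(1, max_tries + 1):
        status, size = server.compound(offset, payload)
        if status == NFS4_OK:
            return offset, attempt
        offset = size  # reuse the size GETATTR returned and try again
    raise RuntimeError("gave up after %d tries" % max_tries)


if __name__ == "__main__":
    srv = ToyServer(b"abc")
    # The client starts with a stale size of 0, so the first try fails.
    print(append(srv, b"xyz", cached_size=0), bytes(srv.data))
```

With several clients appending concurrently, each failed VERIFY costs a full round trip, which is where the "lot of tries again" comes from; max_tries is just an arbitrary cap chosen for the sketch.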