Re: Question about O_APPEND | O_DIRECT

Rick Macklem <rick.macklem@xxxxxxxxx> · Mon, 27 Nov 2023 17:50:49 -0800



On Mon, Nov 27, 2023 at 8:51 AM Chuck Lever III <chuck.lever@xxxxxxxxxx> wrote:
>
>
> > On Nov 27, 2023, at 11:36 AM, Christoph Hellwig <hch@xxxxxxxxxxxxx> wrote:
> >
> > On Mon, Nov 27, 2023 at 03:28:16PM +0000, Tao Lyu wrote:
> >>
> >> O_APPEND | O_DIRECT can be used to bypass the client cache for multiple threads writing data without caring of the orders (e.g., logs).
> >>
> >> Yes, to support O_APPEND | O_DIRECT, NFS must first support APPEND.
> >> But the key point is that looks like NFS has supported O_APPEND already.
> >> I can successfully open a file with "O_RDWR|O_APPEND".
> >>
> >> My confusion is why NFS supports O_RDWR and O_APPEND individually but does not support this combination.
>
> O_DIRECT is supposed to not depend on any cached information,
> including the file size, which the client needs to know to
> form an NFS WRITE with the correct offset to ensure it is an
> appending write.
>
> File sizes are managed on the server, so the server needs to
> know that the client is requesting an appending write so it
> knows where to put the payload.
>
>
> > Well, it does support O_RDWR|O_APPEND, just not with O_DIRECT?
> >
> > Btw, I think an APPEND operation in NFS would be a very good idea, and
> > I'd love to work with interested parties in the IETF on it.
It is not easy to deal with w.r.t. RPC retries.
I suppose a NFSv4.2 extension that either requires (or strongly
recommends) persistent sessions might work?
(Persistent sessions should pretty well guarantee an RPC is not
redone on the server.)

>
> You can write and submit a personal draft that describes it; it
> wouldn't need to be more than a few pages. The hard part of that
> would be accumulating use case descriptions.
>
> I think you could create a proof of concept by including a VERIFY
> operation in front of the WRITE to ensure the WRITE occurs only
> if the offset argument in the WRITE agrees with the file's size
> on the server. If the VERIFY fails, the client grabs the updated
> file size and tries again.
This is what the FreeBSD NFSv4 client does.
Since compounds are not atomic, it is not guaranteed to work and
you might get a lot of "tries again" if multiple clients were doing the
appends on the same file concurrently. (The compound includes a
GETTTR size before the VERIFY, so trying again is pretty straightforward.)

rick

>
>
> > Not that
> > we (Damien to be specific) plan to add support to Linux to also report
> > the actual offset an O_APPEND write wrote to through io_uring as we
> > have varios use cases for out of place write data stores for that.
> > It would be great to also support that programming model over NFS.
>
> --
> Chuck Lever
>
>