On Fri, Jul 31, 2020 at 1:44 PM Damien Le Moal <Damien.LeMoal@xxxxxxx> wrote:
>
> On 2020/07/31 16:59, Kanchan Joshi wrote:
> > On Fri, Jul 31, 2020 at 12:29 PM Damien Le Moal <Damien.LeMoal@xxxxxxx> wrote:
> >>
> >> On 2020/07/31 15:45, hch@xxxxxxxxxxxxx wrote:
> >>> On Fri, Jul 31, 2020 at 06:42:10AM +0000, Damien Le Moal wrote:
> >>>>> - We may not be able to use RWF_APPEND, and need exposing a new
> >>>>> type/flag (RWF_INDIRECT_OFFSET etc.) user-space. Not sure if this
> >>>>> sounds outrageous, but is it OK to have uring-only flag which can be
> >>>>> combined with RWF_APPEND?
> >>>>
> >>>> Why ? Where is the problem ? O_APPEND/RWF_APPEND is currently meaningless for
> >>>> raw block device accesses. We could certainly define a meaning for these in the
> >>>> context of zoned block devices.
> >>>
> >>> We can't just add a meaning for O_APPEND on block devices now,
> >>> as it was previously silently ignored. I also really don't think any
> >>> of these semantics even fit the block device to start with. If you
> >>> want to work on raw zones use zonefs, that's what is exists for.
> >>
> >> Which is fine with me. Just trying to say that I think this is exactly the
> >> discussion we need to start with. What interface do we implement...
> >>
> >> Allowing zone append only through zonefs as the raw block device equivalent, all
> >> the O_APPEND/RWF_APPEND semantic is defined and the "return written offset"
> >> implementation in VFS would be common for all file systems, including regular
> >> ones. Beside that, there is I think the question of short writes... Not sure if
> >> short writes can currently happen with async RWF_APPEND writes to regular files.
> >> I think not but that may depend on the FS.
> >
> > generic_write_check_limits (called by generic_write_checks, used by
> > most FS) may make it short, and AFAIK it does not depend on
> > async/sync.
>
> Johannes has a patch (not posted yet) fixing all this for zonefs,
> differentiating sync and async cases, allow short writes or not, etc. This was
> done by not using generic_write_check_limits() and instead writing a
> zonefs_check_write() function that is zone append friendly.
>
> We can post that as a base for the discussion on semantic if you want...

There is no problem with how to do it. That part is simple: we have
the iocb, and sync/async can be known from whether the ki_complete
callback is set.
The point to be discussed was whether to allow short writes or not if
we are talking about a generic file-append-returning-location.

That said, since we are talking about moving to indirect-offset in
io-uring, short-write is not an issue anymore I suppose (it goes back
to how it was).
But the unsettled thing is whether we can use O_APPEND/RWF_APPEND with
the indirect-offset (pointer) scheme.

> > This was one of the reason why we chose to isolate the operation by a
> > different IOCB flag and not by IOCB_APPEND alone.
>
> For zonefs, the plan is:
> * For the sync write case, zone append is always used.
> * For the async write case, if we see IOCB_APPEND, then zone append BIOs are
>   used. If not, regular write BIOs are used.
>
> Simple enough I think. No need for a new flag.

Maybe simple if we only think of ZoneFS (how user-space sends
async-append and gets the result is a common problem).
Add Block I/O in scope: it gets slightly more complicated because it
has to cater to non-zoned devices. And there is already a
well-established understanding that append does nothing, so code like
"if (flags & IOCB_APPEND) { do something; }" in the block I/O path may
surprise someone resuming after a hiatus.
Add File I/O in scope: it gets further complicated. I think it would
make sense to make it opt-in rather than compulsory, but most file
systems already implement a behavior for IOCB_APPEND. How do we make it
opt-in without new flags?
New flags (FMODE_SOME_NAME, IOCB_SOME_NAME) serve that purpose.
Please assess the need (for isolation) considering all three cases.
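
To make the isolation argument concrete, here is a rough user-space
model of the opt-in idea, not kernel code. Everything in it is an
assumption made for illustration: the MODEL_* flag values, the struct
layouts, and the model_*() helpers are invented, and zonefs is lumped
together with regular files for the existing IOCB_APPEND behavior. It
only tries to show that IOCB_APPEND alone keeps its current meaning on
each target, while the append-returning-location behavior is selected
only when both the file and the request carry the new flags.

```c
#include <assert.h>
#include <stddef.h>

/* All names and values here are illustrative stand-ins, not kernel definitions. */
#define MODEL_IOCB_APPEND      (1u << 0) /* models IOCB_APPEND */
#define MODEL_IOCB_SOME_NAME   (1u << 1) /* hypothetical opt-in iocb flag */
#define MODEL_FMODE_SOME_NAME  (1u << 0) /* hypothetical per-file opt-in bit */

enum model_target { MODEL_BLOCK_DEV, MODEL_REGULAR_FILE, MODEL_ZONEFS };

enum model_op {
	MODEL_WRITE_AT_POS,      /* plain write at the given offset */
	MODEL_APPEND_AT_EOF,     /* existing IOCB_APPEND behavior on files */
	MODEL_APPEND_RETURN_LOC, /* append and report the written location */
};

struct model_file {
	enum model_target target;
	unsigned int f_mode;
};

struct model_kiocb {
	const struct model_file *ki_filp;
	unsigned int ki_flags;
	void (*ki_complete)(struct model_kiocb *iocb, long res);
};

/* Sync vs. async is known from whether a completion callback is set. */
static inline int model_is_sync(const struct model_kiocb *iocb)
{
	return iocb->ki_complete == NULL;
}

/* What the append flag alone means today, per target (untouched by the opt-in). */
static inline enum model_op model_legacy_op(const struct model_kiocb *iocb)
{
	if (!(iocb->ki_flags & MODEL_IOCB_APPEND))
		return MODEL_WRITE_AT_POS;
	/* Block devices have silently ignored append so far. */
	if (iocb->ki_filp->target == MODEL_BLOCK_DEV)
		return MODEL_WRITE_AT_POS;
	/* Simplification: zonefs is treated like a regular file here. */
	return MODEL_APPEND_AT_EOF;
}

/* New behavior only when both the file and the request opt in. */
static inline enum model_op model_pick_op(const struct model_kiocb *iocb)
{
	assert(iocb != NULL && iocb->ki_filp != NULL);
	if ((iocb->ki_filp->f_mode & MODEL_FMODE_SOME_NAME) &&
	    (iocb->ki_flags & MODEL_IOCB_SOME_NAME))
		return MODEL_APPEND_RETURN_LOC;
	return model_legacy_op(iocb);
}
```

In this model, a block device request with only MODEL_IOCB_APPEND still
resolves to a plain positional write, so nobody reading the block I/O
path is surprised, and a regular file that never set
MODEL_FMODE_SOME_NAME keeps its append-at-EOF behavior even if the
request carries the new iocb flag.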