On Tue, Nov 10, 2020 at 08:26:05PM +0900, Naohiro Aota wrote: > A ZONE_APPEND bio must follow hardware restrictions (e.g. not exceeding > max_zone_append_sectors) not to be split. bio_iov_iter_get_pages builds > such restricted bio using __bio_iov_append_get_pages if bio_op(bio) == > REQ_OP_ZONE_APPEND. > > To utilize it, we need to set the bio_op before calling > bio_iov_iter_get_pages(). This commit introduces IOMAP_F_ZONE_APPEND, so > that iomap user can set the flag to indicate they want REQ_OP_ZONE_APPEND > and restricted bio. > > Signed-off-by: Naohiro Aota <naohiro.aota@xxxxxxx> > --- > fs/iomap/direct-io.c | 41 +++++++++++++++++++++++++++++++++++------ > include/linux/iomap.h | 1 + > 2 files changed, 36 insertions(+), 6 deletions(-) > > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c > index c1aafb2ab990..f04572a55a09 100644 > --- a/fs/iomap/direct-io.c > +++ b/fs/iomap/direct-io.c > @@ -200,6 +200,34 @@ iomap_dio_zero(struct iomap_dio *dio, struct iomap *iomap, loff_t pos, > iomap_dio_submit_bio(dio, iomap, bio, pos); > } > > +/* > + * Figure out the bio's operation flags from the dio request, the > + * mapping, and whether or not we want FUA. Note that we can end up > + * clearing the WRITE_FUA flag in the dio request. > + */ > +static inline unsigned int > +iomap_dio_bio_opflags(struct iomap_dio *dio, struct iomap *iomap, bool use_fua) Hmm, just to check my understanding of what iomap has to do to support all this: When we're wanting to use a ZONE_APPEND command, the @iomap structure has to have IOMAP_F_ZONE_APPEND set in iomap->flags, iomap->type is set to IOMAP_MAPPED, but what should iomap->addr be set to? I gather from what I see in zonefs and the relevant NVME proposal that iomap->addr should be set to the (byte) address of the zone we want to append to? And if we do that, then bio->bi_iter.bi_sector will be set to sector address of iomap->addr, right? (I got lost trying to figure out how btrfs sets ->addr for appends.) Then when the IO completes, the block layer sets bio->bi_iter.bi_sector to wherever the drive told it that it actually wrote the bio, right? If that's true, then that implies that need_zeroout must always be false for an append operation, right? Does that also mean that the directio request has to be aligned to an fs block and not just the sector size? Can userspace send a directio append that crosses a zone boundary? If so, what happens if a direct append to a lower address fails but a direct append to a higher address succeeds? > +{ > + unsigned int opflags = REQ_SYNC | REQ_IDLE; > + > + if (!(dio->flags & IOMAP_DIO_WRITE)) { > + WARN_ON_ONCE(iomap->flags & IOMAP_F_ZONE_APPEND); > + return REQ_OP_READ; > + } > + > + if (iomap->flags & IOMAP_F_ZONE_APPEND) > + opflags |= REQ_OP_ZONE_APPEND; > + else > + opflags |= REQ_OP_WRITE; > + > + if (use_fua) > + opflags |= REQ_FUA; > + else > + dio->flags &= ~IOMAP_DIO_WRITE_FUA; > + > + return opflags; > +} > + > static loff_t > iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length, > struct iomap_dio *dio, struct iomap *iomap) > @@ -278,6 +306,13 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length, > bio->bi_private = dio; > bio->bi_end_io = iomap_dio_bio_end_io; > > + /* > + * Set the operation flags early so that bio_iov_iter_get_pages > + * can set up the page vector appropriately for a ZONE_APPEND > + * operation. > + */ > + bio->bi_opf = iomap_dio_bio_opflags(dio, iomap, use_fua); I'm also vaguely wondering how to communicate the write location back to the filesystem when the bio completes? btrfs handles the bio completion completely so it doesn't have a problem, but for other filesystems (cough future xfs cough) either we'd have to add a new callback for append operations; or I guess everyone could hook the bio endio. Admittedly that's not really your problem, and for all I know hch is already working on this. --D > + > ret = bio_iov_iter_get_pages(bio, dio->submit.iter); > if (unlikely(ret)) { > /* > @@ -292,14 +327,8 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length, > > n = bio->bi_iter.bi_size; > if (dio->flags & IOMAP_DIO_WRITE) { > - bio->bi_opf = REQ_OP_WRITE | REQ_SYNC | REQ_IDLE; > - if (use_fua) > - bio->bi_opf |= REQ_FUA; > - else > - dio->flags &= ~IOMAP_DIO_WRITE_FUA; > task_io_account_write(n); > } else { > - bio->bi_opf = REQ_OP_READ; > if (dio->flags & IOMAP_DIO_DIRTY) > bio_set_pages_dirty(bio); > } > diff --git a/include/linux/iomap.h b/include/linux/iomap.h > index 4d1d3c3469e9..1bccd1880d0d 100644 > --- a/include/linux/iomap.h > +++ b/include/linux/iomap.h > @@ -54,6 +54,7 @@ struct vm_fault; > #define IOMAP_F_SHARED 0x04 > #define IOMAP_F_MERGED 0x08 > #define IOMAP_F_BUFFER_HEAD 0x10 > +#define IOMAP_F_ZONE_APPEND 0x20 > > /* > * Flags set by the core iomap code during operations: > -- > 2.27.0 >