On Fri, May 05, 2023 at 02:19:28PM -0700, Darrick J. Wong wrote: > On Thu, May 04, 2023 at 03:00:06PM +1000, Dave Chinner wrote: > > On Wed, May 03, 2023 at 06:38:16PM +0000, John Garry wrote: > > > Add support to create bio's whose bi_sector and bi_size are aligned to and > > > multiple of atomic_write_unit, respectively. > > > > > > When we call iomap_dio_bio_iter() -> bio_iov_iter_get_pages() -> > > > __bio_iov_iter_get_pages(), we trim the bio to a multiple of > > > atomic_write_unit. > > > > > > As such, we expect the iomi start and length to have same size and > > > alignment requirements per iomap_dio_bio_iter() call. > > > > > > In iomap_dio_bio_iter(), ensure that for a non-dsync iocb that the mapping > > > is not dirty nor unmapped. > > > > > > Signed-off-by: John Garry <john.g.garry@xxxxxxxxxx> > > > --- > > > fs/iomap/direct-io.c | 72 ++++++++++++++++++++++++++++++++++++++++++-- > > > 1 file changed, 70 insertions(+), 2 deletions(-) > > > > > > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c > > > index f771001574d0..37c3c926dfd8 100644 > > > --- a/fs/iomap/direct-io.c > > > +++ b/fs/iomap/direct-io.c > > > @@ -36,6 +36,8 @@ struct iomap_dio { > > > size_t done_before; > > > bool wait_for_completion; > > > > > > + unsigned int atomic_write_unit; > > > + > > > union { > > > /* used during submission and for synchronous completion: */ > > > struct { > > > @@ -229,9 +231,21 @@ static inline blk_opf_t iomap_dio_bio_opflags(struct iomap_dio *dio, > > > return opflags; > > > } > > > > > > + > > > +/* > > > + * Note: For atomic writes, each bio which we create when we iter should have > > > + * bi_sector aligned to atomic_write_unit and also its bi_size should be > > > + * a multiple of atomic_write_unit. > > > + * The call to bio_iov_iter_get_pages() -> __bio_iov_iter_get_pages() > > > + * should trim the length to a multiple of atomic_write_unit for us. > > > + * This allows us to split each bio later in the block layer to fit > > > + * request_queue limit. > > > + */ > > > static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter, > > > struct iomap_dio *dio) > > > { > > > + bool atomic_write = (dio->iocb->ki_flags & IOCB_ATOMIC) && > > > + (dio->flags & IOMAP_DIO_WRITE); > > > const struct iomap *iomap = &iter->iomap; > > > struct inode *inode = iter->inode; > > > unsigned int fs_block_size = i_blocksize(inode), pad; > > > @@ -249,6 +263,14 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter, > > > !bdev_iter_is_aligned(iomap->bdev, dio->submit.iter)) > > > return -EINVAL; > > > > > > + > > > + if (atomic_write && !iocb_is_dsync(dio->iocb)) { > > > + if (iomap->flags & IOMAP_F_DIRTY) > > > + return -EIO; > > > + if (iomap->type != IOMAP_MAPPED) > > > + return -EIO; > > > + } > > > > IDGI. If the iomap had space allocated for this dio iteration, > > then IOMAP_F_DIRTY will be set and it is likely (guaranteed for XFS) > > that the iomap type will be IOMAP_UNWRITTEN. Indeed, if we are doing > > a write into preallocated space (i.e. from fallocate()) then this > > will cause -EIO on all RWF_ATOMIC IO to that file unless RWF_DSYNC > > is also used. > > > > "For a power fail, for each individual application block, all or > > none of the data to be written." > > > > Ok, does this means RWF_ATOMIC still needs fdatasync() to guarantee > > that the data makes it to stable storage? And the result is > > undefined until fdatasync() is run, but the device will guarantee > > that either all or none of the data will be on stable storage > > prior to the next device cache flush completing? > > > > i.e. does REQ_ATOMIC imply REQ_FUA, or does it require a separate > > device cache flush to commit the atomic IO to stable storage? > > From the SCSI and NVME device information that I've been presented, it > sounds like an explicit cache flush or FUA is required to persist the > data. Ok, that makes it sound like RWF_ATOMIC has the same data integrity semantics as normal DIO submission. i.e. the application has to specify data integrity requirements and/or provide integrity checkpoints itself. > > What about ordering - do the devices guarantee strict ordering of > > REQ_ATOMIC writes? i.e. if atomic write N is seen on disk, then all > > the previous atomic writes up to N will also be seen on disk? If > > not, how does the application and filesystem guarantee persistence > > of completed atomic writes? > > I /think/ the applications have to ensure ordering themselves. If Y > cannot appear before X is persisted, then the application must wait for > the ack for X, flush the cache, and only then send Y. RIght, I'd expect that completion-to-submission ordering is required with RWF_ATOMIC the same way it is required for normal DIO, but I've been around long enough to know that we can't make assumptions about data integrity semantics... > > i.e. If we still need a post-IO device cache flush to guarantee > > persistence and/or ordering of RWF_ATOMIC IOs, then the above code > > makes no sense - we'll still need fdatasync() to provide persistence > > checkpoints and that means we ensure metadata is also up to date > > at those checkpoints. > > I'll let the block layer developers weigh in on this, but I /think/ this > means that we require RWF_DSYNC for atomic block writes to written > mappings, and RWF_SYNC if iomap_begin gives us an unwritten/hole/dirty > mapping. RWF_DSYNC is functionally the same as RWF_OSYNC. The only difference is that RWF_OSYNC considers timestamps as dirty metadata, whilst RWF_DSYNC doesn't. Hence I don't think there's any functional difference w.r.t. data integrity by using OSYNC vs DSYNC... > > > @@ -592,6 +634,32 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter, > > > > > > blk_start_plug(&plug); > > > while ((ret = iomap_iter(&iomi, ops)) > 0) { > > > + if (atomic_write) { > > > + const struct iomap *_iomap = &iomi.iomap; > > > + loff_t iomi_length = iomap_length(&iomi); > > > + > > > + /* > > > + * Ensure length and start address is a multiple of > > > + * atomic_write_unit - this is critical. If the length > > > + * is not a multiple of atomic_write_unit, then we > > > + * cannot create a set of bio's in iomap_dio_bio_iter() > > > + * who are each a length which is a multiple of > > > + * atomic_write_unit. > > > + * > > > + * Note: It may be more appropiate to have this check > > > + * in iomap_dio_bio_iter() > > > + */ > > > + if ((iomap_sector(_iomap, iomi.pos) << SECTOR_SHIFT) % > > The file offset (and by extension the position) are not important for > deciding if we can issue an atomic write. Only the mapped LBA space on > the underlying device is important. > > IOWs, if we have a disk that can write a 64k aligned block atomically, > iomap only has to check that iomap->addr is aligned to a 64k boundary. > If that space happens to be mapped to file offset 57k, then it is indeed > possible to perform a 64k atomic write to the file starting at offset > 57k and ending at offset 121k, right? Yup, that was kinda what I was implying in pointing out that file offset does not reflect device IO alignment... > > Hence I think that we should be rejecting RWF_ATOMIC IOs that are > > larger than the maximum atomic write unit or cannot be dispatched in > > a single IO e.g. filesystem has allocated multiple minimum aligned > > extents and so a max len atomic write IO over that range must be > > broken up into multiple smaller IOs. > > > > We should be doing max atomic write size rejection high up in the IO > > path (e.g. filesystem ->write_iter() method) before we get anywhere > > near the DIO path, and we should be rejecting atomic write IOs in > > the DIO path during the ->iomap_begin() mapping callback if we can't > > map the entire atomic IO to a single aligned filesystem extent. > > > > i.e. the alignment checks and constraints need to be applied by the > > filesystem mapping code, not the layer that packs the pages into the > > bio as directed by the filesystem mapping.... > > Hmm. I think I see what you're saying here -- iomap should communicate > to ->iomap_begin that we want to perform an atomic write, and there had > better be either (a) a properly aligned mapping all ready to go; or (b) > the fs must perform an aligned allocation and map that in, or return no > mapping so the write fails. Exactly. This is how IOCB_NOWAIT works, too - we can reject it high up in the IO path if we can't get locks, and then if we have to do allocation in ->iomap_begin because there is no mapping available we reject the IO there. Hence I think we should use the same constraint checking model for RWF_ATOMIC - the constraints are slightly different, but the layers at which we can first resolve the various constraints are exactly the same... Cheers, Dave. -- Dave Chinner david@xxxxxxxxxxxxx