On Thu, Dec 07, 2023 at 01:50:46AM -0500, Theodore Ts'o wrote: > On Thu, Mar 01, 2018 at 12:41:44PM +1100, Dave Chinner wrote: > > From: Dave Chinner <dchinner@xxxxxxxxxx> > > > > If we are doing direct IO writes with datasync semantics, we often > > have to flush metadata changes along with the data write. However, > > if we are overwriting existing data, there are no metadata changes > > that we need to flush. In this case, optimising the IO by using > > FUA write makes sense. > > > > We know from teh IOMAP_F_DIRTY flag as to whether a specific inode > > requires a metadata flush - this is currently used by DAX to ensure > > extent modi$fication as stable in page fault operations. For direct > > IO writes, we can use it to determine if we need to flush metadata > > or not once the data is on disk. > > Hi, > > I've gotten an inquiry from some engineers at Microsoft who would > really like it if ext4 could use FUA writes when doing O_DSYNC writes, > since this is soemthing that SQL Server uses. In the discussion for > this patch series back in 2018[1], ext4 hadn't yet converted over to > iomap for Direct I/O, and so adding this feature for ext4 wasn't > really practical. > > [1] https://lore.kernel.org/all/20180319160650.mavedzwienzgwgqi@xxxxxxxxxxxxxx/ > > Today, ext4 does use iomap for DIO, but an experiment seems to > indicate that something hasn't been wired up to enable FUA for O_DSYNC > writes. I've looked at fs/iomap/direct-io.c and it wasn't immediately > obvious what I need to add to enable this feature. I've actually looked briefly into this in the past for ext4 - IIRC the problem is that mtime updates during the write are causing the inode to get marked dirty. Yeah, typical trace from a RWF_DSYNC direct IO through ext4 is: xfs_io-6002 [007] 2986.814841: ext4_es_lookup_extent_enter: dev 7,0 ino 12 lblk 0 xfs_io-6002 [007] 2986.814848: ext4_es_lookup_extent_exit: dev 7,0 ino 12 found 1 [0/30720) 34304 W xfs_io-6002 [007] 2986.814856: ext4_journal_start_inode: dev 7,0 blocks 2, rsv_blocks 0, revoke_creds 8, type 1, ino 12, caller ext4_dirty_inode+0x39 xfs_io-6002 [007] 2986.814875: ext4_mark_inode_dirty: dev 7,0 ino 12 caller ext4_dirty_inode+0x5c xfs_io-6002 [007] 2986.814917: iomap_dio_rw_begin: dev 7:0 ino 0xc size 0x40000000 offset 0x0 length 0x20000 done_before 0x0 flags DIRECT dio_flags aio 0 xfs_io-6002 [007] 2986.814922: iomap_iter: dev 7:0 ino 0xc pos 0x0 length 0x0 flags WRITE|DIRECT (0x11) ops ext4_iomap_overwrite_ops caller __iomap_dio_rw+0x1c2 So I think the problem may be through file_modified() from ext4_dio_write_checks() triggering mtime updates which then comes back through mark_inode_dirty and that ends up causing the inode to be marked dirty and tracked by the journal. Yup, there it is in generic_update_time(): if (updated & (S_ATIME|S_MTIME|S_CTIME)) dirty_flags = inode->i_sb->s_flags & SB_LAZYTIME ? I_DIRTY_TIME : I_DIRTY_SYNC; it's setting the inode I_DIRTY_SYNC for mtime updates. XFS does not do this. It implements the ->update_time() method so doesn't use generic_update_time(). XFS tracks pure timestamp updates separately to other dirty inode metadata. We then elide "timestamp dirty" state from the IOMAP_F_DIRTY checks because O_DSYNC doesn't require timestamps to be persisted. FWIW, the patch below allows testing FUA optimisations on loop devices.... -Dave. -- Dave Chinner david@xxxxxxxxxxxxx loop: add support for FUA writes From: Dave Chinner <dchinner@xxxxxxxxxx> Any file-backed loop device on a filesysetm that support the ->fsync operation can do FUA writes. All a FUA write requires from the loop device is RWF_DSYNC semantics for the specific REQ_FUA request that is being processed. i.e. that it is on stable storage before the request completes. This allows simple testing of things like iomap FUA optimisations, and should also provide better performance for loop device in situations where the upper filesystem is issuing FUA writes to try to avoid needing journal flushes. Using RWF_DSYNC will also allow the lower filesystem to attempt to use REQ_FUA on loop devices using direct IO and potentially improve the IO patterns all the way down the stack as a result. Prep: # losetup --direct-io=on /dev/loop0 /mnt/scratch/foo.img # mkfs.xfs /dev/loop0 # mount /dev/loop0 /mnt/test # xfs_io -fdc "pwrite -b 2m 0 1g" -c fsync /mnt/test/foo # sync Measure: # xfs_io -fdc "pwrite -V 1 -D -b 128k 0 1m" -c fsync /mnt/test/foo Before: # grep . /sys/block/loop0/queue/* .... /sys/block/loop0/queue/fua:0 .... # trace-cmd stream -e block* |grep xfs_io ..... <...>-4654 [007] 809.018173: block_rq_issue: 7,0 WS 131072 () 192 + 256 [xfs_io] <...>-4654 [007] 809.025910: block_bio_queue: 7,0 WS 448 + 256 [xfs_io] <...>-4654 [007] 809.025917: block_getrq: 7,0 WS 448 + 256 [xfs_io] .... After: # grep . /sys/block/loop0/queue/* .... /sys/block/loop0/queue/fua:1 .... # trace-cmd stream -e block* |grep xfs_io ..... <...>-4648 [006] 793.968169: block_rq_issue: 7,0 WFS 131072 () 192 + 256 [xfs_io] <...>-4648 [006] 793.977521: block_bio_queue: 7,0 WFS 448 + 256 [xfs_io] <...>-4648 [006] 793.977531: block_getrq: 7,0 WFS 448 + 256 [xfs_io] The upper XFS filesystem is issuing REQ_FUA writes to the loop device now. Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx> --- drivers/block/loop.c | 29 ++++++++++++++++++++--------- 1 file changed, 20 insertions(+), 9 deletions(-) diff --git a/drivers/block/loop.c b/drivers/block/loop.c index 9f2d412fc560..6f2df09b3179 100644 --- a/drivers/block/loop.c +++ b/drivers/block/loop.c @@ -238,7 +238,8 @@ static void loop_set_size(struct loop_device *lo, loff_t size) kobject_uevent(&disk_to_dev(lo->lo_disk)->kobj, KOBJ_CHANGE); } -static int lo_write_bvec(struct file *file, struct bio_vec *bvec, loff_t *ppos) +static int lo_write_bvec(struct file *file, struct bio_vec *bvec, loff_t *ppos, + rwf_t rw_flags) { struct iov_iter i; ssize_t bw; @@ -246,7 +247,7 @@ static int lo_write_bvec(struct file *file, struct bio_vec *bvec, loff_t *ppos) iov_iter_bvec(&i, ITER_SOURCE, bvec, 1, bvec->bv_len); file_start_write(file); - bw = vfs_iter_write(file, &i, ppos, 0); + bw = vfs_iter_write(file, &i, ppos, rw_flags); file_end_write(file); if (likely(bw == bvec->bv_len)) @@ -261,14 +262,14 @@ static int lo_write_bvec(struct file *file, struct bio_vec *bvec, loff_t *ppos) } static int lo_write_simple(struct loop_device *lo, struct request *rq, - loff_t pos) + loff_t pos, rwf_t rw_flags) { struct bio_vec bvec; struct req_iterator iter; int ret = 0; rq_for_each_segment(bvec, rq, iter) { - ret = lo_write_bvec(lo->lo_backing_file, &bvec, &pos); + ret = lo_write_bvec(lo->lo_backing_file, &bvec, &pos, rw_flags); if (ret < 0) break; cond_resched(); @@ -392,7 +393,7 @@ static void lo_rw_aio_complete(struct kiocb *iocb, long ret) } static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd, - loff_t pos, int rw) + loff_t pos, int rw, rwf_t rw_flags) { struct iov_iter iter; struct req_iterator rq_iter; @@ -446,6 +447,7 @@ static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd, cmd->iocb.ki_filp = file; cmd->iocb.ki_complete = lo_rw_aio_complete; cmd->iocb.ki_flags = IOCB_DIRECT; + kiocb_set_rw_flags(&cmd->iocb, rw_flags); cmd->iocb.ki_ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_NONE, 0); if (rw == ITER_SOURCE) @@ -464,6 +466,15 @@ static int do_req_filebacked(struct loop_device *lo, struct request *rq) { struct loop_cmd *cmd = blk_mq_rq_to_pdu(rq); loff_t pos = ((loff_t) blk_rq_pos(rq) << 9) + lo->lo_offset; + rwf_t rw_flags = 0; + + /* + * If we are being asked to do a REQ_FUA write, then we convert that to + * a datasync write so that the contents of the write are on stable + * storage before the write completes. + */ + if (rq->cmd_flags & REQ_FUA) + rw_flags = RWF_DSYNC; /* * lo_write_simple and lo_read_simple should have been covered @@ -490,12 +501,12 @@ static int do_req_filebacked(struct loop_device *lo, struct request *rq) return lo_fallocate(lo, rq, pos, FALLOC_FL_PUNCH_HOLE); case REQ_OP_WRITE: if (cmd->use_aio) - return lo_rw_aio(lo, cmd, pos, ITER_SOURCE); + return lo_rw_aio(lo, cmd, pos, ITER_SOURCE, rw_flags); else - return lo_write_simple(lo, rq, pos); + return lo_write_simple(lo, rq, pos, rw_flags); case REQ_OP_READ: if (cmd->use_aio) - return lo_rw_aio(lo, cmd, pos, ITER_DEST); + return lo_rw_aio(lo, cmd, pos, ITER_DEST, 0); else return lo_read_simple(lo, rq, pos); default: @@ -1077,7 +1088,7 @@ static int loop_configure(struct loop_device *lo, blk_mode_t mode, mapping_set_gfp_mask(mapping, lo->old_gfp_mask & ~(__GFP_IO|__GFP_FS)); if (!(lo->lo_flags & LO_FLAGS_READ_ONLY) && file->f_op->fsync) - blk_queue_write_cache(lo->lo_queue, true, false); + blk_queue_write_cache(lo->lo_queue, true, true); if (config->block_size) bsize = config->block_size;