On Tue 09-01-24 17:05:00, Dave Chinner wrote: > On Mon, Jan 08, 2024 at 10:31:12PM +0100, Jan Kara wrote: > > Hello, > > > > I've found this when going through some old email. Were you able to debug > > this? > > > > On Wed 06-09-23 21:15:01, Free Ekanayaka wrote: > > > I'm using Linux 6.4.0 from Debian/testing (but tried this with 6.5.1 > > > too). > > > > > > I've created an ext4 filesystem with journalling disabled on an NVMe > > > drive: > > > > > > mkfs.ext4 -O ^has_journal -F /dev/nvme0n1p6 > > > > > > I have a program that creates and open a new file with O_DIRECT, and > > > sets its size to 8M with posix_fallocate(), something like: > > > > > > fd = open("/dir/file", O_CREAT | O_WRONLY | O_DIRECT); > > > posix_fallocate(fd, 0, 8 * 1024 * 1024); > > > fsync(fd); > > > dirfd = open("/dir", O_RDONLY | O_DIRECTORY); > > > fsync(dirfd); > > > > > > and then it uses io_uring to perform a single write of 4096 bytes at the > > > beginning of the file, passing RWF_DSYNC to the submitted > > > IORING_OP_WRITE_FIXED entry, > > > > > > I would expect the kernel to tell the NVMe device to actually flush the > > > write, not only buffer it. However I measured the end-to-end latency of > > > the io_uring operation and it was very low, as if the write was only > > > buffered by the NVMe device, but not flushed. > > > > Yes, the kernel should issue device cache flush or mark the IO as FUA. <snip> > > > raft-benchmark-37801 [003] ..... 9904.831134: ext4_ext_map_blocks_exit: dev 259,5 ino 12 flags CREATE|UNWRIT|PRE_IO lblk 0 pblk 32887 len 1 mflags NMU ret 1 > > > raft-benchmark-37801 [003] ..... 9904.831135: ext4_es_lookup_extent_enter: dev 259,5 ino 12 lblk 0 > > > raft-benchmark-37801 [003] ..... 9904.831135: ext4_es_lookup_extent_exit: dev 259,5 ino 12 found 1 [0/1) 32887 UR > > > raft-benchmark-37801 [003] ..... 9904.831136: ext4_es_insert_extent: dev 259,5 ino 12 es [0/1) mapped 32887 status U > > > raft-benchmark-37801 [003] ..... 9904.831143: block_bio_remap: 259,0 WS 498455480 + 8 <- (259,5) 263096 > > > raft-benchmark-37801 [003] ..... 9904.831144: block_bio_queue: 259,0 WS 498455480 + 8 [raft-benchmark] > > > raft-benchmark-37801 [003] ..... 9904.831149: block_getrq: 259,0 WS 498455480 + 8 [raft-benchmark] > > > > Here we can see the indeed the write was submitted without the cache flush. > > However we can also see that the write was going into unwritten extent > > so... > > > > > raft-benchmark-37801 [003] ..... 9904.831149: block_plug: [raft-benchmark] > > > raft-benchmark-37801 [003] ..... 9904.831153: nvme_setup_cmd: nvme0: disk=nvme0n1, qid=4, cmdid=25169, nsid=1, flags=0x0, meta=0x0, cmd=(nvme_cmd_write slba=498455480, len=7, ctrl=0x0, dsmgmt=0, reftag=0) > > > raft-benchmark-37801 [003] ..... 9904.831159: block_rq_issue: 259,0 WS 4096 () 498455480 + 8 [raft-benchmark] > > > raft-benchmark-37801 [003] d.h.. 9904.831173: nvme_sq: nvme0: disk=nvme0n1, qid=4, head=783, tail=783 > > > raft-benchmark-37801 [003] d.h.. 9904.831177: nvme_complete_rq: nvme0: disk=nvme0n1, qid=4, cmdid=25169, res=0x0, retries=0, flags=0x0, status=0x0 > > > raft-benchmark-37801 [003] d.h.. 9904.831178: block_rq_complete: 259,0 WS () 498455480 + 8 [0] > > > kworker/3:1-30279 [003] ..... 9904.831193: ext4_journal_start_inode: dev 259,5 blocks 8, rsv_blocks 0, revoke_creds 8, type 3, ino 12, caller ext4_convert_unwritten_extents+0xb4/0x260 [ext4] > > > > ... after io completed here, we need to convert unwritten extent into a > > written one. > > > > > kworker/3:1-30279 [003] ..... 9904.831193: ext4_es_lookup_extent_enter: dev 259,5 ino 12 lblk 0 > > > kworker/3:1-30279 [003] ..... 9904.831194: ext4_es_lookup_extent_exit: dev 259,5 ino 12 found 1 [0/1) 32887 U > > > kworker/3:1-30279 [003] ..... 9904.831194: ext4_ext_map_blocks_enter: dev 259,5 ino 12 lblk 0 len 1 flags CREATE|UNWRIT|CONVERT > > > kworker/3:1-30279 [003] ..... 9904.831195: ext4_es_cache_extent: dev 259,5 ino 12 es [0/1) mapped 32887 status U > > > kworker/3:1-30279 [003] ..... 9904.831195: ext4_ext_show_extent: dev 259,5 ino 12 lblk 0 pblk 32887 len 1 > > > kworker/3:1-30279 [003] ..... 9904.831196: ext4_ext_handle_unwritten_extents: dev 259,5 ino 12 m_lblk 0 m_pblk 32887 m_len 1 flags CREATE|UNWRIT|CONVERT|METADATA_NOFAIL allocated 1 newblock 32887 > > > kworker/3:1-30279 [003] ..... 9904.831196: ext4_mark_inode_dirty: dev 259,5 ino 12 caller ext4_ext_map_blocks+0xeee/0x1980 [ext4] > > > kworker/3:1-30279 [003] ..... 9904.831197: block_touch_buffer: 259,5 sector=135 size=4096 > > > kworker/3:1-30279 [003] ..... 9904.831198: block_dirty_buffer: 259,5 sector=135 size=4096 > > > kworker/3:1-30279 [003] ..... 9904.831199: ext4_ext_map_blocks_exit: dev 259,5 ino 12 flags CREATE|UNWRIT|CONVERT lblk 0 pblk 32887 len 1 mflags M ret 1 > > > kworker/3:1-30279 [003] ..... 9904.831199: ext4_es_insert_extent: dev 259,5 ino 12 es [0/1) mapped 32887 status W > > > kworker/3:1-30279 [003] ..... 9904.831200: ext4_mark_inode_dirty: dev 259,5 ino 12 caller ext4_convert_unwritten_extents+0x1e2/0x260 [ext4] > > > kworker/3:1-30279 [003] ..... 9904.831200: block_touch_buffer: 259,5 sector=135 size=4096 > > > kworker/3:1-30279 [003] ..... 9904.831201: block_dirty_buffer: 259,5 sector=135 size=4096 > > > > The conversion to written extents happened here. > > Which dirties metadata buffers.... Yes. > > > kworker/3:1-30279 [003] ..... 9904.831202: ext4_sync_file_enter: dev 259,5 ino 12 parent 2 datasync 1 > > > kworker/3:1-30279 [003] ..... 9904.831203: ext4_sync_file_exit: dev 259,5 ino 12 ret 0 > > > > And here we've called fdatasync() for the inode. Now this should have > > submitted a cache flush through blkdev_issue_flush() but that doesn't seem > > to happen. > > I suspect correct crash recovery behaviour here requires > multiple cache flushes to ensure the correct ordering or data vs > metadata updates. i.e: > > .... > data write completes > fdatasync() > cache flush to ensure data is on disk > if (dirty metadata) { > issue metadata write(s) for extent records and inode > .... > metadata write(s) complete > cache flush to ensure metadata is on disk > } > > If we don't flush the cache between the data write and the metadata > write(s) that marks the extent as written, we could have a state > after a power fail where the metadata writes hit the disk > before the data write and after the system comes back up that file > now it exposes stale data to the user. So when we are journalling, we end up doing this (we flush data disk before writing and flushing the transaction commit block in jbd2). When we are not doing journalling (which is the case here), our crash consistency guarantees are pretty weak. We want to guarantee that if fsync(2) successfully completed on the file before the crash, user should see the data there. But not much more - i.e., stale data exposure in case of crash is fully within what sysadmin should expect from a filesystem without a journal. After all even if we improved fsync(2) as you suggest, we'd still have normal page writeback where we'd have to separate data & metadata writes with cache flushes and I don't think the performace overhead is something people would be willing to pay. So yes, nojournal mode is unsafe in case of crash. It is there for people not caring about the filesystem after the crash, single user filesystems doing data verification in userspace and similar special usecases. Still, I think we want at least minimum fsync(2) guarantees if nothing else for backwards compatibility with ext2. Thanks for the comments anyway :) Honza -- Jan Kara <jack@xxxxxxxx> SUSE Labs, CR