Since the exact same report was sent to the ext4 list, I'm CC'ing
Dave's response to Ted and fstests, which seems an appropriate crowd
for this discussion. I think that future CrashMonkey findings should
be CC'ed to either fsdevel or fstests, especially if more than one
filesystem is affected.

Thanks,
Amir.

On Tue, Mar 13, 2018 at 6:21 AM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> On Mon, Mar 12, 2018 at 09:15:25PM -0500, Jayashree Mohan wrote:
>> Hi,
>>
>> We've encountered what seems to be a crash consistency bug in xfs
>> (kernel 4.15) due to the interaction between a delayed allocation
>> write and an unaligned fallocate (zero range): the zero-range
>> operation is not persisted even when followed by an fsync, thereby
>> leading to the loss of an fsynced metadata operation on a file.
>>
>> Say we create a disk image with known data and quick-format it.
>> 1. Write 65K of data to a new file.
>> 2. Zero out part of that file using fallocate zero-range over
>>    (60K+128) - (60K+128+4096), an unaligned block.
>> 3. fsync the file.
>> 4. Crash.
>
> So, effectively:
>
> $ xfs_io -xft -c "pwrite 0 65k" \
>         -c "fzero $((60 * 1024 + 128)) 4k" \
>         -c fsync \
>         -c "shutdown" \
>         /mnt/scratch/testfile
> $
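For reference, the same four steps can be driven straight from the
syscall interface. A minimal sketch, assuming the scratch file path
above and a fixed nonzero fill pattern; error handling is mostly
elided:

    /* repro.c: pwrite 65k, zero an unaligned 4k range, fsync.
     * Build: cc -o repro repro.c
     */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <linux/falloc.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
            static char buf[65 * 1024];
            int fd = open("/mnt/scratch/testfile",
                          O_CREAT | O_TRUNC | O_RDWR, 0644);

            if (fd < 0)
                    return 1;
            memset(buf, 0xaa, sizeof(buf));
            /* Step 1: 65k delayed-allocation write at offset 0. */
            pwrite(fd, buf, sizeof(buf), 0);
            /* Step 2: zero the unaligned range [60k+128, 60k+128+4k). */
            fallocate(fd, FALLOC_FL_ZERO_RANGE, 60 * 1024 + 128, 4096);
            /* Step 3: fsync - the zero range must survive any crash
             * that happens after this returns. */
            fsync(fd);
            /* Step 4: the crash itself is simulated externally
             * (shutdown ioctl, power cut, or CrashMonkey). */
            return 0;
    }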
>> If we crash after the fsync, and allow reordering of the block IOs
>> between two flush/fua commands using CrashMonkey [1], then we can
>> end up not persisting the zero_range command from step 2 above,
>> even though we crashed after an fsync.
>
> Sorry, but what do you mean by "reordering of the block IOs"?
>
>> This workload was inspired by xfstests generic/042, which tests for
>> stale data exposure using aligned fallocate commands. It's worth
>> noting that f2fs and btrfs pass our test cleanly - irrespective of
>> the order of bios, user data is intact and the fzero operation is
>> correctly persisted.
>>
>> To reproduce this bug using CrashMonkey, simply run:
>>
>>   ./c_harness -f /dev/sda -d /dev/cow_ram0 -t xfs -e 102400 -s 1000 -v \
>>       tests/generic_042/generic_042_fzero_unaligned.so
>>
>> and take a look at the <timestamp>-generic_042_fzero_unaligned.log
>> created in the build directory. This file has the list of block IOs
>> issued during the workload and the permutation of bios that leads to
>> this bug. You can also verify using blktrace that CrashMonkey only
>> reorders bios between two barrier operations (so such a crash state
>> could be reached by reordering blocks in the storage stack). Note
>> that tools like dm-log-writes cannot capture this bug, because it
>> arises from reordering blocks between barrier operations.
>>
>> Possible reason for this bug:
>> On closely inspecting the reason for this bug, we discovered that
>> the problem lies in updating a data block twice, without a barrier
>> operation between the two updates.
>
> There's nothing wrong with doing that if the filesystem has waited
> for completion of the first IO before doing subsequent operations
> on the block.
>
>> The blktrace for the above mentioned workload shows that sector 280
>> is updated twice - probably first due to the write and later due to
>> the fzero operation.
>
> Why is that happening? The trace I have from the above xfs_io
> command is this:
>
> 8,96  2  613  0.081499295  3635  Q  WS 280 + 16 [xfs_io]
> 8,96  2  617  0.081982168     0  C  WS 280 + 16 [0]
> 8,96  2  618  0.082033206  3635  Q   R 280 + 8 [xfs_io]
> 8,96  2  622  0.082420539     0  C   R 280 + 8 [0]
> 8,96  2  623  0.082475920  3635  Q   R 288 + 8 [xfs_io]
> 8,96  2  627  0.082988642     0  C   R 288 + 8 [0]
> 8,96  2  628  0.083050032  3635  Q  WS 160 + 136 [xfs_io]
> 8,96  2  632  0.084414651     0  C  WS 160 + 136 [0]
> 8,96  2  633  0.084469733  3635  Q  FWFSM 20971632 + 8 [xfs_io]
> 8,96  2  636  0.107483279     0  C  WFSM 20971632 + 8 [0]
> 8,96  2  637  0.107485345     0  C  WFSM 20971632 [0]
>
> Which shows serialised IO dispatch and completion, and no double
> write. You've only shown queueing of writes, which doesn't tell me
> about completion order, and that is what really matters.
>
> In more detail:
>
> 1. zero-range forces a data flush to occur and then waits for it
>    to complete:
>
> ....
> xfs_file_fallocate
>   xfs_zero_file_space(60-68k)
>     xfs_free_file_space
>       xfs_flush_unmap_range()
>         filemap_write_and_wait_range()
>           xfs_writepage: pgoff 0xf000 size 0x10400
>           xfs_writepage: pgoff 0x10000 size 0x10400
> 8,96  2  613  0.081499295  3635  Q  WS 280 + 16 [xfs_io]
> 8,96  2  617  0.081982168     0  C  WS 280 + 16 [0]
>       truncate_pagecache_range()
>         xfs_invalidatepage: pgoff 0xf000 size 0x10400
>         xfs_releasepage: pgoff 0xf000 size 0x10400
>         xfs_invalidatepage: pgoff 0x10000 size 0x10400
>         xfs_releasepage: pgoff 0x10000 size 0x10400
>     xfs_zero_range()
> 8,96  2  618  0.082033206  3635  Q   R 280 + 8 [xfs_io]
> 8,96  2  622  0.082420539     0  C   R 280 + 8 [0]
> 8,96  2  623  0.082475920  3635  Q   R 288 + 8 [xfs_io]
> 8,96  2  627  0.082988642     0  C   R 288 + 8 [0]
>     xfs_alloc_file_space()
>       xfs_map_blocks_found: startoff 0x0 startblock 68 blockcount 0x30
>       <nothing to allocate!>
> .....
>
> After the fzero command, we then run fsync:
>
> xfs_file_fsync()
>   filemap_write_and_wait_range()
>     xfs_writepage: pgoff 0x0 size 0x10400
>     ....
>     xfs_writepage: pgoff 0x10000 size 0x10400
> 8,96  2  628  0.083050032  3635  Q  WS 160 + 136 [xfs_io]
> 8,96  2  632  0.084414651     0  C  WS 160 + 136 [0]
>     <data writeback complete>
>   xfs_log_force_lsn()
>     xfs_buf_submit: flags WRITE|NO_IOACCT|ASYNC|SYNCIO|FUA|FLUSH|PAGES
> 8,96  2  633  0.084469733  3635  Q  FWFSM 20971632 + 8 [xfs_io]
> 8,96  2  636  0.107483279     0  C  WFSM 20971632 + 8 [0]
> 8,96  2  637  0.107485345     0  C  WFSM 20971632 [0]
>
> And that's the log force. That's all correct and expected behaviour
> for pwrite, fzero and fsync. And when I dump the data in the file
> after unmount/mount (because the shutdown stops all IO from the
> filesystem), I get:
>
> 000000 146715 146715 146715 146715 146715 146715 146715 146715
> *
> 00f080 000000 000000 000000 000000 000000 000000 000000 000000
> *
> 010080 146715 146715 146715 146715 146715 146715 146715 146715
> *
> 010400
>
> exactly the right data in the file.
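The expected post-remount layout can also be checked programmatically.
A small sketch along the same lines as the repro above (the nonzero
fill pattern and file path are assumptions carried over from that
sketch), verifying that exactly the 4k at offset 60k+128 reads back
as zeroes:

    /* check.c: verify [61568, 65664) is zeroed and the rest of the
     * 65k file still holds the fill pattern.
     * Build: cc -o check check.c
     */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
            static unsigned char buf[65 * 1024];
            long i, zstart = 60 * 1024 + 128, zend = zstart + 4096;
            int fd = open("/mnt/scratch/testfile", O_RDONLY);

            if (fd < 0 || pread(fd, buf, sizeof(buf), 0) != sizeof(buf))
                    return 1;
            for (i = 0; i < (long)sizeof(buf); i++) {
                    /* Inside the zero range a byte must be 0;
                     * outside it must still be the fill pattern. */
                    int in_zero = (i >= zstart && i < zend);
                    if (in_zero != (buf[i] == 0)) {
                            fprintf(stderr, "mismatch at offset %ld\n", i);
                            return 1;
                    }
            }
            puts("zero range persisted correctly");
            return 0;
    }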
>> However, notice that there is no barrier between the two updates,
>> and using CrashMonkey we see the above mentioned bug when these
>> two updates to the same block are reordered. Essentially, a
>> reordering here means the fzero goes through first, and is later
>> overwritten by the delayed allocation write.
>>
>> 0.000179069 7104 Q WS 280 + 16 [c_harness]
>> 0.000594994 7104 Q R 280 + 8 [c_harness]
>> 0.000598216 7104 Q R 288 + 8 [c_harness]
>> 0.000620552 7104 Q WS 160 + 120 [c_harness]
>> 0.000653742 7104 Q WS 280 + 16 [c_harness]
>
> So this is where there's a difference. Note the write in my case:
>
> 0.083050032 3635 Q WS 160 + 136 [xfs_io]
>
> It's a single IO - for some reason the kernel you are running has
> split it into two IOs, but that doesn't make it wrong or invalid,
> and it's not indicative of a bug in any way.
>
> What's wrong here is your assumption that you can reorder it around
> the previous IO to those blocks. That's *simply not possible*,
> because the filesystem has waited for completion of each IO before
> issuing the next. The IO serialisation is being done at the
> filesystem level, not the IO level, and so CrashMonkey is
> effectively breaking the completion-ordering guarantees that
> storage hardware provides us with.
>
>> 0.000733154 7104 Q FWFSM 102466 + 3 [c_harness]
>
> IOWs, once this pre-flush has completed, the data on disk is
> supposed to be that of the latter write, because that one was
> received by the storage well after the first one was completed.
> Reordering a later IO in front of an already-completed dependent IO
> (read or write!) is simply not allowed.
>
>> This seems to be a bug, as it is not persisting the metadata
>> operation even in the presence of fsync.
>>
>> Let me know if I am missing some detail here.
>
> Filesystem IO ordering in the Linux stack is not based on
> block/request-level flush barriers, i.e. there's no such thing as
> "IO barriers" anymore. Filesystem IO ordering is based on a
> completion-to-dispatch serialisation model at the filesystem level,
> and that cannot be observed by looking at block layer traces....
>
> This really looks like a CrashMonkey bug, not a filesystem problem.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@xxxxxxxxxxxxx
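The completion-to-dispatch model Dave describes gives a mechanical
rule for tools like CrashMonkey: two writes may only be permuted in a
simulated crash state if their lifetimes overlapped, i.e. the later
one was dispatched before the earlier one completed. A hypothetical
sketch of that legality check (the io_record layout is invented for
illustration; a real tool would fill in the Q/C timestamps from its
own block-level log):

    #include <stdbool.h>

    /* One logged block IO: when it was dispatched (blktrace 'Q')
     * and when its completion was signalled (blktrace 'C'). */
    struct io_record {
            double queue_time;
            double complete_time;
    };

    /* A later IO may legally be placed before an earlier IO in a
     * crash image only if both were in flight at the same time.
     * The two writes to sector 280 in the c_harness trace fail this
     * test: the filesystem waited for the first to complete before
     * dispatching the second, so they must never be swapped. */
    static bool may_reorder(const struct io_record *earlier,
                            const struct io_record *later)
    {
            return later->queue_time < earlier->complete_time;
    }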