Since the exact same report was sent to the ext4 list, I'm CC'ing
Dave's response to Ted and fstests, which seems an appropriate crowd
for this discussion. I think that future CrashMonkey findings should
be CC'ed to either fsdevel or fstests, especially if more than one
filesystem is affected.

Thanks,
Amir.

On Tue, Mar 13, 2018 at 6:21 AM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> On Mon, Mar 12, 2018 at 09:15:25PM -0500, Jayashree Mohan wrote:
>> Hi,
>>
>> We've encountered what seems to be a crash consistency bug in xfs
>> (kernel 4.15) due to the interaction between a delayed allocation
>> write and an unaligned fallocate (zero range): the zero-range
>> operation is not persisted even when followed by an fsync, thereby
>> leading to the loss of an fsynced metadata operation on a file.
>>
>> Say we create a disk image with known data and quick-format it.
>> 1. Write 65K of data to a new file.
>> 2. Zero out part of that file using fallocate zero-range over
>>    (60K+128) - (60K+128+4096), an unaligned block.
>> 3. fsync the file.
>> 4. Crash.
>
> So, effectively:
>
> $ xfs_io -xft -c "pwrite 0 65k" \
>         -c "fzero $((60 * 1024 + 128)) 4k" \
>         -c fsync \
>         -c "shutdown" \
>         /mnt/scratch/testfile
> $
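For reference, the same four steps can be driven straight from the
syscall interface. A minimal sketch, assuming the scratch file path
above and a fixed nonzero fill pattern; error handling is mostly
elided:

    /* repro.c: pwrite 65k, zero an unaligned 4k range, fsync.
     * Build: cc -o repro repro.c
     */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <linux/falloc.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
            static char buf[65 * 1024];
            int fd = open("/mnt/scratch/testfile",
                          O_CREAT | O_TRUNC | O_RDWR, 0644);

            if (fd < 0)
                    return 1;
            memset(buf, 0xaa, sizeof(buf));
            /* Step 1: 65k delayed-allocation write at offset 0. */
            pwrite(fd, buf, sizeof(buf), 0);
            /* Step 2: zero the unaligned range [60k+128, 60k+128+4k). */
            fallocate(fd, FALLOC_FL_ZERO_RANGE, 60 * 1024 + 128, 4096);
            /* Step 3: fsync - the zero range must survive any crash
             * that happens after this returns. */
            fsync(fd);
            /* Step 4: the crash itself is simulated externally
             * (shutdown ioctl, power cut, or CrashMonkey). */
            return 0;
    }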
>> If we crash after the fsync, and allow reordering of the block IOs
>> between two flush/fua commands using CrashMonkey [1], then we can
>> end up not persisting the zero_range command from step 2 above,
>> even though we crashed after an fsync.
>
> Sorry, but what do you mean by "reordering of the block IOs"?
>
>> This workload was inspired by xfstests generic/042, which tests for
>> stale data exposure using aligned fallocate commands. It's worth
>> noting that f2fs and btrfs pass our test cleanly - irrespective of
>> the order of bios, user data is intact and the fzero operation is
>> correctly persisted.
>>
>> To reproduce this bug using CrashMonkey, simply run:
>>
>>   ./c_harness -f /dev/sda -d /dev/cow_ram0 -t xfs -e 102400 -s 1000 -v \
>>       tests/generic_042/generic_042_fzero_unaligned.so
>>
>> and take a look at the <timestamp>-generic_042_fzero_unaligned.log
>> created in the build directory. This file has the list of block IOs
>> issued during the workload and the permutation of bios that leads to
>> this bug. You can also verify using blktrace that CrashMonkey only
>> reorders bios between two barrier operations (so such a crash state
>> could be reached by reordering blocks in the storage stack). Note
>> that tools like dm-log-writes cannot capture this bug, because it
>> arises from reordering blocks between barrier operations.
>>
>> Possible reason for this bug:
>> On closely inspecting the reason for this bug, we discovered that
>> the problem lies in updating a data block twice, without a barrier
>> operation between the two updates.
>
> There's nothing wrong with doing that if the filesystem has waited
> for completion of the first IO before doing subsequent operations
> on the block.
>
>> The blktrace for the above mentioned workload shows that sector 280
>> is updated twice - probably first due to the write and later due to
>> the fzero operation.
>
> Why is that happening? The trace I have from the above xfs_io
> command is this:
>
> 8,96  2  613  0.081499295  3635  Q  WS 280 + 16 [xfs_io]
> 8,96  2  617  0.081982168     0  C  WS 280 + 16 [0]
> 8,96  2  618  0.082033206  3635  Q   R 280 + 8 [xfs_io]
> 8,96  2  622  0.082420539     0  C   R 280 + 8 [0]
> 8,96  2  623  0.082475920  3635  Q   R 288 + 8 [xfs_io]
> 8,96  2  627  0.082988642     0  C   R 288 + 8 [0]
> 8,96  2  628  0.083050032  3635  Q  WS 160 + 136 [xfs_io]
> 8,96  2  632  0.084414651     0  C  WS 160 + 136 [0]
> 8,96  2  633  0.084469733  3635  Q  FWFSM 20971632 + 8 [xfs_io]
> 8,96  2  636  0.107483279     0  C  WFSM 20971632 + 8 [0]
> 8,96  2  637  0.107485345     0  C  WFSM 20971632 [0]
>
> Which shows serialised IO dispatch and completion, and no double
> write. You've only shown queueing of writes, which doesn't tell me
> about completion order, and that is what really matters.
>
> In more detail:
>
> 1. zero-range forces a data flush to occur and then waits for it
>    to complete:
>
> ....
> xfs_file_fallocate
>   xfs_zero_file_space(60-68k)
>     xfs_free_file_space
>       xfs_flush_unmap_range()
>         filemap_write_and_wait_range()
>           xfs_writepage: pgoff 0xf000 size 0x10400
>           xfs_writepage: pgoff 0x10000 size 0x10400
> 8,96  2  613  0.081499295  3635  Q  WS 280 + 16 [xfs_io]
> 8,96  2  617  0.081982168     0  C  WS 280 + 16 [0]
>       truncate_pagecache_range()
>         xfs_invalidatepage: pgoff 0xf000 size 0x10400
>         xfs_releasepage: pgoff 0xf000 size 0x10400
>         xfs_invalidatepage: pgoff 0x10000 size 0x10400
>         xfs_releasepage: pgoff 0x10000 size 0x10400
>     xfs_zero_range()
> 8,96  2  618  0.082033206  3635  Q   R 280 + 8 [xfs_io]
> 8,96  2  622  0.082420539     0  C   R 280 + 8 [0]
> 8,96  2  623  0.082475920  3635  Q   R 288 + 8 [xfs_io]
> 8,96  2  627  0.082988642     0  C   R 288 + 8 [0]
>     xfs_alloc_file_space()
>       xfs_map_blocks_found: startoff 0x0 startblock 68 blockcount 0x30
>       <nothing to allocate!>
> .....
>
> After the fzero command, we then run fsync:
>
> xfs_file_fsync()
>   filemap_write_and_wait_range()
>     xfs_writepage: pgoff 0x0 size 0x10400
>     ....
>     xfs_writepage: pgoff 0x10000 size 0x10400
> 8,96  2  628  0.083050032  3635  Q  WS 160 + 136 [xfs_io]
> 8,96  2  632  0.084414651     0  C  WS 160 + 136 [0]
>     <data writeback complete>
>   xfs_log_force_lsn()
>     xfs_buf_submit: flags WRITE|NO_IOACCT|ASYNC|SYNCIO|FUA|FLUSH|PAGES
> 8,96  2  633  0.084469733  3635  Q  FWFSM 20971632 + 8 [xfs_io]
> 8,96  2  636  0.107483279     0  C  WFSM 20971632 + 8 [0]
> 8,96  2  637  0.107485345     0  C  WFSM 20971632 [0]
>
> And that's the log force. That's all correct and expected behaviour
> for pwrite, fzero and fsync. And when I dump the data in the file
> after unmount/mount (because the shutdown stops all IO from the
> filesystem), I get:
>
> 000000 146715 146715 146715 146715 146715 146715 146715 146715
> *
> 00f080 000000 000000 000000 000000 000000 000000 000000 000000
> *
> 010080 146715 146715 146715 146715 146715 146715 146715 146715
> *
> 010400
>
> exactly the right data in the file.
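The expected post-remount layout can also be checked programmatically.
A small sketch along the same lines as the repro above (the nonzero
fill pattern and file path are assumptions carried over from that
sketch), verifying that exactly the 4k at offset 60k+128 reads back
as zeroes:

    /* check.c: verify [61568, 65664) is zeroed and the rest of the
     * 65k file still holds the fill pattern.
     * Build: cc -o check check.c
     */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
            static unsigned char buf[65 * 1024];
            long i, zstart = 60 * 1024 + 128, zend = zstart + 4096;
            int fd = open("/mnt/scratch/testfile", O_RDONLY);

            if (fd < 0 || pread(fd, buf, sizeof(buf), 0) != sizeof(buf))
                    return 1;
            for (i = 0; i < (long)sizeof(buf); i++) {
                    /* Inside the zero range a byte must be 0;
                     * outside it must still be the fill pattern. */
                    int in_zero = (i >= zstart && i < zend);
                    if (in_zero != (buf[i] == 0)) {
                            fprintf(stderr, "mismatch at offset %ld\n", i);
                            return 1;
                    }
            }
            puts("zero range persisted correctly");
            return 0;
    }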
>> However, notice that there is no barrier between the two updates,
>> and using CrashMonkey we see the above mentioned bug when these
>> two updates to the same block are reordered. Essentially, a
>> reordering here means the fzero goes through first, and is later
>> overwritten by the delayed allocation write.
>>
>> 0.000179069 7104 Q WS 280 + 16 [c_harness]
>> 0.000594994 7104 Q R 280 + 8 [c_harness]
>> 0.000598216 7104 Q R 288 + 8 [c_harness]
>> 0.000620552 7104 Q WS 160 + 120 [c_harness]
>> 0.000653742 7104 Q WS 280 + 16 [c_harness]
>
> So this is where there's a difference. Note the write in my case:
>
> 0.083050032 3635 Q WS 160 + 136 [xfs_io]
>
> It's a single IO - for some reason the kernel you are running has
> split it into two IOs, but that doesn't make it wrong or invalid,
> and it's not indicative of a bug in any way.
>
> What's wrong here is your assumption that you can reorder it around
> the previous IO to those blocks. That's *simply not possible*,
> because the filesystem has waited for completion of each IO before
> issuing the next. The IO serialisation is being done at the
> filesystem level, not the IO level, and so CrashMonkey is
> effectively breaking the completion-ordering guarantees that
> storage hardware provides us with.
>
>> 0.000733154 7104 Q FWFSM 102466 + 3 [c_harness]
>
> IOWs, once this pre-flush has completed, the data on disk is
> supposed to be that of the latter write, because that one was
> received by the storage well after the first one was completed.
> Reordering a later IO in front of an already-completed dependent IO
> (read or write!) is simply not allowed.
>
>> This seems to be a bug, as it is not persisting the metadata
>> operation even in the presence of fsync.
>>
>> Let me know if I am missing some detail here.
>
> Filesystem IO ordering in the Linux stack is not based on
> block/request-level flush barriers, i.e. there's no such thing as
> "IO barriers" anymore. Filesystem IO ordering is based on a
> completion-to-dispatch serialisation model at the filesystem level,
> and that cannot be observed by looking at block layer traces....
>
> This really looks like a CrashMonkey bug, not a filesystem problem.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@xxxxxxxxxxxxx
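The completion-to-dispatch model Dave describes gives a mechanical
rule for tools like CrashMonkey: two writes may only be permuted in a
simulated crash state if their lifetimes overlapped, i.e. the later
one was dispatched before the earlier one completed. A hypothetical
sketch of that legality check (the io_record layout is invented for
illustration; a real tool would fill in the Q/C timestamps from its
own block-level log):

    #include <stdbool.h>

    /* One logged block IO: when it was dispatched (blktrace 'Q')
     * and when its completion was signalled (blktrace 'C'). */
    struct io_record {
            double queue_time;
            double complete_time;
    };

    /* A later IO may legally be placed before an earlier IO in a
     * crash image only if both were in flight at the same time.
     * The two writes to sector 280 in the c_harness trace fail this
     * test: the filesystem waited for the first to complete before
     * dispatching the second, so they must never be swapped. */
    static bool may_reorder(const struct io_record *earlier,
                            const struct io_record *later)
    {
            return later->queue_time < earlier->complete_time;
    }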