Re: Crash consistency bug in ext4 - interaction between delalloc and fzero

Hi,
Thanks for the quick reply.

>> We've encountered what seems to be a crash consistency bug in
>> ext4 (kernel 4.15) due to the interaction between a delayed
>> allocation write and an unaligned fallocate(zero range). Say we
>> create a disk image with known data and quick-format it.
>> 1. Now write 65K of data to a new file
>> 2. Zero out part of the above file using falloc_zero_range over
>>    (60K+128) - (60K+128+4096) - an unaligned range
>> 3. fsync the above file
>> 4. Crash
>>
>> If we crash after the fsync, and allow reordering of the block IOs
>> between two flush/fua commands using CrashMonkey [1], then we can
>> end up zeroing the file range from (64K+128) to 65K, which should be
>> untouched by the fallocate command. We expect this region to contain
>> the user-written data from step 1 above.
>>
>> This workload was inspired by xfstests generic/042, which tests for
>> stale data exposure using aligned fallocate commands. It's worth
>> noting that f2fs and btrfs pass our test cleanly - irrespective of
>> the order of the bios, user data is intact on those filesystems.
>>
>> To reproduce this bug using CrashMonkey, simply run:
>> ./c_harness -f /dev/sda -d /dev/cow_ram0 -t ext4 -e 10240 -s 1000 -v
>> tests/generic_042/generic_042_fzero_unaligned.so
>
> Hmm, I do not seem to be able to reproduce this problem. However, I am
> running in a virtual environment with a Virtio disk, so that might be
> the problem? Sorry if I am missing something; it's my first time trying
> CrashMonkey.

By not being able to reproduce the problem, do you mean that CrashMonkey
runs to completion and produces a summary block like the one below, but
with all tests passing cleanly?

Reordering tests ran 1000 tests with
passed cleanly: 936
passed fixed: 0
fsck required: 0
failed: 64
old file persisted: 0
file missing: 0
file data corrupted: 64
file metadata corrupted: 0
incorrect block count: 0
other: 0

If not, could you tell me what the output is?
We also run in a virtual environment - KVM or VirtualBox - so that
shouldn't be the issue.
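
In case it helps to reproduce outside CrashMonkey, the quoted steps 1-3
boil down to roughly the following. This is only a minimal sketch: the
mount point and file name are made up, and only the sizes and offsets
follow the report; step 4 (the crash) is injected by CrashMonkey, not
by this program.

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>
    #include <linux/falloc.h>

    int main(void)
    {
        char buf[65 * 1024];

        memset(buf, 0xaa, sizeof(buf));     /* known non-zero data */

        int fd = open("/mnt/test/foo", O_CREAT | O_RDWR, 0644);
        if (fd < 0)
            return 1;

        /* 1. 65K delayed-allocation write */
        if (pwrite(fd, buf, sizeof(buf), 0) != (ssize_t)sizeof(buf))
            return 1;

        /* 2. unaligned zero range: (60K+128) - (60K+128+4096) */
        if (fallocate(fd, FALLOC_FL_ZERO_RANGE, 60 * 1024 + 128, 4096))
            return 1;

        /* 3. fsync the file; 4. crash is simulated externally */
        if (fsync(fd))
            return 1;

        close(fd);
        return 0;
    }

After a clean remount, everything outside the 4096 bytes starting at
60K+128 should still hold the written data; in the failing reordered
crash states, the range (64K+128) - 65K comes back zeroed instead.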


> Also it's not yet clear to me how we can zero out the entire block
> instead of just a part of it because of the crash? Unless it was
> actually zero before we wrote to it - so isn't it a lost write rather
> than a zeroout?

Before we start the workload, we run a setup phase that fills up the
entire disk by writing known (non-zero) data to a file and then
unlinking it, so that the actual workload reuses those data blocks.
That is why I suspect the only way the block could end up zeroed out
is the fzero command: if the write had simply been lost, we should
instead see the stale data from the initial setup phase.
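
For completeness, a minimal sketch of that setup phase, under the same
assumptions about the mount point and file name (the pattern byte is
arbitrary, just non-zero):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        char pattern[4096];
        int fd;

        memset(pattern, 0x5a, sizeof(pattern));  /* known non-zero data */

        fd = open("/mnt/test/filler", O_CREAT | O_WRONLY, 0644);
        if (fd < 0)
            return 1;

        /* fill the free space with the known pattern */
        while (write(fd, pattern, sizeof(pattern)) == sizeof(pattern))
            ;

        fsync(fd);
        close(fd);

        /* free the blocks; their old contents remain on disk */
        unlink("/mnt/test/filler");
        return 0;
    }

Since the workload file is the only thing created after this fill, any
block it gets allocated should still hold the old pattern until it is
actually written; seeing zeroes there instead is what points at the
zero-range path rather than a lost write.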


> I think that the comments from Dave are valid here as well; I am not
> necessarily sure how this situation can happen anyway. So maybe we do
> have a bug there somewhere. I guess I'll know more once I am able to
> reproduce.


Thanks,
Jayashree


