Re: Crash consistency bug in ext4 - interaction between delalloc and fzero

On Tue, Mar 13, 2018 at 10:05:57AM -0500, Jayashree Mohan wrote:
> Hi,
> Thanks for the quick reply.
> 
> >> We've encountered what seems to be a crash-consistency bug in
> >> ext4 (kernel 4.15) caused by the interaction between a delayed
> >> allocation write and an unaligned fallocate (zero range). Say we
> >> create a disk image with known data and quick-format it.
> >> 1. Now write 65K of data to a new file.
> >> 2. Zero out a part of the above file using falloc_zero_range over
> >> (60K+128) - (60K+128+4096) - an unaligned block.
> >> 3. fsync the above file
> >> 4. Crash
> >>
> >> If we crash after the fsync, and allow reordering of the block IOs
> >> between two flush/fua commands using CrashMonkey[1], then we can
> >> end up zeroing the file range from (64K+128) to 65K, which should
> >> be untouched by the fallocate command. We expect this region to
> >> contain the user-written data from step 1 above.
> >>
> >> This workload was inspired by xfstest/generic_042, which tests for
> >> stale data exposure using aligned fallocate commands. It's worth
> >> noting that f2fs and btrfs pass our test cleanly - irrespective of
> >> the order of bios, user data is intact on those filesystems.
> >>
> >> To reproduce this bug using CrashMonkey, simply run:
> >> ./c_harness -f /dev/sda -d /dev/cow_ram0 -t ext4 -e 10240 -s 1000 -v
> >> tests/generic_042/generic_042_fzero_unaligned.so
> >
> > Hmm, I do not seem to be able to reproduce this problem. However, I
> > am running in a virtual environment with a virtio disk, so that might
> > be the problem? Sorry if I am missing something; it's my first time
> > trying CrashMonkey.
> 
> By "not being able to reproduce the problem", do you mean CrashMonkey
> runs to completion and produces a summary block like the one below, but
> with all tests passing cleanly?
> 
> Reordering tests ran 1000 tests with
> passed cleanly: 936
> passed fixed: 0
> fsck required: 0
> failed: 64
> old file persisted: 0
> file missing: 0
> file data corrupted: 64
> file metadata corrupted: 0
> incorrect block count: 0
> other: 0
> 
> If not, could you tell me what the output is?
> We also run in a virtual environment - KVM or VirtualBox - so there
> shouldn't be an issue with that.

Ah, ok. The output tricked me; you're right, it does fail for me in the
same way.

> 
> 
> > Also, it's not yet clear to me how the crash could zero out the
> > entire block instead of just a part of it. Unless it was actually
> > zero before we wrote to it - in which case isn't it a lost write
> > rather than a zero-out?
> 
> Before we start the workload, we run a setup phase that fills up the
> entire disk by writing known (non-zero) data to a file and then
> unlinking it. When we run the actual workload, we want to be reusing
> those data blocks. So I am wondering whether the only way the block
> could be zeroed out is the fzero command (because if the write was
> lost, we should see stale data corresponding to the initial setup
> phase instead?)

Ok, good to know.

Thanks!
-Lukas

> 
> 
> > I think the comments from Dave are valid here as well; I am not
> > necessarily sure how this situation can happen anyway. So maybe we
> > do have a bug there somewhere. I guess I'll know more once I am able
> > to reproduce it.
> 
> 
> Thanks,
> Jayashree
