Re: ext4 fix for interaction between i_size, fallocate, and delalloc after a crash

Ashlie Martinez <ashmrtn@xxxxxxxxxx> · Thu, 30 Nov 2017 09:40:24 -0600

On Thu, Nov 30, 2017 at 9:27 AM, Theodore Ts'o <tytso@xxxxxxx> wrote:
> On Thu, Nov 30, 2017 at 08:51:45AM -0600, Ashlie Martinez wrote:
>>
>> Even though CrashMonkey *records* all the disk operations, it doesn't
>> have to replay all of them when generating crash states. For example,
>> it could choose to fully replay (and preserve the ordering of) all
>> operations before the 3rd barrier operation in a trace with 5
>> different barrier operations in it (we dub each set of operations from
>> just after the previous barrier operation up to and including the next
>> barrier operation a "disk write epoch" or "disk epoch").
>
> OK, but then it won't replay any operations after the 4th barrier
> opration, correct?  And in the case where you are stopping somewhere
> between the 3rd and the 4th replay, you will drop and reorder random
> operations after the 3rd barrier op, but before the 4th, correct?
>
> So what would be good is to understand where it stopped replaying
> operations ---- and if you can get a strace -ttt of the workload, and
> the fine-grained timestamps from block I/O trace, so we can understand
> how far we had gotten in the workload.  And then, to also to include
> the output of debugfs's "logdump -ac" command before the journal is
> replayed.
>
> From a file system developer's perspective, what is most useful is
> when we can see the minimal reproduction test case where Crashmonkey
> is only rearranging block I/O's of the last full "disk write epock".
> And getting the timestampped strace logs and block I/O logs will help
> us do that.
>
> Otherwise, we have no idea where to look for a potential problem, only
> that it's one of the several I/O commands.

These are good suggestions that I will work on adding into my current
testing flow. Right now, I have CrashMonkey dump the recorded bios to
a file in binary (so it can be hexdumped) and also record the bios
that were used to generate each crash state it tests.

>
>
> Anyway, looking back at your original question, is your question why
> the first write isn't delay allocated?  That's because the
> collapse_range operation needs to resolve any delayed allocation
> writes on the portion of the extent tree which will be affected by the
> collapse_range operation.  See the calls to
> filemap_write_and_wait_range() in ext4_collapse_range().

I also ran a few more tests on this, it appears that when fsx does
mapwrite operations it calls msync after mmap (for obvious reasons).
Adding the mmap/msync operations to the end of my CrashMonkey workload
produced output similar to Amir's reported output, but with slightly
different values for i_blocks. There may be another interaction there
that we hadn't thought of based on this information. The top fsck
output below is from CrashMonkey with mmap/msync at the end of the
workload. The fsck output below that is from Amir's original post
about the bug. Note that my test is running on a much smaller file
system than Amir's, so the logical/physical blocks will differ.

e2fsck 1.42.13 (17-May-2015)
Pass 1: Checking inodes, blocks, and sizes
Inode 12, end of extent exceeds allowed value
  (logical block 131, physical block 8465, len 11)
Clear? yes

Inode 12, i_blocks is 168, should be 146.  Fix? yes

e2fsck 1.42.13 (17-May-2015)
Pass 1: Checking inodes, blocks, and sizes
Inode 12, end of extent exceeds allowed value
       (logical block 33, physical block 33441, len 7)
Clear? no
Inode 12, i_blocks is 184, should be 128.  Fix? no

>
> Note that if you want to try to understand what is going on, there are
> a large number of ext4 and jbd2 tracepoints.  Enabling these
> tracepoints (you may need to omit some of the much more chatty jbd2
> trace points from the ones that you enable) and dumping those
> timestamps alongside the strace -ttt and block I/O timestamps should
> be especially illuminating.
>
> Cheers,
>
>                                                 - Ted
>
> P.S.  Another set of tracepoints that might be useful for
> understanding when delayed write allocations are getting resolved are
> the writeback tracepoints --- although you can probably infer those
> from the ext4_writepages traces, since when the writeback is triggered
> this will trigger calls to ext4_writepages.
>