On Thu, Nov 30, 2017 at 9:27 AM, Theodore Ts'o <tytso@xxxxxxx> wrote: > On Thu, Nov 30, 2017 at 08:51:45AM -0600, Ashlie Martinez wrote: >> >> Even though CrashMonkey *records* all the disk operations, it doesn't >> have to replay all of them when generating crash states. For example, >> it could choose to fully replay (and preserve the ordering of) all >> operations before the 3rd barrier operation in a trace with 5 >> different barrier operations in it (we dub each set of operations from >> just after the previous barrier operation up to and including the next >> barrier operation a "disk write epoch" or "disk epoch"). > > OK, but then it won't replay any operations after the 4th barrier > opration, correct? And in the case where you are stopping somewhere > between the 3rd and the 4th replay, you will drop and reorder random > operations after the 3rd barrier op, but before the 4th, correct? > > So what would be good is to understand where it stopped replaying > operations ---- and if you can get a strace -ttt of the workload, and > the fine-grained timestamps from block I/O trace, so we can understand > how far we had gotten in the workload. And then, to also to include > the output of debugfs's "logdump -ac" command before the journal is > replayed. > > From a file system developer's perspective, what is most useful is > when we can see the minimal reproduction test case where Crashmonkey > is only rearranging block I/O's of the last full "disk write epock". > And getting the timestampped strace logs and block I/O logs will help > us do that. > > Otherwise, we have no idea where to look for a potential problem, only > that it's one of the several I/O commands. These are good suggestions that I will work on adding into my current testing flow. Right now, I have CrashMonkey dump the recorded bios to a file in binary (so it can be hexdumped) and also record the bios that were used to generate each crash state it tests. > > > Anyway, looking back at your original question, is your question why > the first write isn't delay allocated? That's because the > collapse_range operation needs to resolve any delayed allocation > writes on the portion of the extent tree which will be affected by the > collapse_range operation. See the calls to > filemap_write_and_wait_range() in ext4_collapse_range(). I also ran a few more tests on this, it appears that when fsx does mapwrite operations it calls msync after mmap (for obvious reasons). Adding the mmap/msync operations to the end of my CrashMonkey workload produced output similar to Amir's reported output, but with slightly different values for i_blocks. There may be another interaction there that we hadn't thought of based on this information. The top fsck output below is from CrashMonkey with mmap/msync at the end of the workload. The fsck output below that is from Amir's original post about the bug. Note that my test is running on a much smaller file system than Amir's, so the logical/physical blocks will differ. e2fsck 1.42.13 (17-May-2015) Pass 1: Checking inodes, blocks, and sizes Inode 12, end of extent exceeds allowed value (logical block 131, physical block 8465, len 11) Clear? yes Inode 12, i_blocks is 168, should be 146. Fix? yes e2fsck 1.42.13 (17-May-2015) Pass 1: Checking inodes, blocks, and sizes Inode 12, end of extent exceeds allowed value (logical block 33, physical block 33441, len 7) Clear? no Inode 12, i_blocks is 184, should be 128. Fix? no > > Note that if you want to try to understand what is going on, there are > a large number of ext4 and jbd2 tracepoints. Enabling these > tracepoints (you may need to omit some of the much more chatty jbd2 > trace points from the ones that you enable) and dumping those > timestamps alongside the strace -ttt and block I/O timestamps should > be especially illuminating. > > Cheers, > > - Ted > > P.S. Another set of tracepoints that might be useful for > understanding when delayed write allocations are getting resolved are > the writeback tracepoints --- although you can probably infer those > from the ext4_writepages traces, since when the writeback is triggered > this will trigger calls to ext4_writepages. >