On Thu, Nov 30, 2017 at 9:40 AM, Ashlie Martinez <ashmrtn@xxxxxxxxxx> wrote: > On Thu, Nov 30, 2017 at 9:27 AM, Theodore Ts'o <tytso@xxxxxxx> wrote: >> On Thu, Nov 30, 2017 at 08:51:45AM -0600, Ashlie Martinez wrote: >>> >>> Even though CrashMonkey *records* all the disk operations, it doesn't >>> have to replay all of them when generating crash states. For example, >>> it could choose to fully replay (and preserve the ordering of) all >>> operations before the 3rd barrier operation in a trace with 5 >>> different barrier operations in it (we dub each set of operations from >>> just after the previous barrier operation up to and including the next >>> barrier operation a "disk write epoch" or "disk epoch"). >> >> OK, but then it won't replay any operations after the 4th barrier >> opration, correct? And in the case where you are stopping somewhere >> between the 3rd and the 4th replay, you will drop and reorder random >> operations after the 3rd barrier op, but before the 4th, correct? >> >> So what would be good is to understand where it stopped replaying >> operations ---- and if you can get a strace -ttt of the workload, and >> the fine-grained timestamps from block I/O trace, so we can understand >> how far we had gotten in the workload. And then, to also to include >> the output of debugfs's "logdump -ac" command before the journal is >> replayed. It appears that the statements I added to provide extra debugging have changed the timing on things and now I can't seem to get the same output. I will keep poking at this later when I have more free time and send you anything interesting that I find. Now that I know what format to send material in, it should make things a little easier. >> >> From a file system developer's perspective, what is most useful is >> when we can see the minimal reproduction test case where Crashmonkey >> is only rearranging block I/O's of the last full "disk write epock". >> And getting the timestampped strace logs and block I/O logs will help >> us do that. >> >> Otherwise, we have no idea where to look for a potential problem, only >> that it's one of the several I/O commands. > > These are good suggestions that I will work on adding into my current > testing flow. Right now, I have CrashMonkey dump the recorded bios to > a file in binary (so it can be hexdumped) and also record the bios > that were used to generate each crash state it tests. > >> >> >> Anyway, looking back at your original question, is your question why >> the first write isn't delay allocated? That's because the >> collapse_range operation needs to resolve any delayed allocation >> writes on the portion of the extent tree which will be affected by the >> collapse_range operation. See the calls to >> filemap_write_and_wait_range() in ext4_collapse_range(). > > I also ran a few more tests on this, it appears that when fsx does > mapwrite operations it calls msync after mmap (for obvious reasons). > Adding the mmap/msync operations to the end of my CrashMonkey workload > produced output similar to Amir's reported output, but with slightly > different values for i_blocks. There may be another interaction there > that we hadn't thought of based on this information. The top fsck > output below is from CrashMonkey with mmap/msync at the end of the > workload. The fsck output below that is from Amir's original post > about the bug. Note that my test is running on a much smaller file > system than Amir's, so the logical/physical blocks will differ. > > e2fsck 1.42.13 (17-May-2015) > Pass 1: Checking inodes, blocks, and sizes > Inode 12, end of extent exceeds allowed value > (logical block 131, physical block 8465, len 11) > Clear? yes > > Inode 12, i_blocks is 168, should be 146. Fix? yes > > > e2fsck 1.42.13 (17-May-2015) > Pass 1: Checking inodes, blocks, and sizes > Inode 12, end of extent exceeds allowed value > (logical block 33, physical block 33441, len 7) > Clear? no > Inode 12, i_blocks is 184, should be 128. Fix? no > > >> >> Note that if you want to try to understand what is going on, there are >> a large number of ext4 and jbd2 tracepoints. Enabling these >> tracepoints (you may need to omit some of the much more chatty jbd2 >> trace points from the ones that you enable) and dumping those >> timestamps alongside the strace -ttt and block I/O timestamps should >> be especially illuminating. >> >> Cheers, >> >> - Ted >> >> P.S. Another set of tracepoints that might be useful for >> understanding when delayed write allocations are getting resolved are >> the writeback tracepoints --- although you can probably infer those >> from the ext4_writepages traces, since when the writeback is triggered >> this will trigger calls to ext4_writepages. >>