On Mon, Oct 16, 2017 at 10:32 PM, Ashlie Martinez <ashmrtn@xxxxxxxxxx> wrote:
> Amir,
>
> I know this is a bit late, but I've spent some time working through
> the disk image that you provided (so that I could determine how/if I
> could modify CrashMonkey to catch errors like this) and I don't think
> I understand what state the disk image reflects.

The disk image SHOULD reflect the state of a disk after the power was
cut while the filesystem was mounted. Then power came back on, the
filesystem was mounted, the journal was recovered, and the filesystem
was cleanly unmounted. At this stage, I don't expect there to be
anything interesting in the journal.

> After digging around
> the journal of the disk image you provided, I found that the first 10
> journal blocks are used, with the journal superblock being placed in
> the very first block of the journal. The journal superblock says that
> the first journal transaction ID that should be in the journal is
> transaction ID 4. However, dumping the other journal blocks, I found
> that the next block is a descriptor block for transaction ID 2. The
> rest of the journal blocks are data blocks for that transaction plus a
> transaction commit block. This seems a little odd considering that the
> journal refers to a 4th transaction, which I have not been able to
> find (I quickly dumped the first 50 blocks in debugfs and found the
> rest to contain only zeros).
>

I did not spend time analyzing the image, so I'll take your word for it,
but I can't help you understand your findings.

> With this in mind, I looked back at the xfstests code for controlling
> the dm_flakey device. What I realized is the `nolockfs` flag is
> provided both when it switches from the real device to the flakey
> device that drops writes and when it switches from the flakey device
> back to the real device.
> I know there is a call to umount once the
> flakey device that drops writes is inserted, but do you think it is
> possible that the flakey device is swapped back to the real device
> before all the writes forced out by umount have made it to the flakey
> device?

I believe the umount call should block until all writes have been
flushed out to the flakey device.

> Unfortunately I still don't have a local machine that is
> capable of reproducing your test results and I have not made any gce
> test appliance images to test this yet, so I'm not sure if this is a
> valid theory.
>

Ted explained that the bug is related to very specific timing of the
flusher thread vs. the fallocate thread. I was under the impression that
CrashMonkey can only reorder writes between recorded FLUSH requests, so
I am not really sure how you intend to modify CrashMonkey to catch this
bug.

Cheers,
Amir.