On Tue, Nov 28, 2017 at 03:27:47PM -0600, Ashlie Martinez wrote:
>
> Unfortunately this timing bug only reproduces on some machines. Xiao
> and I have been unable to reproduce this bug (I've tried kvm-xfstests,
> my own kvm VMs, VMs without kvm, VMs with/without virtio drivers, and
> another bare metal system). generic/456 basically sets up a race
> condition between a kernel flusher thread and triggering dm-flakey, so
> I think things like system load, core count, etc. might cause
> different test results.

Hmm, now I remember the details.  It reproduced reliably on
gce-xfstests, but I was able to use kvm-xfstests to debug the problem
(by invoking debugfs to dump the file system state, as I had
described).

That's because debugfs operates on the buffer cache, and before the
jbd2 commit, the changes to the inode structure are in the buffer
cache, but they aren't allowed to be persisted on disk until after the
journal commit.  And I was using debugfs to dump the inode's extent
tree (as it exists in the buffer cache) before triggering dm-flakey.

Now that we understand what is happening, it should be simple to
adjust the test so it reliably reproduces, by adding a "sleep 6"
before _flakey_drop_and_remount.  Since the delayed allocation write
won't get resolved until 30 seconds after the inode was first dirtied,
and the default jbd2 timer value is 5 seconds, this should guarantee
that the jbd2 commit has taken place, so that the inode changes made
by fallocate are persisted to the journal, while still allowing the
delayed allocation write to remain unresolved.

Cheers,

					- Ted
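
As a rough illustration of the timing window described above, here is a
minimal shell sketch.  It assumes the two defaults quoted in this message
(a 5 second jbd2 commit timer and a 30 second delay before the delayed
allocation write is resolved); the helper name sleep_in_window is
hypothetical and not part of xfstests.  It only checks that a chosen sleep
falls strictly between the two timers.

```shell
#!/bin/bash
# Hypothetical helper (not part of xfstests): succeeds if a sleep of $1
# seconds falls strictly between the jbd2 commit interval ($2, default 5)
# and the delayed allocation writeback delay ($3, default 30).
sleep_in_window() {
    local sleep_s=$1
    local commit_s=${2:-5}
    local writeback_s=${3:-30}
    [ "$sleep_s" -gt "$commit_s" ] && [ "$sleep_s" -lt "$writeback_s" ]
}

if sleep_in_window 6; then
    echo "sleep 6: journal commit done, delalloc write still pending"
fi

# Sketch of the proposed change in generic/456 (left commented out because
# it needs the xfstests environment to run):
#   sleep 6
#   _flakey_drop_and_remount
```

If either timer is tuned away from its default on the test machine, the
sleep would need to be adjusted to stay inside the window.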