On Wed, Aug 27, 2014 at 06:49:22PM +1000, Dave Chinner wrote:
> On Tue, Aug 26, 2014 at 10:27:40AM -0700, Zach Brown wrote:
> > On Tue, Aug 26, 2014 at 12:05:11PM -0400, Jeff Moyer wrote:
> > > Benjamin LaHaise <bcrl@xxxxxxxxx> writes:
> > >
> > > > Does someone already have a simple test case we can add to the libaio test
> > > > suite to verify this behaviour?
> > >
> > > I can't reproduce this problem using a loop device, which is what the
> > > libaio test suite uses. Even when using real hardware, you have to have
> > > disks that are slow enough in order for this to trigger reliably (or
> > > at all).
> >
> > I wonder if you could use something like dm suspend to abuse indefinite
> > latencies.
> >
> > > I could write a more targeted test within xfstests, but I don't think
> > > that's strictly necessary (it would just make it more clear what the
> > > expectations are, and maybe bump the hit rate percentage up).
> >
> > I think it'd be worth it (he says, not commiting *his* time). It would
> > have been nice if a targeted test helped Dave raise the alarm
> > immediately rather than gnaw away at his brain with inconsistent mostly
> > unrelated failures for months.
>
> I'm not sure it's worth the effort. now we have two tests that have
> triggered the same problem, I've been easily able to reproduce it
> with 2 VMs with test/scratch image files sharing the same spindle.
> i.e. run xfstests in one VM, run generic/323 in the other VM, and
> it reproduces fairly easily.
>
> I'm just running it in a loop now to measure how successfully I'm
> reproducing the problem, then I'll apply the fix and see if it gets
> better. If it does get better, then I'll keep the patch around
> locally until it is upstream, and then I'll shout whenever I see
> this problem occur again....

Ok, so of 32 executions in a tight loop of generic/323, only 5
executions passed while 27 failed. With the patch suggested, it
failed the first 5 executions, so I don't think it fixes the
problem.

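For anyone who wants to measure a hit rate the same way, a tight loop
like the one described above could be sketched roughly as follows. This
is only an illustration, not the script actually used: the helper name
run_repeatedly is made up, and it assumes the command's exit status is
non-zero whenever the test fails (for xfstests that would mean running
something like ./check generic/323 from the top of an xfstests checkout).

```python
import subprocess
import sys


def run_repeatedly(cmd, runs):
    """Run cmd `runs` times and return (passed, failed).

    A run counts as passed when the command exits with status 0.
    Assumption: the test harness signals failure via its exit status.
    """
    passed = failed = 0
    for _ in range(runs):
        # Discard output; only the exit status matters for the tally.
        result = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        if result.returncode == 0:
            passed += 1
        else:
            failed += 1
    return passed, failed


# Hypothetical usage from an xfstests checkout (not executed here):
#   passed, failed = run_repeatedly(["./check", "generic/323"], 32)
#   print(f"{passed} passed, {failed} failed")
```

Running the same loop before and after applying a candidate patch gives
two tallies to compare, which is all the measurement above amounts to.
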
BTW, generic/323 is pulling 8,000 read IOPS and 500MB/s from my
single spindle. Methinks that the test file is resident in the BBWC
on the RAID controller, which may be why nobody else is reproducing
this problem....

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx