On Fri, Oct 02, 2015 at 11:49:41AM -0600, Ross Zwisler wrote:
> Recently I've been trying to get a stable baseline for my DAX testing using
> various filesystems, and in doing so I noticed a pair of tests that were
> behaving badly when run on XFS without DAX. These test failures happen in
> both v4.2 and v4.3-rc3, though the signatures may vary a bit.
>
> My testing setup is a kvm virtual machine with 8 GiB of its 16GiB of memory
> reserved for PMEM using the mmap parameter (memmap=8G!8G) and with the
> CONFIG_X86_PMEM_LEGACY config option enabled. I've attached my full kernel
> config to this mail.
>
> The first test failure is generic/299, which consistently deadlocks in the XFS
> code in both v4.2 and v4.3-rc3. The stack traces presented in dmesg via "echo
> w > /proc/sysrq-trigger" are consistent between these two kernel versions, and
> can be found in the "generic_299.deadlock" attachment.

Yes, we've recently identified an AGF locking order problem on an older
kernel that this looks like. We haven't found the root cause of it
yet, but it's good to know that generic/299 seems to reproduce it.
I'll run that in a loop to see if I can get it to fail here...

> The second test failure is xfs/083, which in v4.2 seems to fail with an XFS
> assertion (I have XFS_DEBUG turned on):
>
> XFS: Assertion failed: fs_is_ok, file: fs/xfs/libxfs/xfs_dir2_data.c, line: 168

No surprise:

    $ grep 083 tests/xfs/group
    083 dangerous_fuzzers
    $

Yup, it's expected to trigger corruptions, and when a CONFIG_XFS_DEBUG=y
kernel triggers a corruption warning it triggers an ASSERT failure
to allow debugging. That particular corruption is being detected in
the /block validation function/ that is run to detect corruptions in
directory data blocks as they are read from disk
(__xfs_dir3_data_check).

Any test that is not in the auto group is not expected to work
reliably as a regression test.
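The group lookup can be scripted. What follows is only a minimal sketch, assuming the group file keeps the "test number followed by group names" layout that the grep output above shows. The helper names test_groups and in_auto_group are hypothetical and not part of xfstests.

```shell
# Hypothetical helpers, not part of xfstests. They assume each line of the
# group file looks like "083 dangerous_fuzzers": a test number followed by
# one or more space-separated group names.

# Print the groups listed for a test.
# $1 = group file (e.g. tests/xfs/group), $2 = test number (e.g. 083)
test_groups() {
    # Appending "" forces a string comparison, so "083" does not match "83".
    awk -v t="$2" '($1 "") == (t "") { $1 = ""; sub(/^ +/, ""); print }' "$1"
}

# Succeed only if the test is listed in the "auto" group.
in_auto_group() {
    test_groups "$1" "$2" | tr ' ' '\n' | grep -qx auto
}
```

With the group entry quoted above, "in_auto_group tests/xfs/group 083" would return non-zero, because 083 is listed only under dangerous_fuzzers.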
And many are actively dangerous like this and will crash/panic
machines when they hit whatever problem they were written to
exercise. For regression test purposes, the test groups to run are:

    # check -g quick

for a fast smoke test, and

    # check -g auto

to run all the tests that should function correctly as regression
tests.

> In v4.3, though, this same test seems to create some random memory corruption
> in XFS. I've hit at least two failure signatures that look nothing alike
> except they both look like somebody corrupted memory.

There's no memory corruption evident. The hexdumps are of disk
buffers and, well, they've been fuzzed by the test...

> [ 53.636917] run fstests xfs/083 at 2015-10-02 11:24:09
> [ 53.760098] XFS (pmem0p2): Unmounting Filesystem
> [ 53.779642] XFS (pmem0p2): Mounting V4 Filesystem

You're using v4 XFS filesystems. It's only valid to use CRC enabled
XFS filesystems ("V5 filesystems") on pmem devices so we can detect
torn sector writes correctly. I'd suggest upgrading xfsprogs to the
latest (v4.2.0) as it defaults to creating CRC enabled filesystems.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx