On Sat, Oct 03, 2015 at 08:34:02AM +1000, Dave Chinner wrote:
> On Fri, Oct 02, 2015 at 11:49:41AM -0600, Ross Zwisler wrote:
> > Recently I've been trying to get a stable baseline for my DAX testing
> > using various filesystems, and in doing so I noticed a pair of tests
> > that were behaving badly when run on XFS without DAX. These test
> > failures happen in both v4.2 and v4.3-rc3, though the signatures may
> > vary a bit.
> >
> > My testing setup is a kvm virtual machine with 8 GiB of its 16 GiB of
> > memory reserved for PMEM using the memmap kernel parameter
> > (memmap=8G!8G) and with the CONFIG_X86_PMEM_LEGACY config option
> > enabled. I've attached my full kernel config to this mail.
> >
> > The first test failure is generic/299, which consistently deadlocks in
> > the XFS code in both v4.2 and v4.3-rc3. The stack traces presented in
> > dmesg via "echo w > /proc/sysrq-trigger" are consistent between these
> > two kernel versions, and can be found in the "generic_299.deadlock"
> > attachment.
>
> Yes, we've recently identified an AGF locking order problem on an
> older kernel that this looks like. We haven't found the root cause
> of it yet, but it's good to know that generic/299 seems to reproduce
> it. I'll run that in a loop to see if I can get it to fail here...

First off, a quick rundown of where we're at so far:

- The deadlock occurs because a dio/aio write attempts a reverse-ordered
  AGF lock (e.g., lock agf3 -> lock agf0, assuming agcount == 4) and
  races with a truncate doing the opposite (lock agf0 -> lock agf3).

- The reverse-ordered AGF lock occurs in xfs_alloc_vextent() because
  xfs_alloc_space_available() (via xfs_alloc_fix_freelist()) indicates
  that an AG can support an allocation, but the allocation subsequently
  fails in xfs_alloc_fix_minleft(). This causes the higher level code in
  xfs_alloc_vextent() to wrap around to AG 0 when it should have either
  not locked agf3 or allocated successfully.
I pointed out that xfs_alloc_space_available() appears to incorporate
agf_flcount whereas xfs_alloc_fix_minleft() does not. Dave subsequently
pointed out that the flcount is factored out of the former via the
'min_free' parameter, so this calculation is actually consistent with
respect to the free list count. The calculation in question is:

	/* do we have enough free space remaining for the allocation? */
	available = (int)(pag->pagf_freeblks + pag->pagf_flcount -
			  min_free - args->total);
	if (available < (int)args->minleft)
		return false;

In the test case where I reproduce the hang, pag->pagf_flcount ==
min_free, which means that no free list fixups are needed and we can
simplify things by cancelling those two fields out. This means we are
basically checking:

	if (pagf_freeblks - args->total < args->minleft)

In my test case, pagf_freeblks == 3, args->total == 0 and
args->minleft == 3.

args->total in this context gives the impression that it is the block
count of the requested allocation, but it is not always used in this
manner. Indeed, the immediately prior extent length check uses
args->minlen, which is effectively the requested extent length. FWIW,
incorporating args->minlen into this check also seems to resolve the
hang, but that leads to other questions with regard to args->total.

My impression is that args->total is something more like a prediction of
how many blocks off the free list an allocation might require, but it's
really not clear to me because 1.) it's only used in this calculation,
2.) it doesn't appear to be used consistently by the callers, and 3.)
the comments above it and on the data structure definition aren't very
clear.

With regard to the callers, the args->total field makes its way down
from the associated xfs_bmapi_write() parameter, passed as 0 from
xfs_iomap_write_direct(). The other iomap callers pass 1 (the actual
extent length is a separate param).
It's not really clear to me what the difference is between the direct
case and the others here, but FWIW, even just passing 1 in the direct
case is enough to avoid the deadlock (though that obviously doesn't
make it correct).

The xfs_symlink() case calculates and passes two separate block counts:
fs_blocks is presumably the extent length required for the actual
symlink, and resblks is the overall block count reservation for the
transaction (incorporating blocks to allocate a new inode chunk, if
necessary, etc.). This xfs_bmapi_write() call passes fs_blocks as the
extent length and resblks as "total." resblks is updated as
sub-operations occur, so this usage appears to be a mechanism that
protects the individual allocations against the overall requirement of
the transaction. Note that resblks incorporates fs_blocks, so it also
covers the length of the current allocation request and is inconsistent
with the iomap usage.

The xfs_qm_dqalloc() and xfs_dir_*() callers (via da_args) appear to
similarly pass along a resblks value that incorporates the overall
extent length. In contrast, the call from xfs_alloc_file_space() passes
0, similar to the xfs_iomap_write_direct() case.

Furthermore, the xfs_bmap_btalloc() code can set (or reset) args.total
to minlen internally based on a free list flag. Given the documentation
for xbf_low, perhaps this code assumes args.total > args.minlen and is
effectively "resetting" it to the minimum length? If so, that suggests
that the callers passing 0 or 1 are the incorrect ones and should be
fixed to incorporate the extent length..?

I need to stare at this more, but at the moment I can only speculate as
to what the appropriate use of this field is. Sorry for the long mail.
I wanted to get this itemized and hopefully fill in any gaps or
misunderstandings along the way. Thoughts..?
Brian

> > The second test failure is xfs/083, which in v4.2 seems to fail with
> > an XFS assertion (I have XFS_DEBUG turned on):
> >
> > XFS: Assertion failed: fs_is_ok, file: fs/xfs/libxfs/xfs_dir2_data.c, line: 168
>
> No surprise:
>
> $ grep 083 tests/xfs/group
> 083 dangerous_fuzzers
> $
>
> Yup, it's expected to trigger corruptions, and when a
> CONFIG_XFS_DEBUG=y kernel triggers a corruption warning it triggers
> an ASSERT failure to allow debugging. That particular corruption is
> being detected in the /block validation function/ that is run to
> detect corruptions in directory data blocks as they are read from
> disk (__xfs_dir3_data_check).
>
> Any test that is not in the auto group is not expected to work
> reliably as a regression test. And many are actively dangerous like
> this one and will crash/panic machines when they hit whatever problem
> they were written to exercise. For regression test purposes, the
> test groups to run are:
>
> # check -g quick
>
> for a fast smoke test, and
>
> # check -g auto
>
> to run all the tests that should function correctly as regression
> tests.
>
> > In v4.3, though, this same test seems to create some random memory
> > corruption in XFS. I've hit at least two failure signatures that
> > look nothing alike except that they both look like somebody
> > corrupted memory.
>
> There's no memory corruption evident. The hexdumps are of disk
> buffers and, well, they've been fuzzed by the test...
>
> > [   53.636917] run fstests xfs/083 at 2015-10-02 11:24:09
> > [   53.760098] XFS (pmem0p2): Unmounting Filesystem
> > [   53.779642] XFS (pmem0p2): Mounting V4 Filesystem
>
> You're using V4 XFS filesystems. It's only valid to use CRC-enabled
> XFS filesystems ("V5 filesystems") on pmem devices so we can detect
> torn sector writes correctly.
>
> I'd suggest upgrading xfsprogs to the latest (v4.2.0), as it
> defaults to creating CRC-enabled filesystems.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@xxxxxxxxxxxxx

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs