Re: BUG in xfs_trans_binval

On Tue, Mar 29, 2016 at 07:15:53PM +0200, Olaf Hering wrote:
> During receiving a backup stream (netcat -l 12345 | tar xf -) the host
> crashed and rebooted, no idea why.

That's the likely cause of your problems, because....

> After reboot I tried to remove the received directory (rm -rf dir) and
> got this BUG:
> 
> "_xfs_buf_find: Block out of range: block 0x81ffff3f8, EOFS 0x7fffd000"

This will be caused by a corrupt block....

> Kernel is 4.5.0 from openSUSE Tumbleweed.
> dmesg is attached, I just realized it has the backtrace.
> 
> [    1.883626] sd 4:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA

So, write cache is enabled on the drive.

> [   20.083397] XFS (sdb1): Mounting V5 Filesystem
> [   20.291900] XFS (sdb1): Starting recovery (logdev: internal)
> [   25.285846] XFS (sdb1): Bad dir block magic!
> [   26.448027] XFS (sdb1): Ending recovery (logdev: internal)

And that's a big clue that something went badly wrong at the storage
level. Basically, after recovering a buffer from the log, it had an
invalid magic number for the type of buffer information being
recovered. In this case, the journal entry being recovered was for a
directory block in "single block" format. The magic number found in
the block after recovery of the transaction was not that of a
directory block in single block format.
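
Roughly what that check amounts to, as a minimal userspace sketch (the
magic values are the real on-disk ones; the function and its use here
are illustrative, not the actual recovery code):

/*
 * Minimal sketch of the sanity check: after the logged copy of a
 * "single block" directory buffer has been replayed, the first four
 * bytes of the block must be a directory block magic. The magic
 * values are the on-disk ones ("XD2B" for v4, "XDB3" for v5
 * filesystems); the rest is illustrative, not kernel code.
 */
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <arpa/inet.h>		/* ntohl(): the on-disk format is big-endian */

#define XFS_DIR2_BLOCK_MAGIC	0x58443242	/* "XD2B" */
#define XFS_DIR3_BLOCK_MAGIC	0x58444233	/* "XDB3" */

static int dir_block_magic_ok(const void *blk)
{
	uint32_t magic;

	memcpy(&magic, blk, sizeof(magic));	/* magic is the first field */
	magic = ntohl(magic);
	return magic == XFS_DIR2_BLOCK_MAGIC ||
	       magic == XFS_DIR3_BLOCK_MAGIC;
}

int main(void)
{
	unsigned char buf[4] = { 'X', 'D', 'B', '3' };	/* pretend recovered block */

	if (!dir_block_magic_ok(buf))
		printf("Bad dir block magic!\n");
	else
		printf("dir block magic OK\n");
	return 0;
}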

The only way this can happen is if there is an underlying corruption
in the block prior to recovery starting. Given that the system
crashed and rebooted, it's entirely possible that initialisation of
the block never made it to persistent storage, or it was corrupted
on the way to disk by whatever caused the crash and reboot.

> [  130.489414] XFS (sdb1): _xfs_buf_find: Block out of range: block 0x81ffff3f8, EOFS 0x7fffd000 
> [  130.494271] XFS (sdb1): _xfs_buf_find: Block out of range: block 0x81ffff3f8, EOFS 0x7fffd000 

These occur because a bad sector address is being detected.
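
For illustration, the bounds check behind that message looks roughly
like this (the helper name and the assumed 4k block size are mine; the
numbers are from the log):

/*
 * Rough sketch of the range check: a buffer's disk address is in
 * 512-byte basic blocks and must lie below the end of the filesystem
 * ("EOFS"). Assumes 4k filesystem blocks, i.e. a shift of 3 from
 * blocks to sectors; not the actual kernel code.
 */
#include <stdint.h>
#include <stdio.h>

static int daddr_in_range(int64_t blkno, uint64_t dblocks, int blkbb_log)
{
	uint64_t eofs = dblocks << blkbb_log;	/* fs size in 512-byte sectors */

	if (blkno < 0 || (uint64_t)blkno >= eofs) {
		printf("Block out of range: block 0x%llx, EOFS 0x%llx\n",
		       (unsigned long long)blkno, (unsigned long long)eofs);
		return 0;
	}
	return 1;
}

int main(void)
{
	/* EOFS 0x7fffd000 sectors with 4k blocks => 0xffffa00 fs blocks */
	daddr_in_range(0x81ffff3f8LL, 0xffffa00ULL, 3);
	return 0;
}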

> [  130.489707]  [<ffffffff81395921>] dump_stack+0x63/0x82
> [  130.489715]  [<ffffffff8107d912>] warn_slowpath_common+0x82/0xc0
> [  130.489722]  [<ffffffff8107da0a>] warn_slowpath_null+0x1a/0x20
> [  130.489766]  [<ffffffffa0941b80>] _xfs_buf_find+0x350/0x3b0 [xfs]
> [  130.489824]  [<ffffffffa0941c0a>] xfs_buf_get_map+0x2a/0x2c0 [xfs]
> [  130.489876]  [<ffffffffa097026a>] xfs_trans_get_buf_map+0x11a/0x1c0 [xfs]
> [  130.489923]  [<ffffffffa0919040>] xfs_btree_get_bufs+0x50/0x60 [xfs]
> [  130.489961]  [<ffffffffa090283f>] xfs_alloc_fix_freelist+0x20f/0x3c0 [xfs]

And this location generating the out-of-range disk address indicates
that there may be a bad block number on the AGFL.
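
To spell out the path from the AGFL to that address, here is a
simplified sketch of the address arithmetic only (the helper and the
geometry below are made up, not the real kernel macros):

/*
 * How an AG-relative block number pulled off the AGFL becomes the
 * disk address handed to the buffer cache: linearise (AG number,
 * AG block) into filesystem blocks, then shift into 512-byte
 * sectors. If the agbno stored on the AGFL is garbage, the resulting
 * daddr is garbage too, and the buffer cache range check is what
 * finally trips over it.
 */
#include <stdint.h>
#include <stdio.h>

static uint64_t agb_to_daddr(uint64_t agno, uint64_t agbno,
			     uint64_t agblocks, int blkbb_log)
{
	uint64_t fsbno = agno * agblocks + agbno;	/* fs-wide block number */

	return fsbno << blkbb_log;			/* blocks -> sectors */
}

int main(void)
{
	/* hypothetical geometry: a corrupt agbno lands way past EOFS */
	printf("daddr = 0x%llx\n",
	       (unsigned long long)agb_to_daddr(2, 0xdeadbeef, 0x3ffffc, 3));
	return 0;
}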

Given that none of these have triggered verifier failures on read
from disk, it makes me think that whatever has gone wrong in this
filesystem occurred before the crash and reboot, and smells somewhat
of memory corruption and/or misdirected writes.

Given that xfs_repair didn't warn about blocks on the AGFL being out
of range (which is checked), nor any other metadata linkage in the
filesystem pointing to a block out of range, nor did it warn about
directory blocks being corrupted or having invalid formats, this
is starting to look like an in-memory problem. Perhaps there is
still memory corruption occurring - the block out of range has a
single high bit set that puts it out of range. i.e. when we mask off
the single bit that is out of range, 0x1ffff3f8 is a valid sector
address.
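
To make that single-bit observation concrete (plain arithmetic,
nothing XFS-specific):

/*
 * The bad address differs from a valid one by exactly one set bit:
 * clearing bit 35 of 0x81ffff3f8 leaves 0x1ffff3f8, which is well
 * below EOFS (0x7fffd000). A single flipped bit like this points at
 * in-memory corruption rather than garbage written to disk.
 */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint64_t bad    = 0x81ffff3f8ULL;	/* address that triggered the warning */
	uint64_t eofs   = 0x7fffd000ULL;	/* end of filesystem, in sectors */
	uint64_t masked = bad & ~(1ULL << 35);	/* clear the one high bit */

	printf("bad    = 0x%llx, in range: %s\n",
	       (unsigned long long)bad, bad < eofs ? "yes" : "no");
	printf("masked = 0x%llx, in range: %s\n",
	       (unsigned long long)masked, masked < eofs ? "yes" : "no");
	return 0;
}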

Can you run a memory tester on the machine?

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx



