On Sun, Mar 13, 2022 at 04:47:19PM +0100, Manfred Spraul wrote:
> Hello together,
>
> after a simulated power failure, I have observed:
>
> >>>
> Metadata CRC error detected at xfs_dir3_block_read_verify+0x9e/0xc0 [xfs], xfs_dir3_block block 0x86f58
> [14768.047531] XFS (loop0): Unmount and run xfs_repair
> [14768.047534] XFS (loop0): First 128 bytes of corrupted metadata buffer:
> [14768.047537] 00000000: 58 44 42 33 9f ab d7 f4 00 00 00 00 00 08 6f 58  XDB3..........oX

For future reference, please paste the entire log message, from the
time that the fs was mounted to the end of the hexdump output. You
might not think the hexdump output is important, but as you'll see
later....

> <<<
>
> Is this a known issue?

Is what a known issue? All this is XFS finding a corrupt metadata
block because a CRC is invalid, which is exactly what it's supposed
to do. As it is, CRC errors are indicative of a storage problem such
as bit errors and torn writes, because what has been read from disk
does not match what XFS wrote when it calculated the CRC.

> The image file is here:
> https://github.com/manfred-colorfu/nbd-datalog-referencefiles/blob/main/xfs-02/result/data-1821799.img.xz
>
> As first question:
>
> Are 512 byte sectors supported, or does xfs assume that 4096 byte
> writes are atomic?

512 byte *IO* is supported on devices that have 512 byte sector
support, but there are other rules that XFS sets for metadata. e.g.
that metadata writes are expected to be written completely or
replayed completely as a whole unit regardless of their length.

This is bookended by the use of cache flushes and FUAs to ensure
that multi-sector writes are wholly completed before the recovery
information is tossed away. If a cache flush has not been issued,
then the metadata block recovery information is whole in the
journal, and so if we crash or lose power then journal recovery
replays the changes and overwrites whatever is on the disk with the
correct, consistent metadata.

Log recovery will also issue large writes, and cache flushes will
occur as part of the process so that the recovered metadata is whole
on stable storage before it is removed from the journal.

IOWs, if the storage ends up doing a partial write as a result of a
power failure, log recovery fixes that up if it is still in the
journal. If it is not in the journal then a cache flush *must* have
happened, and hence the metadata is complete on disk.

So....

> How were the power failures simulated:
>
> I added support to nbd to log all write operations, including the
> written data. This got merged into nbd-3.24
>
> I've used that to create a log of running dbench (+ a few
> tar/rm/manual tests) on a 500 MB image file.
>
> In total, 2.9 mio 512-byte sector writes. The datalog is ~1.5 GB long.
>
> If replaying the initial 1,821,799, 1,821,800, 1,821,801 or
> 1,821,802 blocks, the above listed error message is shown.
>
> After 1,821,799 or 1,821,803 sectors, everything is ok.
>
> (block numbers are 0-based)
>
> > H=2400000047010000 C=0x00000001 (NBD_CMD_WRITE+NONE)
> > O=0000000010deb000 L=00001000
> > block 1821795 (0x1bcc63): writing to offset 283029504 (0x10deb000), len 512 (0x200).
> > block 1821796 (0x1bcc64): writing to offset 283030016 (0x10deb200), len 512 (0x200).
> > block 1821797 (0x1bcc65): writing to offset 283030528 (0x10deb400), len 512 (0x200).  << OK
> > block 1821798 (0x1bcc66): writing to offset 283031040 (0x10deb600), len 512 (0x200).  FAIL
> > block 1821799 (0x1bcc67): writing to offset 283031552 (0x10deb800), len 512 (0x200).  FAIL
> > block 1821800 (0x1bcc68): writing to offset 283032064 (0x10deba00), len 512 (0x200).  FAIL
> > block 1821801 (0x1bcc69): writing to offset 283032576 (0x10debc00), len 512 (0x200).  FAIL
> > block 1821802 (0x1bcc6a): writing to offset 283033088 (0x10debe00), len 512 (0x200).  << OK

OK, this test is explicitly tearing writes at the storage level.
When there is an update to multiple sectors of the metadata block,
the metadata will be inconsistent on disk while those individual
sector writes are replayed.

For example, the problem here is likely the LSN that this write
stamps into the header along with the updated CRC. Log recovery
doesn't actually check the incoming CRC because it might be invalid
(say, due to a torn write), but it does check the magic number and
then the LSN that is stamped into the metadata block to determine if
it should be replayed or not (i.e. we have metadata version checks
in recovery).

If the LSN that is stamped into the header is more recent than the
object version that log recovery is trying to replay, it will skip
replay, because replaying the older version can result in
unnecessary transient corruption of the metadata on disk that
doesn't get corrected until later in the recovery process. This is
bad - if log recovery then fails before we recover the more recent
changes, we've created new on-disk corruption and made things worse,
not better....

So, let's find the log recovery LSN (same in all images) via
logprint - it's last logged as part of this transaction:

LOG REC AT LSN cycle 15 block 604 (0xf, 0x25c)
============================================================================
TRANS: tid:0x6d1b8e4f  #items:201  trans:0x6d1b8e4f  q:0x5608eeb23bd0

And there are 3 data regions in it:

BUF: cnt:4 total:4 a:0x5608eeb23f60 len:24 a:0x5608eeb20f70 len:128 a:0x5608eeb23970 len:384 a:0x5608eeb22130 len:256
BUF:  #regs:4  start blkno:0x86f58  len:8  bmap size:1  flags:0x5000
BUF DATA
BUF DATA
BUF DATA

The three regions are 128 bytes, 384 bytes and 256 bytes long. The
first chunk is clearly the first 128 bytes of the sector:

40 69 4123c 85000 86f58 0 1 c000001d 4f8e1b6d
      buf item    daddr              TID
48 80000000 69 33424458 b21e33d9 0 586f0800 e000000 29580000
   ophdr flags ID  XDB3     CRC      daddr    CYCLE    BLCK
50 23355a53 f14c2c57 b07cac8d b7eca938 0 25690800 400d2802 30006801
58 3000c801 0 0 25690800 22e01 40000000 0 4d0a0c00
60 22e2e02 50000000 0 26690800 5244430b 534c4f52 4746432e 60000001
68 0 27690800 4f8e1b6d

So, when this item was logged, the LSN in the in-memory buffer was
(0xe,0x5829), and it is being replayed at (0xf,0x25c). That's good -
it indicates that what is in the journal is valid. But what is in
the block on disk?

xfs_db> daddr 0x86f58
xfs_db> p
000: 58444233 9fabd7f4 00000000 00086f58 0000000f 0000025c 535a3523 572c4cf1
     magic    crc      daddr             cycle    block
....

Oh, I didn't need to get it off disk like this - it's in the second
line of the hexdump output in the corruption reports:

[15063.024355] XFS (loop0): Metadata CRC error detected at xfs_dir3_block_read_verify+0x9e/0xc0 [xfs], xfs_dir3_block block 0x86f58
[15063.024466] XFS (loop0): Unmount and run xfs_repair
[15063.024468] XFS (loop0): First 128 bytes of corrupted metadata buffer:
[15063.024471] 00000000: 58 44 42 33 9f ab d7 f4 00 00 00 00 00 08 6f 58  XDB3..........oX
[15063.024474] 00000010: 00 00 00 0f 00 00 02 5c 53 5a 35 23 57 2c 4c f1  .......\SZ5#W,L.
                         CYCLE       BLOCK

Yup, there we go.
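To make that replay decision concrete, here's a rough userspace
sketch - it is not the XFS kernel code and all the names are made up
for illustration. It just encodes the check described above: pull
the big-endian LSN out of the v5 metadata block header (cycle in the
high 32 bits, block in the low 32 bits, at byte offset 16, as the
hexdump above shows) and only replay the logged buffer if the
on-disk copy is older than the checkpoint being recovered:

/*
 * Illustrative sketch only - not the kernel implementation.
 */
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

typedef uint64_t example_lsn_t;         /* cycle << 32 | block */

#define LSN_CYCLE(lsn)  ((uint32_t)((lsn) >> 32))
#define LSN_BLOCK(lsn)  ((uint32_t)(lsn))

/* The LSN sits at byte offset 16 of the header, stored big-endian. */
static example_lsn_t header_lsn(const uint8_t *hdr)
{
        example_lsn_t lsn = 0;

        for (int i = 0; i < 8; i++)
                lsn = (lsn << 8) | hdr[16 + i];
        return lsn;
}

/*
 * Replay only if the on-disk copy is older than the checkpoint being
 * recovered. Here the torn overwrite had already stamped (0xf,0x25c)
 * into the header, the same as the checkpoint at cycle 15 block 604,
 * so recovery skips it and the corrupt on-disk copy is left alone.
 */
static bool should_replay(example_lsn_t disk_lsn, example_lsn_t checkpoint_lsn)
{
        if (LSN_CYCLE(disk_lsn) != LSN_CYCLE(checkpoint_lsn))
                return LSN_CYCLE(disk_lsn) < LSN_CYCLE(checkpoint_lsn);
        return LSN_BLOCK(disk_lsn) < LSN_BLOCK(checkpoint_lsn);
}

int main(void)
{
        /* First 24 bytes from the corruption report hexdump above. */
        const uint8_t hdr[32] = {
                0x58, 0x44, 0x42, 0x33, 0x9f, 0xab, 0xd7, 0xf4,
                0x00, 0x00, 0x00, 0x00, 0x00, 0x08, 0x6f, 0x58,
                0x00, 0x00, 0x00, 0x0f, 0x00, 0x00, 0x02, 0x5c,
        };
        example_lsn_t disk = header_lsn(hdr);
        example_lsn_t ckpt = ((example_lsn_t)15 << 32) | 604;

        printf("disk lsn (0x%x,0x%x) -> replay? %s\n",
               LSN_CYCLE(disk), LSN_BLOCK(disk),
               should_replay(disk, ckpt) ? "yes" : "no");
        return 0;
}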
The LSN is (0xf,0x25c), which tells log recovery not to recover it
because it is the same as the LSN that the last journal checkpoint
recording changes to this block carries. So the write that the test
is tearing up is the in-place metadata overwrite, and tearing it
creates physical metadata corruption in the storage. That corruption
persists until all the sectors in the metadata block have been
updated, at which point your test failures go away again.

Hence to answer your original question: Yes, XFS is behaving exactly
as it was designed to behave. The metadata verifiers have correctly
detected corruption that resulted from the storage tearing all its
writes to little pieces and that journal recovery couldn't
automatically repair after the fact. We failed to repair it
automatically because the nature of the torn write told log recovery
"don't recover this metadata from the journal because it is already
up to date". Instead, the problem was detected on first access to
the torn up metadata, and by xfs_repair.

IOWs, there are no problems with XFS here. If there is any issue at
all, it is with the assumption that filesystems can always cleanly
recover from massively (or randomly) torn writes. The fact is that
they can't, and that's why we have things like CRCs and
self-describing metadata to detect when unexpected or unrecoverable
torn or misplaced writes occur deep down in the storage layers...

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx