[PATCH 0/5] xfs: fix discontiguous metadata block recovery

Dave Chinner <david@xxxxxxxxxxxxx> · Tue, 19 Mar 2024 13:15:19 +1100

Recently Zorro tripped over a failure with 64kB directory blocks on
s390x via generic/648. Recovery was reporting failures like this:

 XFS (loop3): Mounting V5 Filesystem c1954438-a18d-4b4a-ad32-0e29c40713ed
 XFS (loop3): Starting recovery (logdev: internal)
 XFS (loop3): Bad dir block magic!
 XFS: Assertion failed: 0, file: fs/xfs/xfs_buf_item_recover.c, line: 414
 ....

Or recovery was succeeding and later directory operations were
detecting directory block corruption, such as:

 XFS (loop3): Metadata corruption detected at __xfs_dir3_data_check+0x372/0x6c0 [xfs], xfs_dir3_block block 0x1020
 XFS (loop3): Unmount and run xfs_repair
 XFS (loop3): First 128 bytes of corrupted metadata buffer:
 00000000: 58 44 42 33 00 00 00 00 00 00 00 00 00 00 10 20  XDB3...........
 ....

Further triage and diagnosis pointed to the fact that the test was
generating a discontiguous (multi-extent) directory block and that
directory block was not being recovered correctly when it was
encountered.

Zorro captured a trace, and what we saw in the trace was a specific
pattern of buffer log items being processed through every phase of
recovery:

 xfs_log_recover_buf_not_cancel: dev 7:0 daddr 0x2c2ce0, bbcount 0x10, flags 0x5000, size 2, map_size 2
 xfs_log_recover_item_recover: dev 7:0 tid 0xce3ce480 lsn 0x300014178, pass 1, item 0x8ea70fc0, item type XFS_LI_BUF item region count/total 2/2
 xfs_log_recover_buf_not_cancel: dev 7:0 daddr 0x331fb0, bbcount 0x58, flags 0x5000, size 2, map_size 11
 xfs_log_recover_item_recover: dev 7:0 tid 0xce3ce480 lsn 0x300014178, pass 1, item 0x8f36c040, item type XFS_LI_BUF item region count/total 2/2

The item addresses, tid and LSN change, but the order of the two
buf log items does not.

These are both "flags 0x5000" which means both log items are
XFS_BLFT_DIR_BLOCK_BUF types, and they are both partial directory
block buffers, and they are discontiguous. They also have different
types of log items both before and after them, so it is likely these
are two extents within the same compound buffer.
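
As a rough illustration of how "flags 0x5000" maps to a buffer type,
here is a minimal user-space sketch. The shift, width and enum order
are reproduced from my reading of fs/xfs/libxfs/xfs_log_format.h and
should be treated as assumptions to check against the tree;
blft_from_flags() is a hypothetical name, not a kernel function.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed to mirror xfs_log_format.h: type lives in the top 5 bits. */
#define XFS_BLFT_BITS	5
#define XFS_BLFT_SHIFT	11
#define XFS_BLFT_MASK	(((1 << XFS_BLFT_BITS) - 1) << XFS_BLFT_SHIFT)

/* Assumed enum order; entries after DIR_BLOCK_BUF are omitted here. */
enum xfs_blft {
	XFS_BLFT_UNKNOWN_BUF = 0,
	XFS_BLFT_UDQUOT_BUF,
	XFS_BLFT_PDQUOT_BUF,
	XFS_BLFT_GDQUOT_BUF,
	XFS_BLFT_BTREE_BUF,
	XFS_BLFT_AGF_BUF,
	XFS_BLFT_AGFL_BUF,
	XFS_BLFT_AGI_BUF,
	XFS_BLFT_DINO_BUF,
	XFS_BLFT_SYMLINK_BUF,
	XFS_BLFT_DIR_BLOCK_BUF,
};

/* Hypothetical helper: extract the buffer type from blf_flags. */
static inline enum xfs_blft blft_from_flags(uint16_t blf_flags)
{
	return (enum xfs_blft)((blf_flags & XFS_BLFT_MASK) >> XFS_BLFT_SHIFT);
}

/* Sanity check against the value seen in the trace. */
static inline void check_trace_flags(void)
{
	assert(blft_from_flags(0x5000) == XFS_BLFT_DIR_BLOCK_BUF);
}
```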

The way we log compound buffers is that we create a buf log format
item for each extent in the buffer, and then we log each range as a
separate buf log format item. IOWs, to recovery these fragments of
the directory block appear just like complete regular buffers that
need to be recovered.

Hence what we see above is the first buffer (daddr 0x2c2ce0, bbcount
0x10) is the first extent in the directory block that contains the
header and magic number, so it recovers and verifies just fine. The
second buffer is the tail of the directory block, and it does not
contain a magic number because it starts mid-way through the
directory block. Hence the magic number verification fails and the
buffer is not recovered.
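
To make the head vs tail difference concrete, here is a small
user-space sketch of a magic number check applied to each region on
its own. It is not the recovery code; region_has_dir_magic() and
get_be32() are hypothetical helpers, and the only value taken from
the report is the "XDB3" magic visible in the hexdump above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* "XDB3" as seen at offset 0 of the dumped directory block. */
#define DIR3_BLOCK_MAGIC	0x58444233u

/* XFS metadata is big-endian on disk, so decode byte by byte. */
static inline uint32_t get_be32(const uint8_t *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Hypothetical helper: does this logged region start with the magic? */
static inline bool region_has_dir_magic(const uint8_t *region, size_t len)
{
	return len >= 4 && get_be32(region) == DIR3_BLOCK_MAGIC;
}

static inline void demo_head_and_tail(void)
{
	/* First extent: starts at offset 0, so it carries the header. */
	const uint8_t head[8] = { 0x58, 0x44, 0x42, 0x33, 0, 0, 0, 0 };
	/* Second extent: stand-in bytes for mid-block directory data. */
	const uint8_t tail[8] = { 0 };

	assert(region_has_dir_magic(head, sizeof(head)));
	assert(!region_has_dir_magic(tail, sizeof(tail)));
}
```

A check like this can only ever pass for the region that starts at
offset 0 of the directory block, which is why the tail fragment is
rejected when it is treated as a complete buffer.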

Compound buffers were logged this way so that they didn't require a
change of log format to recover. Prior to compound buffers, the
directory code kept its own dabuf structure to map multiple extents
in a single directory block, and they got logged as separate buffer
log format items as well.

So the problem isn't directly related to the conversion of dabufs to
compound buffers - the problem is related to the buffer recovery
verification code not knowing that directory buffer fragments are
valid recovery targets.

Hence the fixes in this patchset are to log recovery, and do not
change runtime behaviour at all. The first thing we do is change the
buffer recovery code to consider a type mismatch between the BLF and
the buffer contents as a fatal error instead of a warning. If we
just warn and continue, the recovered metadata may still be corrupt
and so we should just abort with -EFSCORRUPTED when this occurs.
That addresses the case where recovery silently succeeded and
directory corruption was only detected later at runtime.
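
The warn vs abort distinction can be sketched in user space roughly
as follows. This is only an illustration of the policy change, not
the patch itself: check_recovered_type() and its parameters are
hypothetical, and EFSCORRUPTED is mapped to EUCLEAN here the way the
kernel does (with a numeric fallback that is an assumption for
non-Linux builds).

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* The kernel defines EFSCORRUPTED as EUCLEAN; mirror that here. */
#ifndef EFSCORRUPTED
#ifdef EUCLEAN
#define EFSCORRUPTED	EUCLEAN
#else
#define EFSCORRUPTED	117	/* assumed fallback value */
#endif
#endif

/*
 * Hypothetical policy check. 'type_matches' says whether the buffer
 * contents agree with the type recorded in the buf log format item.
 * With 'fatal' false this models warn-and-continue; with 'fatal'
 * true it models aborting recovery.
 */
static inline int check_recovered_type(bool type_matches, bool fatal)
{
	if (type_matches)
		return 0;

	fprintf(stderr, "buffer type mismatch during recovery\n");
	return fatal ? -EFSCORRUPTED : 0;
}

static inline void demo_policy(void)
{
	/* Warn only: the caller carries on with suspect metadata. */
	assert(check_recovered_type(false, false) == 0);
	/* Fatal: the caller sees an error and can abort the mount. */
	assert(check_recovered_type(false, true) == -EFSCORRUPTED);
}
```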

We then need to untangle the buffer recovery code a bit. Inode
buffer, dquot buffer and regular buffer recovery are all a bit
different, but they are tightly intertwined. Neither dquot nor inode
buffer recovery needs discontiguous buffer recovery detection, and
they also have different constraints, so separate them out. We also
always recover inode and dquot buffers, so we don't need to check
magic numbers or decode internal LSNs to determine if they should be
recovered.

With that done, we can then add code to the general buffer recovery
to detect partial block recovery situations. We check the BLF type
to determine if it is a directory buffer, and add a path for
recovery of partial directory block items. This allows recovery of
regions of directory blocks that do not start at offset 0 in the
directory block. This fixes the initial "bad dir block magic" issue
reported, and results in correct recovery of discontiguous directory
blocks.

IOWs, this appears to be a log recovery problem and not a runtime
issue. I think the fix will be to allow directory blocks to fail the
magic number check if and only if the buffer length does not match
the directory block size (i.e. it is a partial directory block
fragment being recovered).
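
The "if and only if" rule above can be sketched as a simple
predicate. This is a user-space illustration under stated
assumptions, not the code in the patches: the helper names are
hypothetical, and the only inputs taken from the report are the two
bbcount values from the trace and the 64kB directory block size.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BBSHIFT	9	/* basic blocks are 512 bytes */

static inline uint64_t bb_to_bytes(uint64_t bbcount)
{
	return bbcount << BBSHIFT;
}

/* Hypothetical: a logged region shorter or longer than one dir block. */
static inline bool is_partial_dir_fragment(uint64_t bbcount,
					    uint64_t dirblksize)
{
	return bb_to_bytes(bbcount) != dirblksize;
}

/*
 * Hypothetical: a failed magic check is tolerated only for directory
 * buffers whose length does not match the directory block size.
 */
static inline bool magic_failure_allowed(bool is_dir_buf, uint64_t bbcount,
					 uint64_t dirblksize)
{
	return is_dir_buf && is_partial_dir_fragment(bbcount, dirblksize);
}

static inline void demo_trace_lengths(void)
{
	const uint64_t dirblksize = 64 * 1024;

	assert(is_partial_dir_fragment(0x10, dirblksize));	/* 8kB */
	assert(is_partial_dir_fragment(0x58, dirblksize));	/* 44kB */
	assert(!is_partial_dir_fragment(0x80, dirblksize));	/* 64kB */
}
```

Both regions from the trace are shorter than a full 64kB directory
block, so under this rule both would be treated as fragments, while
a complete directory block must still pass the magic number check.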

This passes repeated looping over '-g enospc -g recoveryloop' on
64kB directory block size configurations, so the change to recovery
hasn't caused any obvious regressions in fixing this issue.

Thoughts?