On Mon, Jan 22, 2024 at 09:18:56PM +0800, Zorro Lang wrote:
> # dmesg
> [180724.293443] loop: module loaded
> [180724.294001] loop0: detected capacity change from 0 to 6876344
> [180724.296987] XFS (loop0): Mounting V5 Filesystem 59e2f6ae-ceab-4232-9531-a85417847238
> [180724.309088] XFS (loop0): Starting recovery (logdev: internal)
> [180724.335207] XFS (loop0): Bad dir block magic!
> [180724.335210] XFS: Assertion failed: 0, file: fs/xfs/xfs_buf_item_recover.c, line: 414
> [180724.335264] ------------[ cut here ]------------
> [180724.335265] kernel BUG at fs/xfs/xfs_message.c:102!
> [180724.335356] monitor event: 0040 ilc:2 [#1] SMP
> [180724.335362] Modules linked in: loop sunrpc rfkill vfio_ccw mdev vfio_iommu_type1 zcrypt_cex4 vfio iommufd drm fuse i2c_core drm_panel_orientation_quirks xfs libcrc32c ghash_s390 prng virtio_net des_s390 sha3_512_s390 net_failover sha3_256_s390 failover virtio_blk dm_mirror dm_region_hash dm_log dm_mod pkey zcrypt aes_s390
> [180724.335379] CPU: 2 PID: 6449 Comm: mount Kdump: loaded Not tainted 6.7.0+ #1
> [180724.335382] Hardware name: IBM 3931 LA1 400 (KVM/Linux)
> [180724.335384] Krnl PSW : 0704e00180000000 000003ff7fe692ca (assfail+0x62/0x68 [xfs])
> [180724.335727] R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
> [180724.335729] Krnl GPRS: c000000000000021 000003ff7fee3a40 ffffffffffffffea 000000000000000a
> [180724.335731] 000003800005b3c0 0000000000000000 000003ff800004c4 0000000300014178
> [180724.335732] 0000000090a87e80 0000000300014178 000000008bcf6000 00000000924a5000
> [180724.335734] 000003ffbe72ef68 000003ff7ffe4c20 000003ff7fe692a8 000003800005b468
> [180724.335742] Krnl Code: 000003ff7fe692bc: f0a8000407fe srp 4(11,%r0),2046,8
> 000003ff7fe692c2: 47000700 bc 0,1792
> #000003ff7fe692c6: af000000 mc 0,0
> >000003ff7fe692ca: 0707 bcr 0,%r7
> 000003ff7fe692cc: 0707 bcr 0,%r7
> 000003ff7fe692ce: 0707 bcr 0,%r7
> 000003ff7fe692d0: c00400133e0c brcl 0,000003ff800d0ee8
> 000003ff7fe692d6: eb6ff0480024 stmg %r6,%r15,72(%r15)
> [180724.335753] Call Trace:
> [180724.335754] [<000003ff7fe692ca>] assfail+0x62/0x68 [xfs]
> [180724.335835] ([<000003ff7fe692a8>] assfail+0x40/0x68 [xfs])
> [180724.335915] [<000003ff7fe8323e>] xlog_recover_validate_buf_type+0x2a6/0x5c8 [xfs]
> [180724.335997] [<000003ff7fe845ba>] xlog_recover_buf_commit_pass2+0x382/0x448 [xfs]
> [180724.336078] [<000003ff7fe8e89a>] xlog_recover_items_pass2+0x72/0xf0 [xfs]
> [180724.336159] [<000003ff7fe8f7ce>] xlog_recover_commit_trans+0x39e/0x3c0 [xfs]
> [180724.336240] [<000003ff7fe8f930>] xlog_recovery_process_trans+0x140/0x148 [xfs]
> [180724.336321] [<000003ff7fe8f9f8>] xlog_recover_process_ophdr+0xc0/0x180 [xfs]
> [180724.336402] [<000003ff7fe9002e>] xlog_recover_process_data+0xb6/0x168 [xfs]
> [180724.336482] [<000003ff7fe901e4>] xlog_recover_process+0x104/0x150 [xfs]
> [180724.336563] [<000003ff7fe905e2>] xlog_do_recovery_pass+0x3b2/0x748 [xfs]
> [180724.336643] [<000003ff7fe90dd0>] xlog_do_log_recovery+0x88/0xd8 [xfs]
> [180724.336727] [<000003ff7fe90e6c>] xlog_do_recover+0x4c/0x218 [xfs]
> [180724.336808] [<000003ff7fe9247a>] xlog_recover+0xda/0x1a0 [xfs]
> [180724.336888] [<000003ff7fe78d36>] xfs_log_mount+0x11e/0x280 [xfs]
> [180724.336967] [<000003ff7fe6a756>] xfs_mountfs+0x3e6/0x920 [xfs]
> [180724.337047] [<000003ff7fe71ffc>] xfs_fs_fill_super+0x40c/0x7d8 [xfs]
> [180724.337127] [<00000000552adf88>] get_tree_bdev+0x120/0x1a8
> [180724.337142] [<00000000552ab690>] vfs_get_tree+0x38/0x110
> [180724.337146] [<00000000552dee28>] do_new_mount+0x188/0x2e0
> [180724.337150] [<00000000552dfaa4>] path_mount+0x1ac/0x818
> [180724.337153] [<00000000552e0214>] __s390x_sys_mount+0x104/0x148
> [180724.337156] [<0000000055934796>] __do_syscall+0x21e/0x2b0
> [180724.337163] [<0000000055944d60>] system_call+0x70/0x98
> [180724.337170] Last Breaking-Event-Address:
> [180724.337221] [<000003ff7fe692b2>] assfail+0x4a/0x68 [xfs]
> [180724.337301] ---[ end trace 0000000000000000 ]---
>
> # trace-cmd report > testfs.trace.txt
> # bzip2 testfs.trace.txt
>
> Please download it from:
> https://drive.google.com/file/d/1FgpPidbMZHSjZinyc_WbVGfvwp2btA86/view?usp=sharing

Actually, looking at the compound buffer logging code and reminding myself of how it works, I think this might actually be a false positive.

What I see in the trace is this pattern of buffer log items being processed through every phase of recovery:

xfs_log_recover_buf_not_cancel: dev 7:0 daddr 0x2c2ce0, bbcount 0x10, flags 0x5000, size 2, map_size 2
xfs_log_recover_item_recover: dev 7:0 tid 0xce3ce480 lsn 0x300014178, pass 1, item 0x8ea70fc0, item type XFS_LI_BUF item region count/total 2/2

xfs_log_recover_buf_not_cancel: dev 7:0 daddr 0x331fb0, bbcount 0x58, flags 0x5000, size 2, map_size 11
xfs_log_recover_item_recover: dev 7:0 tid 0xce3ce480 lsn 0x300014178, pass 1, item 0x8f36c040, item type XFS_LI_BUF item region count/total 2/2

The item addresses, tid and LSN change, but the order of the two buf log items does not. These are both "flags 0x5000" (decoded below), which means both log items are XFS_BLFT_DIR_BLOCK_BUF types, they are both partial directory block buffers, and they are discontiguous. They also have different types of log items both before and after them, so it is likely these are two extents within the same compound buffer.

The way we log compound buffers is that we create a separate buf log format item for each extent in the buffer and log each of those ranges independently. IOWs, to recovery, these fragments of the directory block appear just like complete, regular buffers that need to be recovered.

Hence what we see above is that the first buffer (daddr 0x2c2ce0, bbcount 0x10) is the first extent in the directory block; it contains the header and magic number, so it recovers and verifies just fine. The second buffer is the tail of the directory block, and it does not contain a magic number because it starts mid-way through the directory block. Hence the magic number verification fails and the buffer is not recovered.

Compound buffers were logged this way so that they didn't require a change of log format to recover. Prior to compound buffers, the directory code kept its own dabuf structure to map multiple extents in a single directory block, and those got logged as separate buffer log format items as well. So the problem isn't directly related to the conversion of dabufs to compound buffers - the problem is that the buffer recovery verification code does not know that directory buffer fragments are valid recovery targets.

IOWs, this appears to be a log recovery problem and not a runtime issue. I think the fix will be to allow directory blocks to fail the magic number check if and only if the buffer length does not match the directory block size (i.e. it is a partial directory block fragment being recovered). This makes the verification a bit more tricky and perhaps a little less robust, but we do know from the buf log format flags that we are supposed to be recovering a directory block, so this additional check can be constrained.
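For reference, the "flags 0x5000" decode: the buffer content type is carried in the upper five bits of the blf_flags field of the buf log format item. Quoting roughly from memory of fs/xfs/libxfs/xfs_log_format.h (check the tree for the authoritative definitions):

#define XFS_BLFT_BITS	5
#define XFS_BLFT_SHIFT	11
#define XFS_BLFT_MASK	(((1 << XFS_BLFT_BITS) - 1) << XFS_BLFT_SHIFT)

/* xfs_blft_from_flags() effectively does: */
type = (blf->blf_flags & XFS_BLFT_MASK) >> XFS_BLFT_SHIFT;

so 0x5000 >> 11 == 10 == XFS_BLFT_DIR_BLOCK_BUF.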
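To show the shape of the check I have in mind, here's a rough sketch - not a patch, the helper name is made up and the case statement is paraphrased from memory of xlog_recover_validate_buf_type() in fs/xfs/xfs_buf_item_recover.c - of where a "partial directory block fragment" test could hang off the existing magic number check:

/*
 * Hypothetical helper: a dir block buf log item that covers less than a
 * full directory block must be a fragment of a discontiguous (compound)
 * buffer, so it cannot be expected to contain the dir block magic.
 */
static bool
xlog_recover_is_dirblock_fragment(
	struct xfs_mount	*mp,
	struct xfs_buf		*bp)
{
	return BBTOB(bp->b_length) < mp->m_dir_geo->blksize;
}

/* ... and in xlog_recover_validate_buf_type(), roughly: */
	case XFS_BLFT_DIR_BLOCK_BUF:
		if (magic32 != XFS_DIR2_BLOCK_MAGIC &&
		    magic32 != XFS_DIR3_BLOCK_MAGIC &&
		    !xlog_recover_is_dirblock_fragment(mp, bp))
			warnmsg = "Bad dir block magic!";
		bp->b_ops = &xfs_dir3_block_buf_ops;
		break;

That way the relaxation only applies when the buf log format flags already tell us we are looking at a directory block buffer, which is the constraint I mean above.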
I suspect that we also need to verify that all other buffers that are supposed to log and recover entire objects have the correct size, too.

I'll start working on a fix now - I'm not sure what form it will take yet, as I need to stare at the code for a while longer.

Zorro, in the meantime, can you write up an xfstest that creates a small XFS filesystem with "-n size=64k" and a large log, sets it up with single block free space fragmentation (falloc, then punch alternate blocks), then creates a bunch of files (a few thousand) to produce a set of fragmented directory blocks, then runs 'shutdown -f' to force the log and prevent metadata writeback, and then unmounts and mounts the filesystem? The mount of the filesystem should then trigger this directory fragment recovery issue on any platform, not just s390.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx