Re: [PATCH 4/4] xfs: fix AGF vs inode cluster buffer deadlock

On Sun, Jun 25, 2023 at 03:58:15AM +0100, Matthew Wilcox wrote:
> On Wed, May 17, 2023 at 10:04:49AM +1000, Dave Chinner wrote:
> > Lock order in XFS is AGI -> AGF, hence for operations involving
> > inode unlinked list operations we always lock the AGI first. Inode
> > unlinked list operations operate on the inode cluster buffer,
> > so the lock order there is AGI -> inode cluster buffer.
> 
> Hi Dave,
> 
> This commit reliably produces an assertion failure for me.  I haven't
> tried to analyse why.  It's pretty clear though; I can run generic/426
> in a loop for hundreds of seconds on the parent commit (cb042117488d),
> but it'll die within 30 seconds on commit 82842fee6e59.
> 
>     export MKFS_OPTIONS="-m reflink=1,rmapbt=1 -i sparse=1 -b size=1024"

That's part of my regular regression test config (mkfs defaults w/
1kB block size), and I haven't seen this problem.

I've just kicked off an iteration of g/426 on a couple of machines,
on a current TOT, and they are already a couple of hundred
iterations in without a failure....

Can you grab a trace for me? i.e. run

# trace-cmd record -e xfs\* -e printk

in one shell and leave it running. Then, in another shell, run the
test. When the test fails, ctrl-c the trace-cmd session and send me
the output of

# trace-cmd report > report.txt

> I suspect the size=1024 is the important thing, but I haven't tested
> that hypothesis.  This is on an x86-64 virtual machine; full qemu
> command line at the end [1]

As it's an inode cluster buffer failure, I very much doubt
filesystem block size plays a part. Inode buffer size is defined by
inode size, not filesystem block size, so the buffer in question
will be a 16kB unmapped buffer because the inode size is 512 bytes...

> 00028 FSTYP         -- xfs (debug)
> 00028 PLATFORM      -- Linux/x86_64 pepe-kvm 6.4.0-rc5-00004-g82842fee6e59 #182 SMP PREEMPT_DYNAMIC Sat Jun 24 22:51:32 EDT 2023
> 00028 MKFS_OPTIONS  -- -f -m reflink=1,rmapbt=1 -i sparse=1 -b size=1024 /dev/sdc
> 00028 MOUNT_OPTIONS -- /dev/sdc /mnt/scratch
> 00028
> 00028 XFS (sdc): Mounting V5 Filesystem 591c2048-7cce-4eda-acf7-649e19cd8554
> 00028 XFS (sdc): Ending clean mount
> 00028 XFS (sdc): Unmounting Filesystem 591c2048-7cce-4eda-acf7-649e19cd8554
> 00028 XFS (sdb): EXPERIMENTAL online scrub feature in use. Use at your own risk!
> 00028 XFS (sdb): Unmounting Filesystem 9db9e0a2-c05b-4690-a938-ae8f7b70be8e
> 00028 XFS (sdb): Mounting V5 Filesystem 9db9e0a2-c05b-4690-a938-ae8f7b70be8e
> 00028 XFS (sdb): Ending clean mount
> 00028 generic/426       run fstests generic/426 at 2023-06-25 02:52:07
> 00029 XFS: Assertion failed: bp->b_flags & XBF_DONE, file: fs/xfs/xfs_trans_buf.c, line: 241
> 00029 ------------[ cut here ]------------
> 00029 kernel BUG at fs/xfs/xfs_message.c:102!
> 00029 invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
> 00029 CPU: 1 PID: 62 Comm: kworker/1:1 Kdump: loaded Not tainted 6.4.0-rc5-00004-g82842fee6e59 #182
> 00029 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
> 00029 Workqueue: xfs-inodegc/sdb xfs_inodegc_worker
> 00029 RIP: 0010:assfail+0x30/0x40
> 00029 Code: c9 48 c7 c2 48 f8 ea 81 48 89 f1 48 89 e5 48 89 fe 48 c7 c7 b9 cc e5 81 e8 fd fd ff ff 80 3d f6 2f d3 00 00 75 04 0f 0b 5d c3 <0f> 0b 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 55 48 63 f6 49 89
> 00029 RSP: 0018:ffff88800317bc78 EFLAGS: 00010202
> 00029 RAX: 00000000ffffffea RBX: ffff88800611e000 RCX: 000000007fffffff
> 00029 RDX: 0000000000000021 RSI: 0000000000000000 RDI: ffffffff81e5ccb9
> 00029 RBP: ffff88800317bc78 R08: 0000000000000000 R09: 000000000000000a
> 00029 R10: 000000000000000a R11: 0fffffffffffffff R12: ffff88800c780800
> 00029 R13: ffff88800317bce0 R14: 0000000000000001 R15: ffff88800c73d000
> 00029 FS:  0000000000000000(0000) GS:ffff88807d840000(0000) knlGS:0000000000000000
> 00029 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> 00029 CR2: 00005623b1911068 CR3: 000000000ee28003 CR4: 0000000000770ea0
> 00029 PKRU: 55555554
> 00029 Call Trace:
> 00029  <TASK>
> 00029  ? show_regs+0x5c/0x70
> 00029  ? die+0x32/0x90
> 00029  ? do_trap+0xbb/0xe0
> 00029  ? do_error_trap+0x67/0x90
> 00029  ? assfail+0x30/0x40
> 00029  ? exc_invalid_op+0x52/0x70
> 00029  ? assfail+0x30/0x40
> 00029  ? asm_exc_invalid_op+0x1b/0x20
> 00029  ? assfail+0x30/0x40
> 00029  ? assfail+0x23/0x40
> 00029  xfs_trans_read_buf_map+0x2d9/0x480
> 00029  xfs_imap_to_bp+0x3d/0x40
> 00029  xfs_inode_item_precommit+0x176/0x200
> 00029  xfs_trans_run_precommits+0x65/0xc0
> 00029  __xfs_trans_commit+0x3d/0x360
> 00029  xfs_trans_commit+0xb/0x10
> 00029  xfs_inactive_ifree.isra.0+0xea/0x200
> 00029  xfs_inactive+0x132/0x230
> 00029  xfs_inodegc_worker+0xb6/0x1a0
> 00029  process_one_work+0x1a9/0x3a0
> 00029  worker_thread+0x4e/0x3a0
> 00029  ? process_one_work+0x3a0/0x3a0
> 00029  kthread+0xf9/0x130
> 
> In case things have moved around since that commit, the particular line
> throwing the assertion is in this paragraph:
> 
>         if (bp) {
>                 ASSERT(xfs_buf_islocked(bp));
>                 ASSERT(bp->b_transp == tp);
>                 ASSERT(bp->b_log_item != NULL);
>                 ASSERT(!bp->b_error);
>                 ASSERT(bp->b_flags & XBF_DONE);

Nothing immediately obvious stands out here. 

I suspect it may be an interaction with memory reclaim: the inode
cluster buffer is freed from memory while clean, after the inode has
been brought into memory. Then xfs_ifree_cluster uses
xfs_trans_get_buf() to invalidate the inode cluster (bringing the
buffer back into memory without reading its contents), and then we
try to read it, find it already linked into the transaction, and
skip the buffer cache lookup that would have read the data
in....

The trace will tell me if this is roughly what is happening.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx


