On Tue, Mar 22, 2022 at 02:26:49PM +1100, Dave Chinner wrote:
> On Mon, Mar 21, 2022 at 07:35:28PM -0700, Darrick J. Wong wrote:
> > On Mon, Mar 21, 2022 at 12:23:27PM +1100, Dave Chinner wrote:
> > > Hi folks,
> > > 
> > > These two patches are followups to my recent series of
> > > shutdown/recovery fixes. The cluster buffer lock patch addresses a
> > > race condition that started to show up regularly once the fixes in
> > > the previous series were done - it is a regression from the async
> > > inode reclaim work that was done almost 2 years ago now.
> > > 
> > > The second patch is something I'm really surprised has taken this
> > > long to uncover. There is a check in intent recovery/cancellation
> > > that checks that there are no intent items in the AIL after the
> > > first non-intent item is found. This behaviour was correct back when
> > > we only had standalone intent items (i.e. EFI/EFD), but when we
> > > started to chain complex operations by intents, the recovery of an
> > > incomplete intent can log and commit new intents and they can end up
> > > in the AIL before log recovery is complete and finished processing
> > > the deferred items. Hence the ASSERT() check that no intents
> > > exist in the AIL after the first non-intent item is simply invalid.
> > > 
> > > With these two patches, I'm back to being able to run hundreds of
> > > cycles of g/388 or -g recoveryloop without seeing any failures.
> > 
> > Hmm, with this series applied to current for-next
> > + your other log patches
> > + my xfs_reserve_blocks infinite loop fix and cleanup series
> > 
> > I see this while running fstests on an all-defaults v5 filesystem.
> > I /think/ it's crashing in the "xfs_buf_ispinned(bp)" right below the
> > "ASSERT(iip->ili_item.li_buf);" in xfs_inode_item_push.
> 
> Oh, because you have CONFIG_XFS_ASSERT_FATAL=n, or whatever the
> config thingy is to keep going even when asserts fail...

Yes, because I often have ftrace turned on when I'm debugging things,
and it's /much/ easier to let trace-cmd continue to slurp data out of
the ring buffer (even if I have to manually kill the fstest) than dig
through 10000+ lines of serial console log after the fact.  I always
enable the printk tracepoint so the traces and the console output stay
in sync too.

> > [ 1294.636121] run fstests generic/251 at 2022-03-21 16:06:27
> > [ 1295.168287] XFS (sda): Mounting V5 Filesystem
> > [ 1295.180601] XFS (sda): Ending clean mount
> > [ 1295.471352] XFS (sdf): Mounting V5 Filesystem
> > [ 1295.482082] XFS (sdf): Ending clean mount
> > [ 1295.483612] XFS (sdf): Quotacheck needed: Please wait.
> > [ 1295.487265] XFS (sdf): Quotacheck: Done.
> > [ 1376.010962] XFS (sdf): xlog_verify_grant_tail: space > BBTOB(tail_blocks)
> > [ 1386.006736] XFS: Assertion failed: iip->ili_item.li_buf, file: fs/xfs/xfs_inode_item.c, line: 547
> > [ 1386.007571] ------------[ cut here ]------------
> > [ 1386.007991] WARNING: CPU: 0 PID: 419577 at fs/xfs/xfs_message.c:112 assfail+0x3c/0x40 [xfs]
> > [ 1386.008837] Modules linked in: btrfs blake2b_generic xor lzo_compress lzo_decompress zlib_deflate raid6_pq zstd_compress ext4 mbcache jbd2 dm_flakey dm_snapshot dm_bufio dm_zero xfs libcrc32c ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_tcpudp ip_set_hash_ip xt_REDIRECT xt_set ip_set_hash_net ip_set_hash_mac ip_set iptable_nat nf_nat nf_conntrack nfnetlink nf_defrag_ipv6 nf_defrag_ipv4 ip6table_filter ip6_tables iptable_filter bfq pvpanic_mmio pvpanic sch_fq_codel ip_tables x_tables overlay nfsv4 af_packet
> > [ 1386.012759] CPU: 0 PID: 419577 Comm: xfsaild/sdf Not tainted 5.17.0-rc6-djwx #rc6 38f9b0b98565b531708bd3bd34525d3eec280440
> > [ 1386.013718] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-1ubuntu1.1 04/01/2014
> > [ 1386.014493] RIP: 0010:assfail+0x3c/0x40 [xfs]
> > [ 1386.015098] Code: 90 9c 45 a0 e8 81 f9 ff ff 8a 1d d3 28 0c 00 80 fb 01 76 0f 0f b6 f3 48 c7 c7 60 81 4f a0 e8 b3 4d fd e0 80 e3 01 74 02 0f 0b <0f> 0b 5b c3 48 8d 45 10 48 89 e2 4c 89 e6 48 89 1c 24 48 89 44 24
> > [ 1386.016706] RSP: 0018:ffffc9000530be18 EFLAGS: 00010246
> > [ 1386.017172] RAX: 0000000000000000 RBX: ffff888063ace300 RCX: 0000000000000000
> > [ 1386.017801] RDX: 00000000ffffffc0 RSI: 0000000000000000 RDI: ffffffffa0449eb0
> > [ 1386.018435] RBP: ffff88805fa897c0 R08: 0000000000000000 R09: 000000000000000a
> > [ 1386.019059] R10: 000000000000000a R11: f000000000000000 R12: ffff88800ad06f40
> > [ 1386.019685] R13: ffff88800ad06f68 R14: 00000016000191e9 R15: 0000000000000000
> > [ 1386.020395] FS: 0000000000000000(0000) GS:ffff88803ec00000(0000) knlGS:0000000000000000
> > [ 1386.021172] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [ 1386.021680] CR2: 0000564ba4fc9968 CR3: 000000002852c002 CR4: 00000000001706b0
> > [ 1386.022313] Call Trace:
> > [ 1386.022555]  <TASK>
> > [ 1386.022763]  xfs_inode_item_push+0xf4/0x140 [xfs 9a019b85afd13531ec29ae154df8383f289b73eb]
> > [ 1386.023579]  xfsaild+0x402/0xc30 [xfs 9a019b85afd13531ec29ae154df8383f289b73eb]
> > [ 1386.024367]  ? __set_cpus_allowed_ptr_locked+0xe0/0x1a0
> > [ 1386.024841]  ? preempt_count_add+0x73/0xa0
> > [ 1386.025255]  ? xfs_trans_ail_cursor_first+0x80/0x80 [xfs 9a019b85afd13531ec29ae154df8383f289b73eb]
> > [ 1386.026163]  kthread+0xea/0x110
> > [ 1386.026487]  ? kthread_complete_and_exit+0x20/0x20
> > [ 1386.026924]  ret_from_fork+0x1f/0x30
> > [ 1386.027261]  </TASK>
> 
> I haven't seen that particular failure - that's likely an
> interaction with stale inodes and buffers on inode cluster buffer
> freeing. On the surface it doesn't seem to be related to the
> shutdown changes, but it could be related to the fact that we
> are accessing the inode log item without holding either the buffer
> locked or the item locked.
> 
> Hence this could be racing with a journal IO completion for a freed
> inode cluster buffer that runs inode IO completion directly, which
> then calls xfs_iflush_abort() because XFS_ISTALE is set on all the
> inodes.
> 
> Yeah.
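That would line up with the stack trace above.  Sketching the ordering
as I understand it -- I'm paraphrasing both sides from memory rather
than quoting the tree, so treat the code below as illustrative, not the
actual source:

	/*
	 * xfsaild push side (xfs_inode_item_push, paraphrased): sample
	 * the cluster buffer pointer with no locks held, then trust it.
	 */
	struct xfs_inode_log_item *iip = INODE_ITEM(lip);
	struct xfs_inode	  *ip = iip->ili_inode;
	struct xfs_buf		  *bp = lip->li_buf;	/* unlocked sample */

	ASSERT(iip->ili_item.li_buf);		/* <-- the assert that fired */
	if (xfs_ipincount(ip) > 0 || xfs_buf_ispinned(bp))
		return XFS_ITEM_PINNED;		/* <-- NULL deref once li_buf is gone */

	/*
	 * Abort side: journal IO completion for the freed (XFS_ISTALE)
	 * cluster buffer winds up in xfs_iflush_abort(), which tears
	 * down the item state -- conceptually:
	 */
	spin_lock(&iip->ili_lock);
	iip->ili_item.li_buf = NULL;	/* the push side never re-checks this */
	spin_unlock(&iip->ili_lock);

so by the time the push gets to xfs_buf_ispinned(), bp can already be
gone from under it.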
> 
> I changed xfs_iflush_abort() to clean up the inode log item state
> before I removed it from the AIL, thinking that this was safe to do
> because anything that uses the iip->ili_item.li_buf should be
> checking it before using it.....
> 
> .... and that is exactly what xfs_inode_item_push() is _not_ doing.

Aha, that's sort of what I was guessing.

> OK, I know how this is happening, I'll have to rework the first
> patch in this series to address this.

Cool. :)

--D

> Cheers,
> 
> Dave.
> 
> -- 
> Dave Chinner
> david@xxxxxxxxxxxxx