Re: xfs: system fails to boot up due to Internal error xfs_trans_cancel

On Fri, Feb 17, 2023 at 08:53:32AM -0800, Darrick J. Wong wrote:
> On Fri, Feb 17, 2023 at 04:45:12PM +0530, shrikanth hegde wrote:
> > We are observing a panic on boot when loading the latest stable tree (v6.2-rc4) on
> > one of our systems; the system fails to come up. The system booted fine with the
> > v5.17 and v5.19 kernels; we first saw this issue with the v6.0 kernel.
> > 
> > Panic Log is below.
> > [  333.390539] ------------[ cut here ]------------
> > [  333.390552] WARNING: CPU: 56 PID: 12450 at fs/xfs/xfs_inode.c:1839 xfs_iunlink_lookup+0x58/0x80 [xfs]
> 
> Hmm, ok, so this is the same if (WARN_ON_ONCE(!ip || !ip->i_ino)) line
> in xfs_iunlink_lookup that I've been bonking my head on the past
> several days.  333 seconds uptime, so I guess this is a pretty recent
> mount.  You didn't post a full dmesg, so I can only assume there weren't
> any *other* obvious complaints from XFS when the fs was mounted...
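
For reference, the check that trips lives in xfs_iunlink_lookup(); a lightly
paraphrased sketch of the 6.2-era function (details approximated from memory,
not a verbatim copy of fs/xfs/xfs_inode.c):

    /*
     * Find an inode on the in-core unlinked list.  No reference is taken:
     * only unlinked, referenced inodes should be on the list, and the AGI
     * buffer lock gives us existence guarantees.  If the inode is not in
     * cache, the caller has to deal with it.
     */
    static struct xfs_inode *
    xfs_iunlink_lookup(struct xfs_perag *pag, xfs_agino_t agino)
    {
            struct xfs_inode        *ip;

            rcu_read_lock();
            ip = radix_tree_lookup(&pag->pag_ici_root, agino);

            /*
             * An inode that is not in memory (or is in RCU freeing limbo)
             * should never be reachable from the list, so warn about the
             * corruption and let the caller handle the failure -- this is
             * the WARNING at fs/xfs/xfs_inode.c:1839 in the report above.
             */
            if (WARN_ON_ONCE(!ip || !ip->i_ino)) {
                    rcu_read_unlock();
                    return NULL;
            }
            rcu_read_unlock();
            return ip;
    }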
> 
> > [  333.390615] Modules linked in: nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink rfkill sunrpc pseries_rng xts vmx_crypto xfs libcrc32c sd_mod sg ibmvscsi ibmveth scsi_transport_srp nvme nvme_core t10_pi crc64_rocksoft crc64 dm_mirror dm_region_hash dm_log dm_mod
> > [  333.390645] CPU: 56 PID: 12450 Comm: rm Not tainted 6.2.0-rc4ssh+ #4
> > [  333.390649] Hardware name: IBM,9080-HEX POWER10 (raw) 0x800200 0xf000006 of:IBM,FW1010.22 (NH1010_122) hv:phyp pSeries
> > [  333.390652] NIP:  c0080000004bfa80 LR: c0080000004bfa4c CTR: c000000000ea28d0
> > [  333.390655] REGS: c0000000442bb8c0 TRAP: 0700   Not tainted  (6.2.0-rc4ssh+)
> > [  333.390658] MSR:  8000000000029033 <SF,EE,ME,IR,DR,RI,LE>  CR: 24002842  XER: 00000000
> > [  333.390666] CFAR: c0080000004bfa54 IRQMASK: 0
> > [  333.390666] GPR00: c00000003b69c0c8 c0000000442bbb60 c008000000568300 0000000000000000
> > [  333.390666] GPR04: 00000000002ec44d 0000000000000000 0000000000000000 c000000004b27d78
> > [  333.390666] GPR08: 0000000000000000 c000000004b27e28 0000000000000000 fffffffffffffffd
> > [  333.390666] GPR12: 0000000000000040 c000004afecc5880 0000000106620918 0000000000000001
> > [  333.390666] GPR16: 000000010bd36e10 0000000106620dc8 0000000106620e58 0000000106620e90
> > [  333.390666] GPR20: 0000000106620e30 c0000000880ba938 0000000000200000 00000000002ec44d
> > [  333.390666] GPR24: 000000000008170d 000000000000000d c0000000519f4800 00000000002ec44d
> > [  333.390666] GPR28: c0000000880ba800 c00000003b69c000 c0000000833edd20 000000000008170d
> > [  333.390702] NIP [c0080000004bfa80] xfs_iunlink_lookup+0x58/0x80 [xfs]
> > [  333.390756] LR [c0080000004bfa4c] xfs_iunlink_lookup+0x24/0x80 [xfs]
> > [  333.390810] Call Trace:
> > [  333.390811] [c0000000442bbb60] [c0000000833edd20] 0xc0000000833edd20 (unreliable)
> > [  333.390816] [c0000000442bbb80] [c0080000004c0094] xfs_iunlink+0x1bc/0x280 [xfs]
> > [  333.390869] [c0000000442bbc00] [c0080000004c3f84] xfs_remove+0x1dc/0x310 [xfs]
> > [  333.390922] [c0000000442bbc70] [c0080000004be180] xfs_vn_unlink+0x68/0xf0 [xfs]
> > [  333.390975] [c0000000442bbcd0] [c000000000576b24] vfs_unlink+0x1b4/0x3d0
> 
> ...that trips when rm tries to remove a file, which means that the call
> stack is
> 
> xfs_remove -> xfs_iunlink -> xfs_iunlink_insert_inode ->
> xfs_iunlink_update_backref -> xfs_iunlink_lookup <kaboom>
> 
> It looks as though "rm foo" unlinked foo from the directory and was
> trying to insert it at the head of one of the unlinked lists in the AGI
> buffer.  The AGI claims that the list points to an ondisk inode, so the
> iunlink code tries to find the incore inode to update the incore list,
> fails to find an incore inode, and this is the result...
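
For context, the failed lookup surfaces through the backref update on the
insert path; roughly (signatures and error code approximated from the 6.2-era
code):

    /*
     * When inserting a new list head, the old head's *in-core* inode must
     * be pointed back at the inode being inserted.  If the old head was
     * never brought in core (say, a stale unlinked list left behind by an
     * earlier, partially failed recovery), the lookup WARNs and returns
     * NULL, and the error unwinds into xfs_trans_cancel() on a dirty
     * transaction -- consistent with the internal error in the subject.
     */
    static int
    xfs_iunlink_update_backref(struct xfs_perag *pag,
                               xfs_agino_t prev_agino,
                               xfs_agino_t next_agino)
    {
            struct xfs_inode        *ip;

            /* No update necessary if we are at the end of the list. */
            if (next_agino == NULLAGINO)
                    return 0;

            ip = xfs_iunlink_lookup(pag, next_agino);
            if (!ip)
                    return -EFSCORRUPTED;

            ip->i_prev_unlinked = prev_agino;
            return 0;
    }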

This implies that unlinked inode recovery failed for some reason,
and we didn't clear the unlinked list from the AGI properly.

> > We did a git bisect between v5.17 and v6.0, and it points to commit 04755d2e5821
> > as the first bad commit.
> > Short description of the commit:
> > commit 04755d2e5821b3afbaadd09fe5df58d04de36484 (refs/bisect/bad)
> > Author: Dave Chinner <dchinner@xxxxxxxxxx>
> > Date:   Thu Jul 14 11:42:39 2022 +1000
> > 
> >     xfs: refactor xlog_recover_process_iunlinks()
> 
> ...which was in the middle of the series that reworked the mount time
> iunlink clearing.  Oddly, I don't spot any obvious errors in /that/
> patch that didn't already exist.  But this does make me wonder, does
> xfs_repair -n have anything to say about unlinked or orphaned inodes?
> 
> The runtime code expects that every ondisk inode in an iunlink chain has
> an incore inode that is linked (via i_{next,prev}_unlinked) to the other
> incore inodes in that same chain.  If this requirement is not met, then
> the WARNings you see will trip, and the fs shuts down.
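
(The in-core linkage being described is carried in struct xfs_inode; a heavily
trimmed excerpt, field names as in the reworked code, everything else omitted:)

    struct xfs_inode {
            /* ... most fields omitted ... */

            /*
             * The on-disk inode only records the *next* inode in its AGI
             * bucket; the previous pointer exists purely in memory, which
             * is why every entry of an on-disk unlinked list must have an
             * in-core inode backing it at runtime.
             */
            xfs_agino_t     i_next_unlinked;
            xfs_agino_t     i_prev_unlinked;

            /* ... */
    };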
> 
> My hypothesis here is that one of the AGs has an unprocessed unlinked
> list.  At mount time, the ondisk log was clean, so mount time log
> recovery didn't invoke xlog_recover_process_iunlinks, and the list was
> not cleared.  The mount code does not construct the incore unlinked list
> from an existing ondisk iunlink list, hence the WARNing.  Prior to 5.17,
> we only manipulated the ondisk unlink list, and the code never noticed
> or cared if there were mystery inodes in the list that never went away.

Yup, that's what I'm thinking.

But, hmmm, why is xlog_recover_clear_agi_bucket() not actually
clearing the bucket head at that point?

Oh, because we don't capture (and never have captured) errors from
inode inactivation?  Or maybe the filesystem had already shut down due
to the unlink failure?

I'd love to know how the filesystem got into this state in the first
place, but at the moment I'll settle for a metadata dump of a broken
filesystem....

> (Obviously, if something blew up earlier in dmesg, that would be
> relevant here.)
> 
> It's possible that we could end up in this situation (clean log,
> unlinked inodes) if a previous log recovery was only partially
> successful at clearing the unlinked list, since all that code ignores
> errors.  If that happens, we ... succeed at mounting and clean the log.
> 
> If you're willing to patch your kernels, it would be interesting
> to printk if the xfs_read_agi or the xlog_recover_iunlink_bucket calls
> in xlog_recover_iunlink_ag return an error code.  It might be too late
> to capture that, hence my suggestion of seeing if xfs_repair -n will
> tell us anything else.
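
A rough sketch of where that instrumentation would go, against the 6.2-era
xlog_recover_iunlink_ag() (prototypes approximated; treat this as a guide
rather than an applyable patch):

    static void
    xlog_recover_iunlink_ag(struct xfs_perag *pag)
    {
            struct xfs_agi  *agi;
            struct xfs_buf  *agibp;
            int             bucket;
            int             error;

            error = xfs_read_agi(pag, NULL, &agibp);
            if (error) {
                    /* Currently ignored; a printk here would expose AGI read failures. */
                    xfs_warn(pag->pag_mount,
                             "iunlink recovery: xfs_read_agi(AG %u) error %d",
                             pag->pag_agno, error);
                    return;
            }

            agi = agibp->b_addr;
            xfs_buf_unlock(agibp);

            for (bucket = 0; bucket < XFS_AGI_UNLINKED_BUCKETS; bucket++) {
                    error = xlog_recover_iunlink_bucket(pag, agi, bucket);
                    if (error) {
                            /* Worth logging before the bucket gets cleared. */
                            xfs_warn(pag->pag_mount,
                                     "iunlink recovery: AG %u bucket %d error %d",
                                     pag->pag_agno, bucket, error);
                            xlog_recover_clear_agi_bucket(pag, bucket);
                    }
            }

            xfs_buf_rele(agibp);
    }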
> 
> I've long thought that the iunlink recovery ought to complain loudly and
> fail the mount if it can't clear all the unlinked files.  Given the new
> iunlink design, I think it's pretty much required now.  The uglier piece
> is that now we either (a) have to clear iunlinks at mount time
> unconditionally as Eric has been saying for years; or (b) construct the
> incore list at a convenient time so that the incore list always exists.

Recovery does actually construct the incore list sufficient to
unlink it (i.e. that's what the next/prev reference looping does).
We could walk the entire list first and bring it into memory,
but the problem there, I think, is that we have to hold references
to all the inodes we bring into memory during recovery; otherwise
they are immediately reclaimed once we drop the reference.

I suspect we need to work out why xlog_recover_clear_agi_bucket() is
not triggering and clearing the bucket on error, because that is
supposed to avoid this exact "could not process all unlinked inodes"
situation, and it warns loudly when it triggers....
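
For reference, what that clearing path does, paraphrased from the 6.2-era
code (details approximated):

    STATIC void
    xlog_recover_clear_agi_bucket(struct xfs_perag *pag, int bucket)
    {
            struct xfs_mount        *mp = pag->pag_mount;
            struct xfs_trans        *tp;
            struct xfs_agi          *agi;
            struct xfs_buf          *agibp;
            int                     offset;
            int                     error;

            error = xfs_trans_alloc(mp, &M_RES(mp)->tr_clearagi, 0, 0, 0, &tp);
            if (error)
                    goto out_error;

            error = xfs_read_agi(pag, tp, &agibp);
            if (error)
                    goto out_abort;

            /* Reset the on-disk bucket head so later mounts see an empty list. */
            agi = agibp->b_addr;
            agi->agi_unlinked[bucket] = cpu_to_be32(NULLAGINO);
            offset = offsetof(struct xfs_agi, agi_unlinked) +
                     (sizeof(xfs_agino_t) * bucket);
            xfs_trans_log_buf(tp, agibp, offset,
                              offset + sizeof(xfs_agino_t) - 1);

            error = xfs_trans_commit(tp);
            if (error)
                    goto out_error;
            return;

    out_abort:
            xfs_trans_cancel(tp);
    out_error:
            /*
             * The loud warning -- but if transaction allocation or commit
             * fails (e.g. because the fs has already shut down), the bucket
             * head is left in place and only this message records it.
             */
            xfs_warn(mp, "%s: failed to clear agi %d. Continuing.",
                     __func__, pag->pag_agno);
    }

In other words, if inactivation has already shut the filesystem down by the
time we get here, the clear quietly fails and the stale bucket survives to
the next mount.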

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx


