On Tue, Oct 24, 2023 at 10:14:07PM -0700, Omar Sandoval wrote:
> On Wed, Oct 25, 2023 at 03:35:17PM +1100, Dave Chinner wrote:
> > On Tue, Oct 24, 2023 at 04:37:33PM -0700, Omar Sandoval wrote:
> > > From: Omar Sandoval <osandov@xxxxxx>
> > >
> > > We've been seeing XFS errors like the following:
> > >
> > > XFS: Internal error i != 1 at line 3526 of file fs/xfs/libxfs/xfs_btree.c. Caller xfs_btree_insert+0x1ec/0x280
> > > ...
> > > Call Trace:
> > >  xfs_corruption_error+0x94/0xa0
> > >  xfs_btree_insert+0x221/0x280
> > >  xfs_alloc_fixup_trees+0x104/0x3e0
> > >  xfs_alloc_ag_vextent_size+0x667/0x820
> > >  xfs_alloc_fix_freelist+0x5d9/0x750
> > >  xfs_free_extent_fix_freelist+0x65/0xa0
> > >  __xfs_free_extent+0x57/0x180
> > > ...
> >
> > Good find, Omar!
> >
> > For completeness, what's the rest of the trace? We've recently
> > changed how the extent freeing path works w.r.t. busy extents, so I'm
> > curious as to what the path into this code is.
>
> This one is from my fpunch reproducer, but there were two call traces we
> actually saw in production. One from writeback:
>
>  xfs_corruption_error+0x90/0xa0
>  xfs_btree_insert+0x1b1/0x1d0
>  xfs_alloc_fixup_trees+0x28e/0x450
>  xfs_alloc_ag_vextent_size+0x4e0/0x780
>  xfs_alloc_ag_vextent+0x11c/0x140
>  xfs_alloc_fix_freelist+0x3f2/0x510
>  xfs_alloc_vextent+0x254/0x490
>  xfs_bmap_btalloc+0x301/0x6e0
>  xfs_bmapi_allocate+0x206/0x2d0
>  xfs_bmapi_convert_delalloc+0x2b9/0x450
>  xfs_map_blocks+0x1cc/0x470
>  iomap_do_writepage+0x22b/0x860
>  write_cache_pages+0x1ae/0x4c0
>  iomap_writepages+0x1c/0x40
>  xfs_vm_writepages+0x6b/0xa0
>  do_writepages+0x6f/0x1d0
>  __writeback_single_inode+0x39/0x2b0
>  writeback_sb_inodes+0x277/0x510
>  __writeback_inodes_wb+0x6e/0xe0
>  wb_writeback+0xda/0x250
>  wb_workfn+0x1f4/0x430
>  process_one_work+0x12c/0x490
>  worker_thread+0x143/0x3e0
>  kthread+0xa3/0xd0
>  ret_from_fork+0x1f/0x30
>
> And one from inodegc:
>
>  xfs_corruption_error+0x90/0xa0
>  xfs_btree_insert+0x1b1/0x1d0
>  xfs_alloc_fixup_trees+0x28e/0x450
>  xfs_alloc_ag_vextent_size+0x4e0/0x780
>  xfs_alloc_ag_vextent+0x11c/0x140
>  xfs_alloc_fix_freelist+0x3f2/0x510
>  __xfs_free_extent+0xde/0x1f0
>  xfs_trans_free_extent+0x54/0xd0
>  xfs_extent_free_finish_item+0x69/0xa0
>  xfs_defer_finish_noroll+0x163/0x510
>  xfs_defer_finish+0x10/0x70
>  xfs_itruncate_extents_flags+0x13a/0x250
>  xfs_inactive_truncate+0xac/0xe0
>  xfs_inactive+0x14f/0x160
>  xfs_inodegc_worker+0x81/0x100
>  process_one_work+0x12c/0x490
>  worker_thread+0x143/0x3e0
>  kthread+0xa3/0xd0
>  ret_from_fork+0x1f/0x30
>
> These are both from 5.19.

Hey, I've seen that backtrace before!  Only a handful of times,
fortunately.

> > > (It's guaranteed that
> > > none of the internal nodes need to be split because they were just
> > > split.)
> >
> > Hmmm. That doesn't sound right - btrees don't work that way.
> >
> > We don't know what path the tree split along in the previous insert
> > operation - a split only guarantees the next insert along that same
> > path won't split if the insert hits one of the two leaf blocks from
> > the split. An insert into a different leaf can still split interior
> > nodes all the way back up the tree to the point where it intersects
> > with the previous full height split path.
> >
> > e.g. if the previous full height split was down the left side of the
> > tree and the next insert is in the right half of the tree, the right
> > half can split on insert all the way up to the level below the old
> > root.
>
> You're right, I was only thinking about the 2->3 level case. My
> assumption is only true in that case.
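
(A concrete check of Dave's point, with my own numbers rather than
anything from the patch: after a 2->3 split, the level below the new
root holds exactly the two half-full halves of the old root, so the
next insert can split at most one full leaf - the half-full parent
absorbs the new key, and a single extra AGFL block covers it.  After a
3->4 split, only the old root's level is guaranteed to be half-full; an
insert down a different, full path can still split a leaf and its
level-1 parent, so one spare block is no longer enough.)
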
> > > Fix it by adding an extra block of slack in the freelist for the
> > > potential leaf split in each of the bnobt and cntbt.

I see how the cntbt can split, because changing the length of a
freespace extent can require the record to move to a totally different
part of the cntbt.  The reinsertion causes the split.  (There's a
sketch of that reordering appended at the end of this mail.)

How does the bnobt split during a refresh of the AGFL?  Will we ever
allocate a block from the /middle/ of a free space extent?

> > Hmmm. Yeah - the example given is a 2->3 level split, hence we only
> > need to handle a single level leaf split before path intersection
> > occurs. Yup, adding an extra block will make the exact problem being
> > seen go away.
> >
> > Ok, what's the general solution? 4-level, 5-level or larger trees?
> >
> > The worst split after a full split is up to the level below the old
> > root block. The root block was split, so it won't need a split again,
> > so only its children could split again. OK, so that's (height - 1)
> > needed for a max split to refill the AGFL after a full height split
> > occurred.
> >
> > Hence it looks like the minimum AGFL space for any given
> > allocation/free operation needs to be:
> >
> >	(full height split reservation) + (AGFL refill split height)
> >
> > which is:
> >
> >	= (new height) + (new height - 2)
> >	= 2 * new height - 2
> >	= 2 * (current height + 1) - 2
> >	= 2 * current height
> >
> > Ok, so we essentially have to double the AGFL minimum size to handle
> > the generic case of AGFL refill splitting free space trees after a
> > transaction that has exhausted its AGFL reservation.
> >
> > Now, did I get that right?
>
> That sounds right, but I'll have to think about it more after some
> sleep :)

I think that's correct, assuming that (2 * current_height) is the
per-btree calculation.  (I've appended a quick arithmetic check of that
at the end of this mail, too.)

> Assuming that is correct, your LE search suggestion sounds kind of nice
> rather than such a drastic change to the AGFL size.

The absolute maximum height of a free space btree is 7 blocks,
according to xfs_db:

# xfs_db /dev/sda -c 'btheight -w absmax all'
bnobt: 7
cntbt: 7
inobt: 6
finobt: 6
bmapbt: 14
refcountbt: 6
rmapbt: 10

The smallest AGFL is 62 records long (V4 fs, 512b blocks), so I don't
think it's /that/ drastic.  For a filesystem with rmapbt (V5, 1k blocks)
that minimum jumps to 121 blocks.

> > The rmapbt case will need this change, too, because rmapbt blocks
> > are allocated from the free list, and so an rmapbt update can exhaust
> > the free list completely, too.
>
> Ah, I missed where the rmapbt is updated, since it was slightly removed
> from the xfs_alloc_fixup_trees() code path I was looking at.

The rmapbt has its own accounting tricks (XFS_AG_RESV_RMAPBT) to ensure
that there's always enough free space to refill the AGFL.  Is that why
the testcase turns off rmapbt?

--D

> Thanks for the quick review!
>
> Omar
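
P.S. Here's the cntbt reordering sketch promised above.  These are
made-up types for illustration, not the kernel's actual structures; the
point is just that cntbt records sort by (length, startblock), so
shortening an extent changes where its record lives in the tree:

	#include <stdint.h>
	#include <stdio.h>

	/* Toy stand-in for a free space record; not kernel code. */
	struct freesp_rec {
		uint32_t	startblock;
		uint32_t	length;
	};

	/* cntbt ordering: by length first, ties broken by startblock. */
	static int cntbt_cmp(struct freesp_rec a, struct freesp_rec b)
	{
		if (a.length != b.length)
			return a.length < b.length ? -1 : 1;
		if (a.startblock != b.startblock)
			return a.startblock < b.startblock ? -1 : 1;
		return 0;
	}

	int main(void)
	{
		/* A 200-block free extent at agbno 100... */
		struct freesp_rec before = { .startblock = 100, .length = 200 };
		/* ...loses its first 50 blocks to an allocation. */
		struct freesp_rec after = { .startblock = 150, .length = 150 };

		/*
		 * The record now sorts before every record of length
		 * 151-200 that it used to follow, so the reinsert can
		 * land in a completely different (and full) leaf.
		 */
		printf("%d\n", cntbt_cmp(after, before));	/* prints -1 */
		return 0;
	}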
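
P.P.S. And the arithmetic check of Dave's doubling, again with a
made-up helper (this is not xfs_alloc_min_freelist()):

	#include <stdio.h>

	/*
	 * Worst case for one free space btree of current height h: a
	 * full height split allocates one new sibling per level plus a
	 * new root (h + 1 blocks), and the AGFL refill insert can then
	 * split every level below the old root (h - 1 more blocks).
	 */
	static unsigned int worst_case_agfl_blocks(unsigned int height)
	{
		return (height + 1) + (height - 1);	/* = 2 * height */
	}

	int main(void)
	{
		/* btheight says bnobt and cntbt max out at height 7. */
		unsigned int per_tree = worst_case_agfl_blocks(7);

		/* 14 per tree, 28 for both - still under a 62-slot AGFL. */
		printf("per tree: %u, bnobt + cntbt: %u\n",
		       per_tree, 2 * per_tree);
		return 0;
	}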