Re: [PATCH] xfs: fix internal error from AGFL exhaustion

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Tue, Oct 24, 2023 at 04:37:33PM -0700, Omar Sandoval wrote:
> From: Omar Sandoval <osandov@xxxxxx>
> 
> We've been seeing XFS errors like the following:
> 
> XFS: Internal error i != 1 at line 3526 of file fs/xfs/libxfs/xfs_btree.c.  Caller xfs_btree_insert+0x1ec/0x280
> ...
> Call Trace:
>  xfs_corruption_error+0x94/0xa0
>  xfs_btree_insert+0x221/0x280
>  xfs_alloc_fixup_trees+0x104/0x3e0
>  xfs_alloc_ag_vextent_size+0x667/0x820
>  xfs_alloc_fix_freelist+0x5d9/0x750
>  xfs_free_extent_fix_freelist+0x65/0xa0
>  __xfs_free_extent+0x57/0x180
> ...

Good find, Omar!

For completeness, what's the rest of the trace? We've recently
changed how extent freeing path works w.r.t. busy extents so I'm
curious as to what the path into this code is.

Lots of what follows is really notes as I try to pull this apart and
understand it. Please check my working!

> 
> This is the XFS_IS_CORRUPT() check in xfs_btree_insert() when
> xfs_btree_insrec() fails.
> 
> After converting this into a panic and dissecting the core dump, I found
> that xfs_btree_insrec() is failing because it's trying to split a leaf
> node in the cntbt when the AG free list is empty. In particular, it's
> failing to get a block from the AGFL _while trying to refill the AGFL_.
> Our filesystems don't have the rmapbt enabled, so if both the bnobt and
> cntbt are 2 levels, the free list is 6 blocks.
>
> If a single operation
> causes both the bnobt and cntbt to split from 2 levels to 3 levels, this
> will allocate 6 new blocks and exhaust the free list.

Which is just fine - the AGFL was sized correctly for that specific
operation to allow it to succeed.

> Then, when the
> next operation tries to refill the freelist, it allocates space.

Ok, so the initial condition for this allocation is a completely
empty AGFL.

> If the
> allocation does not use a full extent, it will need to insert records
> for the remaining space in the bnobt and cntbt. And if those new records
> go in full leaves, the leaves need to be split.

Ok, so we've gone to do a by-size allocation, which will start the
search at:

	if ((error = xfs_alloc_lookup_ge(cnt_cur, 0,
                        args->maxlen + args->alignment - 1, &i)))

an extent that covers maxlen + alignment. The allocation parameters
set by the freelist fixup will be:

	targs.alignment = targs.minlen = targs.prod = 1;
...
	targs.maxlen = need - pag->pagf_flcount;

Ok, so the by-size allocation will select the first extent the same
size of larger than what the AGFL needs to fill the empty slots.

Oversize extents get selected because the allocation doesn't find an
extent of the exact size requested and so pulls a larger extent of
the free space tree. If that extent is not suitable (e.g. it is
busy), it then keeps walking in the direction of larger extents to
carve the free list out of.

IOWs, this search direction guarantees that we'll have to split
the free space extent if there is no free extent the exact size the
AGFL needs.

Ok, that's a possible avenue for fixing the issue - when the agfl is
completely empty, do a LE search so we never have to split a free
space extent and hence completely avoid btree splits....

> (It's guaranteed that
> none of the internal nodes need to be split because they were just
> split.)

Hmmm. That doesn't sound right - btrees don't work that way.

We don't know what path the tree split along in the previous insert
operation  - a split only guarantees the next insert along that same
path won't split if the insert hits one of the two leaf blocks from
the split.  Insert into a different leaf can still split interior
nodes all the way back up the tree the point where it intersects
with the previous full height split path.

e.g. if the previous full height split was down the left side of the
tree and the next insert is in the right half of the tree, the right
half can split on insert all the way up to the level below the old
root.

> Fix it by adding an extra block of slack in the freelist for the
> potential leaf split in each of the bnobt and cntbt.

Hmmm. yeah - example given is 2->3 level split, hence only need to
handle a single level leaf split before path intersection occurs.
Yup, adding an extra block will make the exact problem being seen go
away.

Ok, what's the general solution? 4-level, 5-level or larger trees?

Worst split after a full split is up to the level below the old root
block. the root block was split, so it won't need a split again, so
only it's children could split again. OK, so that's (height - 1)
needed for a max split to refill the AGFL after a full height split
occurred.

Hence it looks like the minimum AGFL space for any given
allocation/free operation needs to be:

	(full height split reservation) + (AGFL refill split height)

which is:

	= (new height) + (new height - 2)
	= 2 * new height - 2
	= 2 * (current height + 1) - 2
	= 2 * current height

Ok, so we essentially have to double the AGFL minimum size to handle
the generic case of AGFL refill splitting free space trees after a
transaction that has exhausted it's AGFL reservation.

Now, did I get that right?

> P.S. As far as I can tell, this bug has existed for a long time -- maybe
> back to xfs-history commit afdf80ae7405 ("Add XFS_AG_MAXLEVELS macros
> ...") in April 1994! It requires a very unlucky sequence of events, and
> in fact we didn't hit it until a particular sparse mmap workload updated
> from 5.12 to 5.19. But this bug existed in 5.12, so it must've been
> exposed by some other change in allocation or writeback patterns.

No surprise there - these sorts of fundamental reservations and
limits that largely avoid ENOSPC issues don't get touched unless a
problem is found. Who knows how many deep dark corners this specific
change will expose to the light...

> diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
> index 3069194527dd..2cbcf18fb903 100644
> --- a/fs/xfs/libxfs/xfs_alloc.c
> +++ b/fs/xfs/libxfs/xfs_alloc.c
> @@ -2275,12 +2275,16 @@ xfs_alloc_min_freelist(
>  
>  	ASSERT(mp->m_alloc_maxlevels > 0);
>  
> -	/* space needed by-bno freespace btree */
> +	/*
> +	 * space needed by-bno freespace btree: one per level that may be split
> +	 * by an insert, plus one more for a leaf split that may be necessary to
> +	 * refill the AGFL
> +	 */
>  	min_free = min_t(unsigned int, levels[XFS_BTNUM_BNOi] + 1,
> -				       mp->m_alloc_maxlevels);
> -	/* space needed by-size freespace btree */
> +				       mp->m_alloc_maxlevels) + 1;
> +	/* space needed by-size freespace btree, same as above */
>  	min_free += min_t(unsigned int, levels[XFS_BTNUM_CNTi] + 1,
> -				       mp->m_alloc_maxlevels);
> +				       mp->m_alloc_maxlevels) + 1;
>  	/* space needed reverse mapping used space btree */
>  	if (xfs_has_rmapbt(mp))
>  		min_free += min_t(unsigned int, levels[XFS_BTNUM_RMAPi] + 1,

The rmapbt case will need this change, too, because rmapbt blocks
are allocated from the free list and so an rmapbt update can exhaust
the free list completely, too.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx



[Index of Archives]     [XFS Filesystem Development (older mail)]     [Linux Filesystem Development]     [Linux Audio Users]     [Yosemite Trails]     [Linux Kernel]     [Linux RAID]     [Linux SCSI]


  Powered by Linux