Re: [PATCH] xfs: fix internal error from AGFL exhaustion

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Wed, Oct 25, 2023 at 03:39:09PM -0700, Darrick J. Wong wrote:
> On Wed, Oct 25, 2023 at 01:03:00PM -0700, Omar Sandoval wrote:
> > On Wed, Oct 25, 2023 at 08:50:46AM -0700, Darrick J. Wong wrote:
> > > On Tue, Oct 24, 2023 at 10:14:07PM -0700, Omar Sandoval wrote:
> > > > On Wed, Oct 25, 2023 at 03:35:17PM +1100, Dave Chinner wrote:
> > > > > On Tue, Oct 24, 2023 at 04:37:33PM -0700, Omar Sandoval wrote:
> > > > > > From: Omar Sandoval <osandov@xxxxxx>
> > > > > > Fix it by adding an extra block of slack in the freelist for the
> > > > > > potential leaf split in each of the bnobt and cntbt.
> > > 
> > > I see how the cntbt can split because changing the length of a freespace
> > > extent can require the record to move to a totally different part of the
> > > cntbt.  The reinsertion causes the split.
> > > 
> > > How does the bnobt split during a refresh of the AGFL?  Will we ever
> > > allocate a block from the /middle/ of a free space extent?
> > 
> > That's the case I was worried about for the bnobt. I see two ways that
> > can happen:
> > 
> > 1. alignment, prod, or mod requirements, which the freelist doesn't
> >    have.
> > 2. Busy extents. I don't know enough about XFS to say whether or not
> >    this is applicable, but at first glance I don't see why not.
> 
> Yes, I think it is possible -- we allocate an extent to fill the AGFL,
> the beginning of that extent is still busy due to slow discard, so
> xfs_alloc_compute_aligned gives us a block from the middle of the free
> space.  Since AGFL filling has no particular alignment/prod/mod, any
> number of blocks are good enough, so we end up taking the blocks from
> the middle of the extent.

*nod*

That's exactly the conclusion I came to when I wondered how it was
possible...

> > > > > Hmmm. yeah - example given is 2->3 level split, hence only need to
> > > > > handle a single level leaf split before path intersection occurs.
> > > > > Yup, adding an extra block will make the exact problem being seen go
> > > > > away.
> > > > > 
> > > > > Ok, what's the general solution? 4-level, 5-level or larger trees?
> > > > > 
> > > > > Worst split after a full split is up to the level below the old root
> > > > > block. the root block was split, so it won't need a split again, so
> > > > > only it's children could split again. OK, so that's (height - 1)
> > > > > needed for a max split to refill the AGFL after a full height split
> > > > > occurred.
> > > > > 
> > > > > Hence it looks like the minimum AGFL space for any given
> > > > > allocation/free operation needs to be:
> > > > > 
> > > > > 	(full height split reservation) + (AGFL refill split height)
> > > > > 
> > > > > which is:
> > > > > 
> > > > > 	= (new height) + (new height - 2)
> > > > > 	= 2 * new height - 2
> > > > > 	= 2 * (current height + 1) - 2
> > > > > 	= 2 * current height
> > > > > 
> > > > > Ok, so we essentially have to double the AGFL minimum size to handle
> > > > > the generic case of AGFL refill splitting free space trees after a
> > > > > transaction that has exhausted it's AGFL reservation.
> > > > > 
> > > > > Now, did I get that right?
> > > > 
> > > > That sounds right, but I'll have to think about it more after some sleep
> > > > :)
> > > 
> > > I think that's correct, assuming that (2 * current_height) is the
> > > per-btree calcluation.
> > 
> > Agreed, except when the tree is already the maximum level. In that case,
> > the worst case is splitting up to but not including the root node twice,
> > which is 2 * height - 2 (i.e., stopping before Dave substituted new
> > height with current height + 1). So I think we want:
> > 
> > min_free = min_t(unsigned int, levels[XFS_BTNUM_CNTi] + 1,
> > 			       mp->m_alloc_maxlevels) * 2 - 2;
> > 
> > > > Assuming that is correct, your LE search suggestion sounds kind of nice
> > > > rather than such a drastic change to the AGFL size.
> > 
> > Come to think of it, if there is nothing <= the desired size but there
> > is something >, we have no choice but to do the split, so that doesn't
> > work.
> 
> Yeah, I was kind of wondering that myself.

Ok, I'm glad to see there are enough eyes on this to point out the
things I missed when musing about it as a possible solution...

> > > The absolute maximum height of a free space btree is 7 blocks, according
> > > to xfs_db:
> > > 
> > > # xfs_db /dev/sda -c 'btheight -w absmax all'
> > > bnobt: 7
> > > cntbt: 7
> > > inobt: 6
> > > finobt: 6
> > > bmapbt: 14
> > > refcountbt: 6
> > > rmapbt: 10
> > > 
> > > The smallest AGFL is 62 records long (V4 fs, 512b blocks) so I don't
> > > think it's /that/ drastic.  For a filesystem with rmapbt (V5, 1k blocks)
> > > that minimum jumps to 121 blocks.

Yup. keep in mind this comment in xfs_alloc_fix_freelist():

	 * XXX (dgc): When we have lots of free space, does this buy us
         * anything other than extra overhead when we need to put more blocks
         * back on the free list? Maybe we should only do this when space is
         * getting low or the AGFL is more than half full?

IOWs, I've previously mused about keeping the AGFL much fuller than
the absolute minimum. There doesn't seem to be any real gain to
keeping it at just the minimum size when we are nowhere near ENOSPC,
it just means we are doing lots of management operations like
reducing it immediately after a transaction that has released blocks
to the freelist, then only to have to extend it again when we do an
operation that consumes blocks from the free list.

i.e. should we add also hysteresis to limit the number of AGFL
size manipulations we need to do in mixed workloads? i.e. we don't free
blocks from the AGFL until it might get overfilled by an operation
(i.e. max size - max merge free blocks) or we are nearing ENOSPC,
and we don't fill till we are at the minimum. In either case, the
target for empty or fill operations is "half full"...

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx



[Index of Archives]     [XFS Filesystem Development (older mail)]     [Linux Filesystem Development]     [Linux Audio Users]     [Yosemite Trails]     [Linux Kernel]     [Linux RAID]     [Linux SCSI]


  Powered by Linux