On Mon, Aug 07, 2023 at 04:19:11PM +1000, Dave Chinner wrote:
> On Thu, Jul 27, 2023 at 03:18:32PM -0700, Darrick J. Wong wrote:
> > Hi all,
> >
> > These patches fix a few problems that I noticed in the code that
> > deals with old btree blocks after a successful repair.
> >
> > First, I observed that it is possible for repair to incorrectly
> > invalidate and delete old btree blocks if they were crosslinked.
> > The solution here is to consult the reverse mappings for each block
> > in the extent -- singly owned blocks are invalidated and freed,
> > whereas for crosslinked blocks, we merely drop the incorrect
> > reverse mapping.
> >
> > A largeish change in this patchset is moving the reaping code to a
> > separate file, because the code is mostly interrelated static
> > functions.  For now this also drops the ability to reap file
> > blocks, which will return when we add the bmbt repair functions.
> >
> > Second, we convert the reap function to use EFIs so that we can
> > commit to freeing as many blocks in as few transactions as we dare.
> > We would like to free as many old blocks as we can in the same
> > transaction that commits the new structure to the ondisk filesystem
> > to minimize the number of blocks that leak if the system crashes
> > before the repair fully completes.
> >
> > The third change made in this series is to avoid tripping buffer
> > cache assertions if we're merely scanning the buffer cache for
> > buffers to invalidate, and find a non-stale buffer of the wrong
> > length.  This is primarily cosmetic, but makes my life easier.
> >
> > The fourth change restructures the reaping code to try to process
> > as many blocks in one go as possible, to reduce logging traffic.
> >
> > The last change switches the reaping mechanism to use the per-AG
> > bitmaps defined in a previous patchset.  This should reduce type
> > confusion when reading the source code.
> >
> > If you're going to start using this mess, you probably ought to
> > just pull from my git trees, which are linked below.
> >
> > This is an extraordinary way to destroy everything.  Enjoy!
> > Comments and questions are, as always, welcome.
>
> Overall I don't see any red flags, so from that perspective I think
> it's good to merge as is.  The buffer cache interactions are much
> neater this time around.
>
> Reviewed-by: Dave Chinner <dchinner@xxxxxxxxxx>

Thanks!

> The main thing I noticed is that the deferred freeing mechanism for
> bulk reaping will add up to 128 XEFIs to the transaction.  That
> could result in a single EFI with up to 128 extents in it, right?

Welllp... the defer ops code only logs up to 16 extents per EFI log
item due to my, er, butchering of max_items.  So in the end, we log
up to 8 EFI items, each of which has up to 16 extents...

> What happens when we try to free that many extents in a single
> transaction loop?  The extent free processing doesn't have a "have
> we run out of transaction reservation" check in it like the refcount
> item processing does, so I don't think it can roll to renew the
> transaction reservation if it is needed.  Do we need to catch this
> and renew the reservation by returning -EAGAIN from
> xfs_extent_free_finish_item() if there isn't enough of a reservation
> remaining to free an extent?
...and by my estimation, those eight items consume only a fraction of
the reservation available with tr_itruncate:

  16 x xfs_extent_64_t    = 256 bytes
   1 x xfs_efi_log_format =  16 bytes
                            ---------
                            272 bytes per EFI

   8 x EFI                = 2176 bytes

So far, I haven't seen any overflows with the reaping code -- for the
AG btree rebuilders, we end up logging and relogging the same
bnobt/cntbt buffers over and over again, and tr_itruncate gives us
~320K of reservation per transaction.

This might actually become more of a problem if we want to reap an
inode-rooted btree, each block is in a different AG, and there are
(say) more than 128 AGs... but I think the solution to that will be
to xrep_defer_finish between AGs.

--D

> -Dave.
> --
> Dave Chinner
> david@xxxxxxxxxxxxx
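
P.S. To make that last idea a bit more concrete, here is roughly the
shape of a multi-AG reap loop that finishes the deferred frees between
AGs.  This is only a sketch: the function name and the per-AG bitmap
array are invented for illustration, and the xrep_reap_agblocks() /
xrep_defer_finish() signatures are approximations of the repair
helpers, not code from this series.

	/*
	 * Hypothetical multi-AG reap loop: queue EFIs for one AG's worth
	 * of dead blocks, then finish the deferred work before touching
	 * the next AG so that no single transaction carries intents for
	 * more than one AG's blocks.
	 */
	STATIC int
	xrep_reap_all_ags(
		struct xfs_scrub		*sc,
		struct xagb_bitmap		*ag_bitmaps,	/* one per AG */
		const struct xfs_owner_info	*oinfo,
		unsigned int			nr_ags)
	{
		unsigned int			agno;
		int				error;

		for (agno = 0; agno < nr_ags; agno++) {
			/* Schedule deferred frees for this AG's old blocks. */
			error = xrep_reap_agblocks(sc, &ag_bitmaps[agno],
					oinfo, XFS_AG_RESV_NONE);
			if (error)
				return error;

			/* Flush the EFIs before moving on to the next AG. */
			error = xrep_defer_finish(sc);
			if (error)
				return error;
		}

		return 0;
	}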