Re: [PATCHSET v26.0 0/9] xfs: fix online repair block reaping

Dave Chinner <david@xxxxxxxxxxxxx> · Tue, 8 Aug 2023 15:17:21 +1000

On Mon, Aug 07, 2023 at 05:40:07PM -0700, Darrick J. Wong wrote:
> On Mon, Aug 07, 2023 at 04:19:11PM +1000, Dave Chinner wrote:
> > On Thu, Jul 27, 2023 at 03:18:32PM -0700, Darrick J. Wong wrote:
> > > Hi all,
> > > 
> > > These patches fix a few problems that I noticed in the code that deals
> > > with old btree blocks after a successful repair.
> > > 
> > > First, I observed that it is possible for repair to incorrectly
> > > invalidate and delete old btree blocks if they were crosslinked.  The
> > > solution here is to consult the reverse mappings for each block in the
> > > extent -- singly owned blocks are invalidated and freed, whereas for
> > > crosslinked blocks, we merely drop the incorrect reverse mapping.
> > > 
> > > A largeish change in this patchset is moving the reaping code to a
> > > separate file, because the code is mostly interrelated static
> > > functions.  For now this also drops the ability to reap file blocks,
> > > which will return when we add the bmbt repair functions.
> > > 
> > > Second, we convert the reap function to use EFIs so that we can commit
> > > to freeing as many blocks in as few transactions as we dare.  We would
> > > like to free as many old blocks as we can in the same transaction that
> > > commits the new structure to the ondisk filesystem to minimize the
> > > number of blocks that leak if the system crashes before the repair fully
> > > completes.
> > > 
> > > The third change made in this series is to avoid tripping buffer cache
> > > assertions if we're merely scanning the buffer cache for buffers to
> > > invalidate, and find a non-stale buffer of the wrong length.  This is
> > > primarily cosmetic, but makes my life easier.
> > > 
> > > The fourth change restructures the reaping code to try to process as many
> > > blocks in one go as possible, to reduce logging traffic.
> > > 
> > > The last change switches the reaping mechanism to use per-AG bitmaps
> > > defined in a previous patchset.  This should reduce type confusion when
> > > reading the source code.
> > > 
> > > If you're going to start using this mess, you probably ought to just
> > > pull from my git trees, which are linked below.
> > > 
> > > This is an extraordinary way to destroy everything.  Enjoy!
> > > Comments and questions are, as always, welcome.
> > 
> > Overall I don't see any red flags, so from that perspective I think
> > it's good to merge as is. The buffer cache interactions are much
> > neater this time around.
> > 
> > Reviewed-by: Dave Chinner <dchinner@xxxxxxxxxx>
> 
> Thanks!
> 
> > The main thing I noticed is that the deferred freeing mechanism for
> > bulk reaping will add up to 128 XEFIs to the transaction. That
> > could result in a single EFI with up to 128 extents in it, right?
> 
> Welllp... the defer ops code only logs up to 16 extents per EFI log item
> due to my, er, butchering of max_items.  So in the end, we log up to 8
> EFI items, each of which has up to 16 extents...
> 
> > What happens when we try to free that many extents in a single
> > transaction loop? The extent free processing doesn't have a "have we
> > run out of transaction reservation" check in it like the refcount
> > item processing does, so I don't think it can roll to renew the
> > transaction reservation if it is needed. Do we need to catch this
> > and renew the reservation by returning -EAGAIN from
> > xfs_extent_free_finish_item() if there isn't enough of a reservation
> > remaining to free an extent?
> 
> ...and by my estimation, those eight items consume a fraction of the
> reservation available with tr_itruncate:
> 
> 16 x xfs_extent_64_t   = 256 bytes
> 1 x xfs_efi_log_format = 16 bytes
>                        = 272 bytes per EFI
> 
> 8 x EFI                = 2176 bytes
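
The size estimate quoted above can be checked with a small userspace
sketch. The two structs below are hypothetical mirrors of the EFI log
format (a fixed header followed by 64-bit extent records); the field
names and layout are assumptions for illustration, not copied from the
kernel headers. With a 16-byte header and 16-byte extent records, the
per-EFI and per-transaction totals come out to the 272 and 2176 bytes
given in the quoted estimate.

```c
#include <stdint.h>

/* Hypothetical mirror of one 64-bit extent record in an EFI (assumed layout). */
struct efi_extent64 {
	uint64_t ext_start;	/* start block of the extent */
	uint32_t ext_len;	/* length in blocks */
	uint32_t ext_pad;	/* padding to keep the record 8-byte aligned */
};

/* Hypothetical mirror of the fixed EFI header that precedes the extents. */
struct efi_log_header {
	uint16_t efi_type;	/* log item type */
	uint16_t efi_size;	/* size of this item */
	uint32_t efi_nextents;	/* number of extent records that follow */
	uint64_t efi_id;	/* id used to match the EFI with its EFD */
};

enum {
	EXTENTS_PER_EFI	= 16,	/* per-EFI cap mentioned in the thread */
	EFIS_PER_TRANS	= 8,	/* 128 deferred frees / 16 per EFI */

	/* One EFI: header plus its extent records (272 bytes). */
	EFI_BYTES	= sizeof(struct efi_log_header) +
			  EXTENTS_PER_EFI * sizeof(struct efi_extent64),

	/* All EFIs logged for one batch of 128 frees (2176 bytes). */
	TOTAL_EFI_BYTES	= EFIS_PER_TRANS * EFI_BYTES,
};
```

This only accounts for the intent items themselves, not for the btree
buffers logged when the extents are actually freed.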

I'm not worried by the EFIs themselves when they are created and
committed, it's the processing of the XEFIs which are all done in a
single transaction unless a ->finish_item() call returns -EAGAIN.
i.e. it's the xfs_trans_free_extent() calls that are done one after
another, and potentially log different AG metadata blocks on each
extent free operation....

And it's not just runtime we have to worry about - if we crash and
have to recover one of these EFIs with 16 extents in it, we have the
problem of processing a 16 extent EFI on a single transaction
reservation, right?
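
To make the -EAGAIN idea concrete, here is a minimal userspace sketch
of the control flow, not kernel code. Everything in it is hypothetical:
struct sim_trans, sim_finish_item() and sim_finish_all() are stand-ins
invented for illustration, and "cost" and "worst_case" are made-up byte
counts rather than real reservation calculations. It only shows the
shape of "check the remaining reservation before each extent free, and
roll if the worst case will not fit".

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a transaction: tracks log reservation only. */
struct sim_trans {
	unsigned int resv_total;	/* bytes granted on every roll */
	unsigned int resv_left;		/* bytes not yet consumed */
	unsigned int rolls;		/* number of reservation renewals */
};

/* Renew the reservation, as rolling the transaction would. */
static void sim_trans_roll(struct sim_trans *tp)
{
	tp->resv_left = tp->resv_total;
	tp->rolls++;
}

/*
 * Free one extent.  worst_case is the assumed upper bound on what a
 * single free can log; cost is what this particular free logs.
 * Returns -EAGAIN, without doing any work, if the worst case no longer
 * fits in what is left of the reservation.
 */
static int sim_finish_item(struct sim_trans *tp, unsigned int cost,
			   unsigned int worst_case)
{
	assert(cost <= worst_case);
	if (tp->resv_left < worst_case)
		return -EAGAIN;
	tp->resv_left -= cost;
	return 0;
}

/* Process every pending free, rolling whenever an item asks for it. */
static int sim_finish_all(struct sim_trans *tp, const unsigned int *costs,
			  size_t nr, unsigned int worst_case)
{
	size_t i = 0;

	/* A fresh reservation must cover at least one worst-case free. */
	if (tp->resv_total < worst_case)
		return -ENOSPC;

	while (i < nr) {
		int error = sim_finish_item(tp, costs[i], worst_case);

		if (error == -EAGAIN) {
			sim_trans_roll(tp);	/* renew, then retry same item */
			continue;
		}
		if (error)
			return error;
		i++;
	}
	return 0;
}
```

For example, with a 1000-byte reservation and four frees that each cost
(and at worst cost) 400 bytes, the third free would not fit, so the
loop rolls once and finishes with 200 bytes left. The same check would
apply when replaying a multi-extent EFI during log recovery.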

> So far, I haven't seen any overflows with the reaping code -- for the AG
> btree rebuilders, we end up logging and relogging the same bnobt/cntbt
> buffers over and over again.  tr_itruncate gives us ~320K per transaction,
> and I haven't seen any overflows yet.

I suspect it might be different with aged filesystems where the
extents being freed could be spread across many, many btree leaf
nodes...

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx