Re: [PATCH 3/6] xfs: log EFIs for all btree blocks being used to stage a btree

"Darrick J. Wong" <djwong@xxxxxxxxxx> · Fri, 8 Sep 2023 16:34:47 -0700

On Mon, Aug 07, 2023 at 06:41:39PM +1000, Dave Chinner wrote:
> On Thu, Jul 27, 2023 at 03:24:32PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@xxxxxxxxxx>
> > 
> > We need to log EFIs for every extent that we allocate for the purpose of
> > staging a new btree so that if we fail then the blocks will be freed
> > during log recovery.  Add a function to relog the EFIs, so that repair
> > can relog them all every time it creates a new btree block, which will
> > help us to avoid pinning the log tail.
> > 
> > Signed-off-by: Darrick J. Wong <djwong@xxxxxxxxxx>
> .....
> > +/*
> > + * Set up automatic reaping of the blocks reserved for btree reconstruction in
> > + * case we crash by logging a deferred free item for each extent we allocate so
> > + * that we can get all of the space back if we crash before we can commit the
> > + * new btree.  This function returns a token that can be used to cancel
> > + * automatic reaping if repair is successful.
> > + */
> > +static int
> > +xrep_newbt_schedule_autoreap(
> > +	struct xrep_newbt		*xnr,
> > +	struct xrep_newbt_resv		*resv)
> > +{
> > +	struct xfs_extent_free_item	efi_item = {
> > +		.xefi_blockcount	= resv->len,
> > +		.xefi_owner		= xnr->oinfo.oi_owner,
> > +		.xefi_flags		= XFS_EFI_SKIP_DISCARD,
> > +		.xefi_pag		= resv->pag,
> > +	};
> > +	struct xfs_scrub		*sc = xnr->sc;
> > +	struct xfs_log_item		*lip;
> > +	LIST_HEAD(items);
> > +
> > +	ASSERT(xnr->oinfo.oi_offset == 0);
> > +
> > +	efi_item.xefi_startblock = XFS_AGB_TO_FSB(sc->mp, resv->pag->pag_agno,
> > +			resv->agbno);
> > +	if (xnr->oinfo.oi_flags & XFS_OWNER_INFO_ATTR_FORK)
> > +		efi_item.xefi_flags |= XFS_EFI_ATTR_FORK;
> > +	if (xnr->oinfo.oi_flags & XFS_OWNER_INFO_BMBT_BLOCK)
> > +		efi_item.xefi_flags |= XFS_EFI_BMBT_BLOCK;
> > +
> > +	INIT_LIST_HEAD(&efi_item.xefi_list);
> > +	list_add(&efi_item.xefi_list, &items);
> > +
> > +	xfs_perag_intent_hold(resv->pag);
> > +	lip = xfs_extent_free_defer_type.create_intent(sc->tp, &items, 1,
> > +			false);
> 
> Hmmmm.
> 
> That triggered flashing lights and sirens - I'm not sure I really
> like the usage of the defer type arrays like this, nor the
> duplication of the defer mechanisms for relogging, etc.
> 
> Not that I have a better idea right now - is this the final form of
> this code, or is more stuff built on top of it or around it?

Soooo.  Now that another month has passed, I've had the time to think
about this topic some more.

I've flat-out rejected my own suggestion to leave the ondisk bnobt
unchanged and stash the reservations in the extent busy tree, because
the resulting IO sequence ends up being:

<read everything we need to synthesize records>
<write btree blocks to disk>

T0: commit btree root

<remove first extent we used for the btree from busy list>

T1: remove first extent from bnobt
T2: rmapbt updates for the first extent

<remove second extent we used for the btree from busy list>

T3: remove second extent from bnobt
T4: rmapbt updates for the second extent
...

Because we cannot risk having the system go down before we finish
writing all these transactions.  One could invent a new log intent items
that capture promises to allocate blocks, but that only reduces the
problem to:

T0: commit btree root
intent item for first extent
intent item for second extent
...
run out of transaction reservation

T1: more intent items

Same problem, just harder to hit.  So.  I'll requeue the "hide the space
in the extent busy tree" for some day when writing to a hole becomes hot
enough to make this important again.

---------------

However, I thought about what newbt.c really wants to do with EFIs.
A transaction removes each extent from the bnobt (until we have enough
blocks to write out the new btree) and logs an EFI to free that space.

If the new tree write succeeds and we reserved exactly enough blocks,
then all we want to do is:

1) Log the EFDs to the same transaction where we commit the new btree
   root.  We don't free the space.

If the new tree write succeeds and we reserved too many blocks, then we
want the btree root commit transaction to:

1) same as above

2) For eextents that we only wrote partially, we want to log an EFD for
   the existing EFI at the same time that we log a new EFI to free
   whatever we didn't use.

3) For the extents that we completely didn't use, we want to log EFDs
   and free the space.

If the new tree write fails, then we want the last transaction in the
chain to:

3) same as above

Also note that we want to relog EFIs if we need to move the log tail
forward.  Right now I copy and paste the xfs_defer_relog code into
newbt.c, which is awful.

We could actually use the regular defer ops extent freeing mechanism for
case (3) above, since that's already how xfs_extent_free_later works.

For case (1) and (2), we want a slight variation on deferred extent
freeing.  The EFI log item would still get created, but now we want to
mark a deferred work item to be set aside when xfs_defer_finish is
trying to call ->finish_item on tp->t_dfops.

When the btree write finishes, we mark the deferred extent free work
item as stale and remove the mark we put on that item in the previous
step.  This enables the deferred extent free item to go through the
usual xfs_defer_finish state machine so that an EFD gets created.  The
only difference is that now xfs_trans_free_extent doesn't call
__xfs_free_extent.

This also means that each _claim_block function can now call
xrep_defer_finish() to relog the EFIs between each btree block write.

----------------

As for the code that reaps old ondisk structures -- I created a new
dfops type called "barrier" so that the reap code never writes out an
EFI with more than two extents per log item.

--D

> -Dave.
> -- 
> Dave Chinner
> david@xxxxxxxxxxxxx