Re: [PATCH RESEND v2 01/18] xfs: Fix multi-transaction larp replay

Dave Chinner <david@xxxxxxxxxxxxx> · Wed, 10 Aug 2022 11:58:09 +1000

On Tue, Aug 09, 2022 at 09:52:55AM -0700, Darrick J. Wong wrote:
> On Thu, Aug 04, 2022 at 12:39:56PM -0700, Allison Henderson wrote:
> > Recent parent pointer testing has exposed a bug in the underlying
> > attr replay.  A multi transaction replay currently performs a
> > single step of the replay, then defers the rest if there is more
> > to do.

Yup.

> > This causes race conditions with other attr replays that
> > might be recovered before the remaining deferred work has had a
> > chance to finish.

What other attr replays are we racing against?  There can only be
one incomplete attr item intent/done chain per inode present in log
recovery, right?

> > This can lead to interleaved set and remove
> > operations that may clobber the attribute fork.  Fix this by
> > deferring all work for any attribute operation.

Which means this should be an impossible situation.

That is, if we crash before the final attrd DONE intent is written
to the log, it means that new attr intents for modifications made
*after* the current attr modification was completed will not be
present in the log. We have strict ordering of committed operations
in the journal, hence an operation on an inode that has an incomplete
intent *must* be the last operation, and that intent *must* be the
*only* incomplete intent found in the journal for that inode.

Hence from an operational ordering perspective, this explanation
for the issue being seen doesn't make any sense to me.  If there are
multiple incomplete attri intents then we've either got a runtime
journalling problem (a white-out issue? failing to relog the inode
in each new intent?) or a log recovery problem (failing to match
intent-done pairs correctly?), not a recovery deferral issue.

Hence I think we're still looking for the root cause of this
problem...
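
To make the ordering argument concrete, here is a rough userspace
sketch of the invariant I'd expect log recovery to see. Everything in
it is made up for illustration (toy_log_rec, intent_id and the toy_*
helpers are not kernel structures or functions), and it ignores
relogging entirely:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified journal record; not the on-disk XFS format. */
enum toy_rec_type { TOY_ATTRI, TOY_ATTRD };

struct toy_log_rec {
	enum toy_rec_type	type;
	uint64_t		ino;		/* inode the attr op targets */
	uint64_t		intent_id;	/* ties an ATTRD to its ATTRI */
};

/* True if the intent at index i has no matching done item later in the log. */
static bool toy_intent_incomplete(const struct toy_log_rec *log, size_t n,
				  size_t i)
{
	for (size_t j = i + 1; j < n; j++) {
		if (log[j].type == TOY_ATTRD &&
		    log[j].intent_id == log[i].intent_id)
			return false;
	}
	return true;
}

/* Count the incomplete attr intents recorded for one inode. */
size_t toy_count_incomplete(const struct toy_log_rec *log, size_t n,
			    uint64_t ino)
{
	size_t count = 0;

	for (size_t i = 0; i < n; i++) {
		if (log[i].type == TOY_ATTRI && log[i].ino == ino &&
		    toy_intent_incomplete(log, n, i))
			count++;
	}
	return count;
}

/*
 * Check the assumed invariant: an incomplete intent must be the last
 * record in the log for its inode. A second incomplete intent for the
 * same inode necessarily shows up as a later record, so it fails too.
 */
bool toy_ordering_ok(const struct toy_log_rec *log, size_t n, uint64_t ino)
{
	for (size_t i = 0; i < n; i++) {
		if (log[i].type != TOY_ATTRI || log[i].ino != ino ||
		    !toy_intent_incomplete(log, n, i))
			continue;
		for (size_t j = i + 1; j < n; j++) {
			if (log[j].ino == ino)
				return false;
		}
	}
	return true;
}
```

If a recovered log ever fails a check like toy_ordering_ok(), the
deferral in recovery is not where I'd be looking.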

> > Signed-off-by: Allison Henderson <allison.henderson@xxxxxxxxxx>
> > ---
> >  fs/xfs/xfs_attr_item.c | 35 ++++++++---------------------------
> >  1 file changed, 8 insertions(+), 27 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_attr_item.c b/fs/xfs/xfs_attr_item.c
> > index 5077a7ad5646..c13d724a3e13 100644
> > --- a/fs/xfs/xfs_attr_item.c
> > +++ b/fs/xfs/xfs_attr_item.c
> > @@ -635,52 +635,33 @@ xfs_attri_item_recover(
> >  		break;
> >  	case XFS_ATTRI_OP_FLAGS_REMOVE:
> >  		if (!xfs_inode_hasattr(args->dp))
> > -			goto out;
> > +			return 0;
> >  		attr->xattri_dela_state = xfs_attr_init_remove_state(args);
> >  		break;
> >  	default:
> >  		ASSERT(0);
> > -		error = -EFSCORRUPTED;
> > -		goto out;
> > +		return -EFSCORRUPTED;
> >  	}
> >  
> >  	xfs_init_attr_trans(args, &tres, &total);
> >  	error = xfs_trans_alloc(mp, &tres, total, 0, XFS_TRANS_RESERVE, &tp);
> >  	if (error)
> > -		goto out;
> > +		return error;
> >  
> >  	args->trans = tp;
> >  	done_item = xfs_trans_get_attrd(tp, attrip);
> > +	args->trans->t_flags |= XFS_TRANS_HAS_INTENT_DONE;
> > +	set_bit(XFS_LI_DIRTY, &done_item->attrd_item.li_flags);
> >  
> >  	xfs_ilock(ip, XFS_ILOCK_EXCL);
> >  	xfs_trans_ijoin(tp, ip, 0);
> >  
> > -	error = xfs_xattri_finish_update(attr, done_item);
> > -	if (error == -EAGAIN) {
> > -		/*
> > -		 * There's more work to do, so add the intent item to this
> > -		 * transaction so that we can continue it later.
> > -		 */
> > -		xfs_defer_add(tp, XFS_DEFER_OPS_TYPE_ATTR, &attr->xattri_list);
> > -		error = xfs_defer_ops_capture_and_commit(tp, capture_list);
> > -		if (error)
> > -			goto out_unlock;
> > -
> > -		xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > -		xfs_irele(ip);
> > -		return 0;
> > -	}
> > -	if (error) {
> > -		xfs_trans_cancel(tp);
> > -		goto out_unlock;
> > -	}
> > -
> > +	xfs_defer_add(tp, XFS_DEFER_OPS_TYPE_ATTR, &attr->xattri_list);
> 
> This seems a little convoluted to me.  Maybe?  Maybe not?
> 
> 1. Log recovery recreates an incore xfs_attri_log_item from what it
> finds in the log.
> 
> 2. This function then logs an xattrd for the recovered xattri item.
> 
> 3. Then it creates a new xfs_attr_intent to complete the operation.
> 
> 4. Finally, it calls xfs_defer_ops_capture_and_commit, which logs a new
> xattri for the intent created in step 3 and also commits the xattrd for
> the first xattri.
> 
> IOWs, the only difference between before and after is that we're not
> advancing one more step through the state machine as part of log
> recovery.  From the perspective of the log, the recovery function merely
> replaces the recovered xattri log item with a new one.
> 
> Why can't we just attach the recovered xattri to the xfs_defer_pending
> that is created to point to the xfs_attr_intent that's created in step
> 3, and skip the xattrd?

Remember that attribute intents are different to all other intent
types that we have. The existing extent based intents define a
single independent operation that needs to be performed, and each
step of the intent chain is completely independent of the previous
step in the chain.  e.g. removing the extent from the rmap btree is
completely independent of removing it from the inode bmap btree -
all that matters is that the removal from the bmbt happens first.
The rmapbt removal can happen at any time after that, and is
completely independent of any other bmbt or rmapbt operation.
Similarly, the EFI can be processed independently of all bmapbt and
rmapbt modifications, it just has to happen after those
modifications are done.

Hence if we crash during recovery, we can just restart from
wherever we got to in the middle of the intent chains and not have
to care at all.  IOWs, eventual consistency works with these chains
because there are no dependencies between each step of the intent
chain and each step is completely independent of the other steps.
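
As a very loose illustration of why restarting mid-chain is harmless
for the extent based intents, here is a toy model. The names
(toy_chain, TOY_NSTEPS and the toy_chain_* helpers) are invented for
this mail and it glosses over how done items and follow-on intents are
actually logged together:

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_NSTEPS 4	/* stand-ins for the BUI -> RUI -> CUI -> EFI steps */

/* Hypothetical chain state: done[i] is set once step i has completed. */
struct toy_chain {
	bool	done[TOY_NSTEPS];
};

/* Find where to resume: the first step that has not completed. */
size_t toy_chain_resume_point(const struct toy_chain *c)
{
	for (size_t i = 0; i < TOY_NSTEPS; i++) {
		if (!c->done[i])
			return i;
	}
	return TOY_NSTEPS;
}

/*
 * Run whatever steps remain, in order. No step needs to inspect what an
 * earlier step left behind, so recovery can pick up at any point.
 * Returns the number of steps that were run.
 */
size_t toy_chain_recover(struct toy_chain *c)
{
	size_t ran = 0;

	for (size_t i = toy_chain_resume_point(c); i < TOY_NSTEPS; i++) {
		c->done[i] = true;
		ran++;
	}
	return ran;
}
```

Crashing again part way through toy_chain_recover() and rerunning it
just resumes from the next undone step, which is the whole point.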

Attribute intent chains are completely different. They link steps in
a state machine together in a non-trivial, highly dependent chain.
We can't just restart the chain in the middle like we can for the
BUI->RUI->CUI->EFI chain because the on-disk attribute is in an
unknown state and recovering that exact state is .... complex.

Hence the first step of recovery is to return the attribute we
are trying to modify back to a known state. That means we have to
perform a removal of any existing attribute under that name first.
Hence this first step should be replacing the existing attr intent
with the intent that defines the recovery operation we are going to
perform.

That means we need to translate set to replace so that cleanup is
run first. Replace needs to clean up the attr under that name
regardless of whether it has the incomplete bit set on it or not.
Remove is the only operation that runs the same as at runtime, as
cleanup for remove is just repeating the remove operation from
scratch.
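
Roughly, the translation I have in mind looks like the sketch below.
The enum and function names are hypothetical and only model the
mapping; they are not the kernel's XFS_ATTRI_OP_FLAGS_* values or an
existing helper:

```c
/* Hypothetical op codes used only for this illustration. */
enum toy_attr_op {
	TOY_ATTR_SET,
	TOY_ATTR_REPLACE,
	TOY_ATTR_REMOVE,
};

/* Map the operation found in the log to the operation recovery runs. */
enum toy_attr_op toy_recovery_op(enum toy_attr_op logged)
{
	switch (logged) {
	case TOY_ATTR_SET:
		/*
		 * The on-disk attr is in an unknown state, so clean up
		 * anything already under that name before setting it.
		 */
		return TOY_ATTR_REPLACE;
	case TOY_ATTR_REPLACE:
		/* Cleanup runs whether or not the incomplete bit is set. */
		return TOY_ATTR_REPLACE;
	case TOY_ATTR_REMOVE:
		/* Cleanup is just repeating the remove from scratch. */
		return TOY_ATTR_REMOVE;
	}
	return logged;
}
```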

> I /think/ the answer to that question is that we might need to move the
> log tail forward to free enough log space to finish the intent items, so
> creating the extra xattrd/xattri (a) avoids the complexity of submitting
> an incore intent item *and* a log intent item to the defer ops
> machinery; and (b) avoids livelocks in log recovery.  Therefore, we
> actually need to do it this way.

We really need the initial operation to rewrite the intent to match
the recovery operation we are going to perform. Everything else is
secondary.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx