Re: [PATCH 2/4] xfs: proper replay of deferred ops queued during log recovery

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Mon, Sep 28, 2020 at 03:26:18PM +1000, Dave Chinner wrote:
> On Sun, Sep 27, 2020 at 04:41:20PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@xxxxxxxxxx>
> > 
> > When we replay unfinished intent items that have been recovered from the
> > log, it's possible that the replay will cause the creation of more
> > deferred work items.  As outlined in commit 509955823cc9c ("xfs: log
> > recovery should replay deferred ops in order"), later work items have an
> > implicit ordering dependency on earlier work items.  Therefore, recovery
> > must replay the items (both recovered and created) in the same order
> > that they would have been during normal operation.
> > 
> > For log recovery, we enforce this ordering by using an empty transaction
> > to collect deferred ops that get created in the process of recovering a
> > log intent item to prevent them from being committed before the rest of
> > the recovered intent items.  After we finish committing all the
> > recovered log items, we allocate a transaction with an enormous block
> > reservation, splice our huge list of created deferred ops into that
> > transaction, and commit it, thereby finishing all those ops.
> > 
> > This is /really/ hokey -- it's the one place in XFS where we allow
> > nested transactions; the splicing of the defer ops list is is inelegant
> > and has to be done twice per recovery function; and the broken way we
> > handle inode pointers and block reservations cause subtle use-after-free
> > and allocator problems that will be fixed by this patch and the two
> > patches after it.
> > 
> > Therefore, replace the hokey empty transaction with a structure designed
> > to capture each chain of deferred ops that are created as part of
> > recovering a single unfinished log intent.  Finally, refactor the loop
> > that replays those chains to do so using one transaction per chain.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@xxxxxxxxxx>
> > ---
> >  fs/xfs/libxfs/xfs_defer.c       |   56 +++++++++++++--
> >  fs/xfs/libxfs/xfs_defer.h       |   20 +++++
> >  fs/xfs/libxfs/xfs_log_recover.h |    2 +
> >  fs/xfs/xfs_bmap_item.c          |   16 +---
> >  fs/xfs/xfs_extfree_item.c       |    7 +-
> >  fs/xfs/xfs_log_recover.c        |  150 +++++++++++++++++++++++++--------------
> >  fs/xfs/xfs_refcount_item.c      |   16 +---
> >  fs/xfs/xfs_rmap_item.c          |    7 +-
> >  fs/xfs/xfs_trans.h              |    3 +
> >  9 files changed, 183 insertions(+), 94 deletions(-)
> > 
> > 
> > diff --git a/fs/xfs/libxfs/xfs_defer.c b/fs/xfs/libxfs/xfs_defer.c
> > index 36c103c14bc9..0de7672fe63d 100644
> > --- a/fs/xfs/libxfs/xfs_defer.c
> > +++ b/fs/xfs/libxfs/xfs_defer.c
> > @@ -549,14 +549,56 @@ xfs_defer_move(
> >   *
> >   * Create and log intent items for all the work that we're capturing so that we
> >   * can be assured that the items will get replayed if the system goes down
> > - * before log recovery gets a chance to finish the work it put off.  Then we
> > - * move the chain from stp to dtp.
> > + * before log recovery gets a chance to finish the work it put off.  The entire
> > + * deferred ops state is transferred to the capture structure and the
> > + * transaction is then ready for the caller to commit it.  If there are no
> > + * intent items to capture, this function returns NULL.
> >   */
> > -void
> > +struct xfs_defer_capture *
> >  xfs_defer_capture(
> > -	struct xfs_trans	*dtp,
> > -	struct xfs_trans	*stp)
> > +	struct xfs_trans		*tp)
> 
> "capture" what?
> 
> Perhaps this whole API reads better as:
> 
> xfs_defer_ops_capture()
> xfs_defer_ops_continue()
> xfs_defer_ops_release()

Yes!  Finally a better set of names!  I've been stuck on that for a
while!  I like these names very much!

> because what it is doing is moving deferops from a transaction to a
> capture structure and back again...
> 
> .....
> 
> > diff --git a/fs/xfs/libxfs/xfs_defer.h b/fs/xfs/libxfs/xfs_defer.h
> > index 3164199162b6..bc7493bf4542 100644
> > --- a/fs/xfs/libxfs/xfs_defer.h
> > +++ b/fs/xfs/libxfs/xfs_defer.h
> > @@ -8,6 +8,7 @@
> >  
> >  struct xfs_btree_cur;
> >  struct xfs_defer_op_type;
> > +struct xfs_defer_capture;
> >  
> >  /*
> >   * Header for deferred operation list.
> > @@ -63,10 +64,27 @@ extern const struct xfs_defer_op_type xfs_rmap_update_defer_type;
> >  extern const struct xfs_defer_op_type xfs_extent_free_defer_type;
> >  extern const struct xfs_defer_op_type xfs_agfl_free_defer_type;
> >  
> > +/*
> > + * Deferred operation freezer.  This structure enables a dfops user to detach
> > + * the chain of deferred operations from a transaction so that they can be
> > + * continued later.
> > + */
> 
> "freezer"?
> 
> Stale comment?

Yep.  Fixed.

> .....
> 
> > @@ -531,15 +526,12 @@ xfs_bui_item_recover(
> >  		xfs_bmap_unmap_extent(tp, ip, &irec);
> >  	}
> >  
> > -	xfs_defer_capture(parent_tp, tp);
> > -	error = xfs_trans_commit(tp);
> > +	error = xlog_recover_trans_commit(tp, capture_list);
> >  	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> >  	xfs_irele(ip);
> > -
> >  	return error;
> 
> hmmmm.
> 
> > diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> > index e0675071b39e..107965acc57e 100644
> > --- a/fs/xfs/xfs_log_recover.c
> > +++ b/fs/xfs/xfs_log_recover.c
> > @@ -1755,6 +1755,38 @@ xlog_recover_release_intent(
> >  	spin_unlock(&ailp->ail_lock);
> >  }
> >  
> > +/*
> > + * Capture any deferred ops and commit the transaction.  This is the last step
> > + * needed to finish a log intent item that we recovered from the log, and will
> > + * take care of releasing all the relevant resources.
> 
> What does "take care of releasing all the relevant resources"
> mean? 

Not much.  I'll remove it.

> > + */
> > +int
> > +xlog_recover_trans_commit(
> > +	struct xfs_trans		*tp,
> > +	struct list_head		*capture_list)
> > +{
> > +	struct xfs_mount		*mp = tp->t_mountp;
> > +	struct xfs_defer_capture	*dfc = xfs_defer_capture(tp);
> > +	int				error;
> > +
> > +	/* If we don't capture anything, commit tp and exit. */
> > +	if (!dfc)
> > +		return xfs_trans_commit(tp);
> > +
> > +	/*
> > +	 * Commit the transaction.  If that fails, clean up the defer ops and
> > +	 * the dfc that we just created.  Otherwise, add the dfc to the list.
> > +	 */
> > +	error = xfs_trans_commit(tp);
> > +	if (error) {
> > +		xfs_defer_capture_free(mp, dfc);
> > +		return error;
> > +	}
> > +
> > +	list_add_tail(&dfc->dfc_list, capture_list);
> > +	return 0;
> > +}
> 
> And, really, this is more than a "transaction commit" operation; it
> doesn't have anything recovery specific to it, so if the
> xfs_defer_capture() API is "generic xfs_defer" functionality, why
> isn't this placed next to it and nameed
> xfs_defer_capture_and_commit()?
> 
> > @@ -2431,38 +2463,62 @@ xlog_recover_process_data(
> >  	return 0;
> >  }
> >  
> > +static void
> > +xlog_cancel_defer_ops(
> > +	struct xfs_mount	*mp,
> > +	struct list_head	*capture_list)
> > +{
> > +	struct xfs_defer_capture *dfc, *next;
> > +
> > +	list_for_each_entry_safe(dfc, next, capture_list, dfc_list) {
> > +		list_del_init(&dfc->dfc_list);
> > +		xfs_defer_capture_free(mp, dfc);
> > +	}
> > +}
> 
> Same - there is nothing log recovery specific here.

Ok, I'll move these two functions to xfs_defer.c.

> >  /* Take all the collected deferred ops and finish them in order. */
> >  static int
> >  xlog_finish_defer_ops(
> > -	struct xfs_trans	*parent_tp)
> > +	struct xfs_mount	*mp,
> > +	struct list_head	*capture_list)
> >  {
> > -	struct xfs_mount	*mp = parent_tp->t_mountp;
> > +	struct xfs_defer_capture *dfc, *next;
> >  	struct xfs_trans	*tp;
> >  	int64_t			freeblks;
> > -	uint			resblks;
> > -	int			error;
> > +	uint64_t		resblks;
> > +	int			error = 0;
> >  
> > -	/*
> > -	 * We're finishing the defer_ops that accumulated as a result of
> > -	 * recovering unfinished intent items during log recovery.  We
> > -	 * reserve an itruncate transaction because it is the largest
> > -	 * permanent transaction type.  Since we're the only user of the fs
> > -	 * right now, take 93% (15/16) of the available free blocks.  Use
> > -	 * weird math to avoid a 64-bit division.
> > -	 */
> > -	freeblks = percpu_counter_sum(&mp->m_fdblocks);
> > -	if (freeblks <= 0)
> > -		return -ENOSPC;
> > -	resblks = min_t(int64_t, UINT_MAX, freeblks);
> > -	resblks = (resblks * 15) >> 4;
> > -	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, resblks,
> > -			0, XFS_TRANS_RESERVE, &tp);
> > -	if (error)
> > -		return error;
> > -	/* transfer all collected dfops to this transaction */
> > -	xfs_defer_move(tp, parent_tp);
> > +	list_for_each_entry_safe(dfc, next, capture_list, dfc_list) {
> > +		/*
> > +		 * We're finishing the defer_ops that accumulated as a result
> > +		 * of recovering unfinished intent items during log recovery.
> > +		 * We reserve an itruncate transaction because it is the
> > +		 * largest permanent transaction type.  Since we're the only
> > +		 * user of the fs right now, take 93% (15/16) of the available
> > +		 * free blocks.  Use weird math to avoid a 64-bit division.
> > +		 */
> > +		freeblks = percpu_counter_sum(&mp->m_fdblocks);
> > +		if (freeblks <= 0)
> > +			return -ENOSPC;
> >  
> > -	return xfs_trans_commit(tp);
> > +		resblks = min_t(uint64_t, UINT_MAX, freeblks);
> > +		resblks = (resblks * 15) >> 4;
> > +		error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, resblks,
> > +				0, XFS_TRANS_RESERVE, &tp);
> > +		if (error)
> > +			return error;
> > +
> > +		/* transfer all collected dfops to this transaction */
> > +		list_del_init(&dfc->dfc_list);
> > +		xfs_defer_continue(dfc, tp);
> > +
> > +		error = xfs_trans_commit(tp);
> > +		xfs_defer_capture_free(mp, dfc);
> 
> Why does this need to call xfs_defer_cancel_list() here? Shouldn't

Um, this function doesn't cancel the list at all.  Are you asking why it
is that "reattach captured dfops to transaction" and "free the capture
structure" are separate steps?

> dfc->dfc_dfops already be empty here? And if it isn't, shouldn't
> that throw an error rather than silently cancle work that hasn't be
> done that should have been?

Yes, dfc_dfops should be empty at this point.  In the BUI recovery UAF
patch, this xfs_defer_capture_free call is used to unlock and release
the captured inode(s) after the transaction commits.

An alternate way to design this would have been that xfs_defer_continue
totally consumes the *dfc (and frees it) as part of reattaching the
dfops list to the transaction; and in the later patch that adds inodes
to the capture structure, we'd pass out any inode pointers and
let xlog_finish_defer_ops unlock and release it directly.

Hmm, maybe that /would/ be more clear than this setup...

> 
> >  #endif
> > -	while (lip != NULL) {
> > +	for (lip = xfs_trans_ail_cursor_first(ailp, &cur, 0);
> > +	     lip != NULL;
> > +	     lip = xfs_trans_ail_cursor_next(ailp, &cur)) {
> >  		/*
> >  		 * We're done when we see something other than an intent.
> >  		 * There should be no intents left in the AIL now.
> > @@ -2533,28 +2577,28 @@ xlog_recover_process_intents(
> >  		 */
> >  		ASSERT(XFS_LSN_CMP(last_lsn, lip->li_lsn) >= 0);
> >  
> > +		if (test_and_set_bit(XFS_LI_RECOVERED, &lip->li_flags))
> > +			continue;
> > +
> 
> Why do we still need XFS_LI_RECOVERED here? This log item is going to get
> removed from the AIL by the committing of the first transaction
> in the ->iop_recover() sequence we are running, so we'll never find
> it again in the AIL. Nothing else checks for XFS_LI_RECOVERED
> anymore, so this seems unnecessary now...

<nod> I've suspected for a while that this wasn't necessary...

> > -out:
> > +
> >  	xfs_trans_ail_cursor_done(&cur);
> >  	spin_unlock(&ailp->ail_lock);
> >  	if (!error)
> > -		error = xlog_finish_defer_ops(parent_tp);
> > -	xfs_trans_cancel(parent_tp);
> > +		error = xlog_finish_defer_ops(log->l_mp, &capture_list);
> >  
> > +	xlog_cancel_defer_ops(log->l_mp, &capture_list);
> >  	return error;
> >  }
> 
> Again, why are we cancelling the capture list if we just
> successfully processed the defer ops on the capture list?

Overly compact code, I suppose.  If xlog_finish_defer_ops succeeds, the
call is a nop since the list is empty; but if there was an error, then
we still need to walk the list and free the remaining capture
structures.

This probably would have been clearer had I ended the function like
this:

	xfs_trans_ail_cursor_done(&cur);
	spin_unlock(&ailp->ail_lock);
	if (error)
		goto err;

	error = xlog_finish_defer_ops(log->l_mp, &capture_list);
	if (error)
		goto err;

	return 0;
err:
	xfs_defer_ops_release_all(log->l_mp, &capture_list);
	return error;
}

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@xxxxxxxxxxxxx



[Index of Archives]     [XFS Filesystem Development (older mail)]     [Linux Filesystem Development]     [Linux Audio Users]     [Yosemite Trails]     [Linux Kernel]     [Linux RAID]     [Linux SCSI]


  Powered by Linux