Re: [PATCH] xfs: Fix double unlock in defer capture code

On Thu, Nov 04, 2021 at 11:16:33AM +1100, Dave Chinner wrote:
> On Wed, Nov 03, 2021 at 02:33:09PM -0700, Allison Henderson wrote:
> > The new deferred attr patch set uncovered a double unlock in the
> > recent port of the defer ops capture and continue code.  During log
> > recovery, we're allowed to hold buffers to a transaction that's being
> > used to replay an intent item.  When we capture the resources as part
> > of scheduling a continuation of an intent chain, we call xfs_buf_hold
> > to retain our reference to the buffer beyond the transaction commit,
> > but we do /not/ call xfs_trans_bhold to maintain the buffer lock.
> > This means that xfs_defer_ops_continue needs to relock the buffers
> > before xfs_defer_restore_resources joins them to the new transaction.
> > 
> > Additionally, the buffers should not be passed back via the dres
> > structure since they need to remain locked unlike the inodes.  So
> > simply set dr_bufs to zero after populating the dres structure.
> > 
> > Signed-off-by: Darrick J. Wong <djwong@xxxxxxxxxx>
> > Signed-off-by: Allison Henderson <allison.henderson@xxxxxxxxxx>
> > ---
> >  fs/xfs/libxfs/xfs_defer.c | 40 ++++++++++++++++++++++++++++++++++++++-
> >  1 file changed, 39 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/xfs/libxfs/xfs_defer.c b/fs/xfs/libxfs/xfs_defer.c
> > index 0805ade2d300..734ac9fd2628 100644
> > --- a/fs/xfs/libxfs/xfs_defer.c
> > +++ b/fs/xfs/libxfs/xfs_defer.c
> > @@ -22,6 +22,7 @@
> >  #include "xfs_refcount.h"
> >  #include "xfs_bmap.h"
> >  #include "xfs_alloc.h"
> > +#include "xfs_buf.h"
> >  
> >  static struct kmem_cache	*xfs_defer_pending_cache;
> >  
> > @@ -762,6 +763,33 @@ xfs_defer_ops_capture_and_commit(
> >  	return 0;
> >  }
> >  
> > +static void
> > +xfs_defer_relock_buffers(
> > +	struct xfs_defer_capture	*dfc)
> > +{
> > +	struct xfs_defer_resources	*dres = &dfc->dfc_held;
> > +	unsigned int			i, j;
> > +
> > +	/*
> > +	 * Sort the elements via bubble sort.  (Remember, there are at most 2
> > +	 * elements to sort, so this is adequate.)
> > +	 */
> 
> Seems like overkill if we only have two buffers that can be held.
> All we need if there are only two buffers is a swap() call.
> 
> However, locking arbitrary buffers based on disk address order is
> also theoretically incorrect.
> 
> For example, if the two buffers we have held are the AGF and AGI buffers
> for a given AG, then this will lock the AGF before the AGI. However,
> the lock order for AGI vs AGF is AGI first, hence we'd be locking
> these buffers in the wrong order here. Another example is that btree
> buffers are generally locked in parent->child order or
> sibling->sibling order, not disk offset order.
> 
> Hence I'm wondering if this generalisation is a safe method of
> locking buffers.
> 
> In general, the first locked and joined buffer in a transaction is
> always the first that should be locked. Hence I suspect we need to
> ensure that the dres->dr_bp array always reflects the order in which
> buffers were joined to a transaction so that we can simply lock them
> in ascending array index order and not need to care what the
> relationship between the buffers is...

/me agrees with that; I think you ought to be able to skip the sort
entirely because the dr_bp array is loaded in order of the transaction
items, which means that we'd be locking them in the same order as the
transaction.
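
For illustration only, an untested sketch of the helper with the sort dropped,
assuming dr_bp[] already reflects the order in which the buffers were joined
to the original transaction, might look like:

static void
xfs_defer_relock_buffers(
	struct xfs_defer_capture	*dfc)
{
	struct xfs_defer_resources	*dres = &dfc->dfc_held;
	unsigned int			i;

	/*
	 * dr_bp[] preserves the order in which the buffers were joined to
	 * the original transaction, so locking in ascending index order
	 * reproduces the original locking order without any sorting.
	 */
	for (i = 0; i < dres->dr_bufs; i++)
		xfs_buf_lock(dres->dr_bp[i]);
}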

> > +	for (i = 0; i < dres->dr_bufs; i++) {
> > +		for (j = 1; j < dres->dr_bufs; j++) {
> > +			if (xfs_buf_daddr(dres->dr_bp[j]) <
> > +				xfs_buf_daddr(dres->dr_bp[j - 1])) {
> > +				struct xfs_buf  *temp = dres->dr_bp[j];
> > +
> > +				dres->dr_bp[j] = dres->dr_bp[j - 1];
> > +				dres->dr_bp[j - 1] = temp;
> > +			}
> > +		}
> > +	}
> > +
> > +	for (i = 0; i < dres->dr_bufs; i++)
> > +		xfs_buf_lock(dres->dr_bp[i]);
> > +}
> > +
> >  /*
> >   * Attach a chain of captured deferred ops to a new transaction and free the
> >   * capture structure.  If an inode was captured, it will be passed back to the
> > @@ -777,15 +805,25 @@ xfs_defer_ops_continue(
> >  	ASSERT(tp->t_flags & XFS_TRANS_PERM_LOG_RES);
> >  	ASSERT(!(tp->t_flags & XFS_TRANS_DIRTY));
> >  
> > -	/* Lock and join the captured inode to the new transaction. */
> > +	/* Lock the captured resources to the new transaction. */
> >  	if (dfc->dfc_held.dr_inos == 2)
> >  		xfs_lock_two_inodes(dfc->dfc_held.dr_ip[0], XFS_ILOCK_EXCL,
> >  				    dfc->dfc_held.dr_ip[1], XFS_ILOCK_EXCL);
> >  	else if (dfc->dfc_held.dr_inos == 1)
> >  		xfs_ilock(dfc->dfc_held.dr_ip[0], XFS_ILOCK_EXCL);
> > +
> > +	xfs_defer_relock_buffers(dfc);
> > +
> > +	/* Join the captured resources to the new transaction. */
> >  	xfs_defer_restore_resources(tp, &dfc->dfc_held);
> >  	memcpy(dres, &dfc->dfc_held, sizeof(struct xfs_defer_resources));
> >  
> > +	/*
> > +	 * Inodes must be passed back to the log recovery code to be unlocked,
> > +	 * but buffers need not be.  Ignore the captured buffers.
> > +	 */
> > +	dres->dr_bufs = 0;
> 
> I'm not sure what this comment is supposed to indicate. This seems
> to be infrastructure specific to log recovery, not general runtime
> functionality, but even in that context I don't really understand
> what it means or why it is done...

The defer_capture machinery picks up inodes that were ijoined with
lock_flags==0 (i.e. caller will unlock explicitly), which is why they
have to be passed back out after the entire transaction sequence
completes.

By contrast, the defer capture machinery picks up buffers with BLI_HOLD
set on the log item.  These are only meant to maintain the hold across
the next transaction roll (or the next defer_finish invocation), which
means that the caller is responsible for unlocking and releasing the
buffer (or I guess re-holding it) during that next transaction.
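
To make the contrast concrete, the two patterns look roughly like this.
This is only an illustrative sketch (error handling and surrounding context
omitted; ip, bp and tp are placeholder variables), not the exact recovery
code, though the interfaces named are real ones:

	/* Inode: ijoined with lock_flags == 0, so the caller unlocks it later. */
	xfs_ilock(ip, XFS_ILOCK_EXCL);
	xfs_trans_ijoin(tp, ip, 0);	/* transaction does not own the ilock */
	error = xfs_trans_commit(tp);
	/* ...possibly much later, after the whole chain finishes... */
	xfs_iunlock(ip, XFS_ILOCK_EXCL);

	/* Buffer: held (BLI_HOLD) so it stays locked across the roll only. */
	xfs_trans_bjoin(tp, bp);
	xfs_trans_bhold(tp, bp);	/* keep bp locked past the commit */
	error = xfs_trans_commit(tp);
	/* caller rejoins bp to the next transaction or releases it itself */
	xfs_buf_relse(bp);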

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@xxxxxxxxxxxxx


