Re: [PATCH 4/4] xfs: fix AGF vs inode cluster buffer deadlock

"Darrick J. Wong" <djwong@xxxxxxxxxx> · Thu, 1 Jun 2023 07:38:26 -0700

On Thu, Jun 01, 2023 at 11:51:49AM +1000, Dave Chinner wrote:
> Friendly Ping.
> 
> Apart from the stray trace_printk()s I forgot to remove, are
> there any other problems with this patch I need to fix?

Nothing obvious that I could see.

Reviewed-by: Darrick J. Wong <djwong@xxxxxxxxxx>

--D

> -Dave.
> 
> On Tue, May 16, 2023 at 06:26:29PM -0700, Darrick J. Wong wrote:
> > On Wed, May 17, 2023 at 10:04:49AM +1000, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@xxxxxxxxxx>
> > > 
> > > Lock order in XFS is AGI -> AGF, hence for operations involving
> > > inode unlinked list operations we always lock the AGI first. Inode
> > > unlinked list operations operate on the inode cluster buffer,
> > > so the lock order there is AGI -> inode cluster buffer.
> > > 
> > > For O_TMPFILE operations, this now means the lock order set down in
> > > xfs_rename and xfs_link is AGI -> inode cluster buffer -> AGF as the
> > > unlinked ops are done before the directory modifications that may
> > > allocate space and lock the AGF.
> > > 
> > > Unfortunately, we also now lock the inode cluster buffer when
> > > logging an inode so that we can attach the inode to the cluster
> > > buffer and pin it in memory. This creates a lock order of AGF ->
> > > inode cluster buffer in directory operations as we have to log the
> > > inode after we've allocated new space for it.
> > > 
> > > This creates a lock inversion between the AGF and the inode cluster
> > > buffer. Because the inode cluster buffer is shared across multiple
> > > inodes, the inversion is not specific to individual inodes but can
> > > occur when inodes in the same cluster buffer are accessed in
> > > different orders.
> > > 
> > > To fix this we need move all the inode log item cluster buffer
> > > interactions to the end of the current transaction. Unfortunately,
> > > xfs_trans_log_inode() calls are littered throughout the transactions
> > > with no thought to ordering against other items or locking. This
> > > makes it difficult to do anything that involves changing the call
> > > sites of xfs_trans_log_inode() to change locking orders.
> > > 
> > > However, we do now have a mechanism that allows is to postpone dirty
> > > item processing to just before we commit the transaction: the
> > > ->iop_precommit method. This will be called after all the
> > > modifications are done and high level objects like AGI and AGF
> > > buffers have been locked and modified, thereby providing a mechanism
> > > that guarantees we don't lock the inode cluster buffer before those
> > > high level objects are locked.
> > > 
> > > This change is largely moving the guts of xfs_trans_log_inode() to
> > > xfs_inode_item_precommit() and providing an extra flag context in
> > > the inode log item to track the dirty state of the inode in the
> > > current transaction. This also means we do a lot less repeated work
> > > in xfs_trans_log_inode() by only doing it once per transaction when
> > > all the work is done.
> > 
> > Aha, and that's why you moved all the "opportunistically tweak inode
> > metadata while we're already logging it" bits to the precommit hook.
> > 
> > > Fixes: 298f7bec503f ("xfs: pin inode backing buffer to the inode log item")
> > > Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
> > > ---
> > >  fs/xfs/libxfs/xfs_log_format.h  |   9 +-
> > >  fs/xfs/libxfs/xfs_trans_inode.c | 115 +++---------------------
> > >  fs/xfs/xfs_inode_item.c         | 152 ++++++++++++++++++++++++++++++++
> > >  fs/xfs/xfs_inode_item.h         |   1 +
> > >  4 files changed, 171 insertions(+), 106 deletions(-)
> > > 
> > > diff --git a/fs/xfs/libxfs/xfs_log_format.h b/fs/xfs/libxfs/xfs_log_format.h
> > > index f13e0809dc63..269573c82808 100644
> > > --- a/fs/xfs/libxfs/xfs_log_format.h
> > > +++ b/fs/xfs/libxfs/xfs_log_format.h
> > > @@ -324,7 +324,6 @@ struct xfs_inode_log_format_32 {
> > >  #define XFS_ILOG_DOWNER	0x200	/* change the data fork owner on replay */
> > >  #define XFS_ILOG_AOWNER	0x400	/* change the attr fork owner on replay */
> > >  
> > > -
> > >  /*
> > >   * The timestamps are dirty, but not necessarily anything else in the inode
> > >   * core.  Unlike the other fields above this one must never make it to disk
> > > @@ -333,6 +332,14 @@ struct xfs_inode_log_format_32 {
> > >   */
> > >  #define XFS_ILOG_TIMESTAMP	0x4000
> > >  
> > > +/*
> > > + * The version field has been changed, but not necessarily anything else of
> > > + * interest. This must never make it to disk - it is used purely to ensure that
> > > + * the inode item ->precommit operation can update the fsync flag triggers
> > > + * in the inode item correctly.
> > > + */
> > > +#define XFS_ILOG_IVERSION	0x8000
> > > +
> > >  #define	XFS_ILOG_NONCORE	(XFS_ILOG_DDATA | XFS_ILOG_DEXT | \
> > >  				 XFS_ILOG_DBROOT | XFS_ILOG_DEV | \
> > >  				 XFS_ILOG_ADATA | XFS_ILOG_AEXT | \
> > > diff --git a/fs/xfs/libxfs/xfs_trans_inode.c b/fs/xfs/libxfs/xfs_trans_inode.c
> > > index 8b5547073379..2d164d0588b1 100644
> > > --- a/fs/xfs/libxfs/xfs_trans_inode.c
> > > +++ b/fs/xfs/libxfs/xfs_trans_inode.c
> > > @@ -40,9 +40,8 @@ xfs_trans_ijoin(
> > >  	iip->ili_lock_flags = lock_flags;
> > >  	ASSERT(!xfs_iflags_test(ip, XFS_ISTALE));
> > >  
> > > -	/*
> > > -	 * Get a log_item_desc to point at the new item.
> > > -	 */
> > > +	/* Reset the per-tx dirty context and add the item to the tx. */
> > > +	iip->ili_dirty_flags = 0;
> > >  	xfs_trans_add_item(tp, &iip->ili_item);
> > >  }
> > >  
> > > @@ -76,17 +75,10 @@ xfs_trans_ichgtime(
> > >  /*
> > >   * This is called to mark the fields indicated in fieldmask as needing to be
> > >   * logged when the transaction is committed.  The inode must already be
> > > - * associated with the given transaction.
> > > - *
> > > - * The values for fieldmask are defined in xfs_inode_item.h.  We always log all
> > > - * of the core inode if any of it has changed, and we always log all of the
> > > - * inline data/extents/b-tree root if any of them has changed.
> > > - *
> > > - * Grab and pin the cluster buffer associated with this inode to avoid RMW
> > > - * cycles at inode writeback time. Avoid the need to add error handling to every
> > > - * xfs_trans_log_inode() call by shutting down on read error.  This will cause
> > > - * transactions to fail and everything to error out, just like if we return a
> > > - * read error in a dirty transaction and cancel it.
> > > + * associated with the given transaction. All we do here is record where the
> > > + * inode was dirtied and mark the transaction and inode log item dirty;
> > > + * everything else is done in the ->precommit log item operation after the
> > > + * changes in the transaction have been completed.
> > >   */
> > >  void
> > >  xfs_trans_log_inode(
> > > @@ -96,7 +88,6 @@ xfs_trans_log_inode(
> > >  {
> > >  	struct xfs_inode_log_item *iip = ip->i_itemp;
> > >  	struct inode		*inode = VFS_I(ip);
> > > -	uint			iversion_flags = 0;
> > >  
> > >  	ASSERT(iip);
> > >  	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
> > > @@ -104,18 +95,6 @@ xfs_trans_log_inode(
> > >  
> > >  	tp->t_flags |= XFS_TRANS_DIRTY;
> > >  
> > > -	/*
> > > -	 * Don't bother with i_lock for the I_DIRTY_TIME check here, as races
> > > -	 * don't matter - we either will need an extra transaction in 24 hours
> > > -	 * to log the timestamps, or will clear already cleared fields in the
> > > -	 * worst case.
> > > -	 */
> > > -	if (inode->i_state & I_DIRTY_TIME) {
> > > -		spin_lock(&inode->i_lock);
> > > -		inode->i_state &= ~I_DIRTY_TIME;
> > > -		spin_unlock(&inode->i_lock);
> > > -	}
> > > -
> > >  	/*
> > >  	 * First time we log the inode in a transaction, bump the inode change
> > >  	 * counter if it is configured for this to occur. While we have the
> > > @@ -128,86 +107,12 @@ xfs_trans_log_inode(
> > >  	if (!test_and_set_bit(XFS_LI_DIRTY, &iip->ili_item.li_flags)) {
> > >  		if (IS_I_VERSION(inode) &&
> > >  		    inode_maybe_inc_iversion(inode, flags & XFS_ILOG_CORE))
> > > -			iversion_flags = XFS_ILOG_CORE;
> > > +			flags |= XFS_ILOG_IVERSION;
> > >  	}
> > >  
> > > -	/*
> > > -	 * If we're updating the inode core or the timestamps and it's possible
> > > -	 * to upgrade this inode to bigtime format, do so now.
> > > -	 */
> > > -	if ((flags & (XFS_ILOG_CORE | XFS_ILOG_TIMESTAMP)) &&
> > > -	    xfs_has_bigtime(ip->i_mount) &&
> > > -	    !xfs_inode_has_bigtime(ip)) {
> > > -		ip->i_diflags2 |= XFS_DIFLAG2_BIGTIME;
> > > -		flags |= XFS_ILOG_CORE;
> > > -	}
> > > -
> > > -	/*
> > > -	 * Inode verifiers do not check that the extent size hint is an integer
> > > -	 * multiple of the rt extent size on a directory with both rtinherit
> > > -	 * and extszinherit flags set.  If we're logging a directory that is
> > > -	 * misconfigured in this way, clear the hint.
> > > -	 */
> > > -	if ((ip->i_diflags & XFS_DIFLAG_RTINHERIT) &&
> > > -	    (ip->i_diflags & XFS_DIFLAG_EXTSZINHERIT) &&
> > > -	    (ip->i_extsize % ip->i_mount->m_sb.sb_rextsize) > 0) {
> > > -		ip->i_diflags &= ~(XFS_DIFLAG_EXTSIZE |
> > > -				   XFS_DIFLAG_EXTSZINHERIT);
> > > -		ip->i_extsize = 0;
> > > -		flags |= XFS_ILOG_CORE;
> > > -	}
> > > -
> > > -	/*
> > > -	 * Record the specific change for fdatasync optimisation. This allows
> > > -	 * fdatasync to skip log forces for inodes that are only timestamp
> > > -	 * dirty.
> > > -	 */
> > > -	spin_lock(&iip->ili_lock);
> > > -	iip->ili_fsync_fields |= flags;
> > > -
> > > -	if (!iip->ili_item.li_buf) {
> > > -		struct xfs_buf	*bp;
> > > -		int		error;
> > > -
> > > -		/*
> > > -		 * We hold the ILOCK here, so this inode is not going to be
> > > -		 * flushed while we are here. Further, because there is no
> > > -		 * buffer attached to the item, we know that there is no IO in
> > > -		 * progress, so nothing will clear the ili_fields while we read
> > > -		 * in the buffer. Hence we can safely drop the spin lock and
> > > -		 * read the buffer knowing that the state will not change from
> > > -		 * here.
> > > -		 */
> > > -		spin_unlock(&iip->ili_lock);
> > > -		error = xfs_imap_to_bp(ip->i_mount, tp, &ip->i_imap, &bp);
> > > -		if (error) {
> > > -			xfs_force_shutdown(ip->i_mount, SHUTDOWN_META_IO_ERROR);
> > > -			return;
> > > -		}
> > > -
> > > -		/*
> > > -		 * We need an explicit buffer reference for the log item but
> > > -		 * don't want the buffer to remain attached to the transaction.
> > > -		 * Hold the buffer but release the transaction reference once
> > > -		 * we've attached the inode log item to the buffer log item
> > > -		 * list.
> > > -		 */
> > > -		xfs_buf_hold(bp);
> > > -		spin_lock(&iip->ili_lock);
> > > -		iip->ili_item.li_buf = bp;
> > > -		bp->b_flags |= _XBF_INODES;
> > > -		list_add_tail(&iip->ili_item.li_bio_list, &bp->b_li_list);
> > > -		xfs_trans_brelse(tp, bp);
> > > -	}
> > > -
> > > -	/*
> > > -	 * Always OR in the bits from the ili_last_fields field.  This is to
> > > -	 * coordinate with the xfs_iflush() and xfs_buf_inode_iodone() routines
> > > -	 * in the eventual clearing of the ili_fields bits.  See the big comment
> > > -	 * in xfs_iflush() for an explanation of this coordination mechanism.
> > > -	 */
> > > -	iip->ili_fields |= (flags | iip->ili_last_fields | iversion_flags);
> > > -	spin_unlock(&iip->ili_lock);
> > > +	iip->ili_dirty_flags |= flags;
> > > +	trace_printk("ino 0x%llx, flags 0x%x, dflags 0x%x",
> > > +		ip->i_ino, flags, iip->ili_dirty_flags);
> > 
> > Urk, leftover debugging info?
> > 
> > --D
> > >  }
> > >  
> > >  int
> > > diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
> > > index ca2941ab6cbc..586af11b7cd1 100644
> > > --- a/fs/xfs/xfs_inode_item.c
> > > +++ b/fs/xfs/xfs_inode_item.c
> > > @@ -29,6 +29,156 @@ static inline struct xfs_inode_log_item *INODE_ITEM(struct xfs_log_item *lip)
> > >  	return container_of(lip, struct xfs_inode_log_item, ili_item);
> > >  }
> > >  
> > > +static uint64_t
> > > +xfs_inode_item_sort(
> > > +	struct xfs_log_item	*lip)
> > > +{
> > > +	return INODE_ITEM(lip)->ili_inode->i_ino;
> > > +}
> > > +
> > > +/*
> > > + * Prior to finally logging the inode, we have to ensure that all the
> > > + * per-modification inode state changes are applied. This includes VFS inode
> > > + * state updates, format conversions, verifier state synchronisation and
> > > + * ensuring the inode buffer remains in memory whilst the inode is dirty.
> > > + *
> > > + * We have to be careful when we grab the inode cluster buffer due to lock
> > > + * ordering constraints. The unlinked inode modifications (xfs_iunlink_item)
> > > + * require AGI -> inode cluster buffer lock order. The inode cluster buffer is
> > > + * not locked until ->precommit, so it happens after everything else has been
> > > + * modified.
> > > + *
> > > + * Further, we have AGI -> AGF lock ordering, and with O_TMPFILE handling we
> > > + * have AGI -> AGF -> iunlink item -> inode cluster buffer lock order. Hence we
> > > + * cannot safely lock the inode cluster buffer in xfs_trans_log_inode() because
> > > + * it can be called on a inode (e.g. via bumplink/droplink) before we take the
> > > + * AGF lock modifying directory blocks.
> > > + *
> > > + * Rather than force a complete rework of all the transactions to call
> > > + * xfs_trans_log_inode() once and once only at the end of every transaction, we
> > > + * move the pinning of the inode cluster buffer to a ->precommit operation. This
> > > + * matches how the xfs_iunlink_item locks the inode cluster buffer, and it
> > > + * ensures that the inode cluster buffer locking is always done last in a
> > > + * transaction. i.e. we ensure the lock order is always AGI -> AGF -> inode
> > > + * cluster buffer.
> > > + *
> > > + * If we return the inode number as the precommit sort key then we'll also
> > > + * guarantee that the order all inode cluster buffer locking is the same all the
> > > + * inodes and unlink items in the transaction.
> > > + */
> > > +static int
> > > +xfs_inode_item_precommit(
> > > +	struct xfs_trans	*tp,
> > > +	struct xfs_log_item	*lip)
> > > +{
> > > +	struct xfs_inode_log_item *iip = INODE_ITEM(lip);
> > > +	struct xfs_inode	*ip = iip->ili_inode;
> > > +	struct inode		*inode = VFS_I(ip);
> > > +	unsigned int		flags = iip->ili_dirty_flags;
> > > +
> > > +	trace_printk("ino 0x%llx, dflags 0x%x, fields 0x%x lastf 0x%x",
> > > +		ip->i_ino, flags, iip->ili_fields, iip->ili_last_fields);
> > > +	/*
> > > +	 * Don't bother with i_lock for the I_DIRTY_TIME check here, as races
> > > +	 * don't matter - we either will need an extra transaction in 24 hours
> > > +	 * to log the timestamps, or will clear already cleared fields in the
> > > +	 * worst case.
> > > +	 */
> > > +	if (inode->i_state & I_DIRTY_TIME) {
> > > +		spin_lock(&inode->i_lock);
> > > +		inode->i_state &= ~I_DIRTY_TIME;
> > > +		spin_unlock(&inode->i_lock);
> > > +	}
> > > +
> > > +
> > > +	/*
> > > +	 * If we're updating the inode core or the timestamps and it's possible
> > > +	 * to upgrade this inode to bigtime format, do so now.
> > > +	 */
> > > +	if ((flags & (XFS_ILOG_CORE | XFS_ILOG_TIMESTAMP)) &&
> > > +	    xfs_has_bigtime(ip->i_mount) &&
> > > +	    !xfs_inode_has_bigtime(ip)) {
> > > +		ip->i_diflags2 |= XFS_DIFLAG2_BIGTIME;
> > > +		flags |= XFS_ILOG_CORE;
> > > +	}
> > > +
> > > +	/*
> > > +	 * Inode verifiers do not check that the extent size hint is an integer
> > > +	 * multiple of the rt extent size on a directory with both rtinherit
> > > +	 * and extszinherit flags set.  If we're logging a directory that is
> > > +	 * misconfigured in this way, clear the hint.
> > > +	 */
> > > +	if ((ip->i_diflags & XFS_DIFLAG_RTINHERIT) &&
> > > +	    (ip->i_diflags & XFS_DIFLAG_EXTSZINHERIT) &&
> > > +	    (ip->i_extsize % ip->i_mount->m_sb.sb_rextsize) > 0) {
> > > +		ip->i_diflags &= ~(XFS_DIFLAG_EXTSIZE |
> > > +				   XFS_DIFLAG_EXTSZINHERIT);
> > > +		ip->i_extsize = 0;
> > > +		flags |= XFS_ILOG_CORE;
> > > +	}
> > > +
> > > +	/*
> > > +	 * Record the specific change for fdatasync optimisation. This allows
> > > +	 * fdatasync to skip log forces for inodes that are only timestamp
> > > +	 * dirty. Once we've processed the XFS_ILOG_IVERSION flag, convert it
> > > +	 * to XFS_ILOG_CORE so that the actual on-disk dirty tracking
> > > +	 * (ili_fields) correctly tracks that the version has changed.
> > > +	 */
> > > +	spin_lock(&iip->ili_lock);
> > > +	iip->ili_fsync_fields |= (flags & ~XFS_ILOG_IVERSION);
> > > +	if (flags & XFS_ILOG_IVERSION)
> > > +		flags = ((flags & ~XFS_ILOG_IVERSION) | XFS_ILOG_CORE);
> > > +
> > > +	if (!iip->ili_item.li_buf) {
> > > +		struct xfs_buf	*bp;
> > > +		int		error;
> > > +
> > > +		/*
> > > +		 * We hold the ILOCK here, so this inode is not going to be
> > > +		 * flushed while we are here. Further, because there is no
> > > +		 * buffer attached to the item, we know that there is no IO in
> > > +		 * progress, so nothing will clear the ili_fields while we read
> > > +		 * in the buffer. Hence we can safely drop the spin lock and
> > > +		 * read the buffer knowing that the state will not change from
> > > +		 * here.
> > > +		 */
> > > +		spin_unlock(&iip->ili_lock);
> > > +		error = xfs_imap_to_bp(ip->i_mount, tp, &ip->i_imap, &bp);
> > > +		if (error)
> > > +			return error;
> > > +
> > > +		/*
> > > +		 * We need an explicit buffer reference for the log item but
> > > +		 * don't want the buffer to remain attached to the transaction.
> > > +		 * Hold the buffer but release the transaction reference once
> > > +		 * we've attached the inode log item to the buffer log item
> > > +		 * list.
> > > +		 */
> > > +		xfs_buf_hold(bp);
> > > +		spin_lock(&iip->ili_lock);
> > > +		iip->ili_item.li_buf = bp;
> > > +		bp->b_flags |= _XBF_INODES;
> > > +		list_add_tail(&iip->ili_item.li_bio_list, &bp->b_li_list);
> > > +		xfs_trans_brelse(tp, bp);
> > > +	}
> > > +
> > > +	/*
> > > +	 * Always OR in the bits from the ili_last_fields field.  This is to
> > > +	 * coordinate with the xfs_iflush() and xfs_buf_inode_iodone() routines
> > > +	 * in the eventual clearing of the ili_fields bits.  See the big comment
> > > +	 * in xfs_iflush() for an explanation of this coordination mechanism.
> > > +	 */
> > > +	iip->ili_fields |= (flags | iip->ili_last_fields);
> > > +	spin_unlock(&iip->ili_lock);
> > > +
> > > +	/*
> > > +	 * We are done with the log item transaction dirty state, so clear it so
> > > +	 * that it doesn't pollute future transactions.
> > > +	 */
> > > +	iip->ili_dirty_flags = 0;
> > > +	return 0;
> > > +}
> > > +
> > >  /*
> > >   * The logged size of an inode fork is always the current size of the inode
> > >   * fork. This means that when an inode fork is relogged, the size of the logged
> > > @@ -662,6 +812,8 @@ xfs_inode_item_committing(
> > >  }
> > >  
> > >  static const struct xfs_item_ops xfs_inode_item_ops = {
> > > +	.iop_sort	= xfs_inode_item_sort,
> > > +	.iop_precommit	= xfs_inode_item_precommit,
> > >  	.iop_size	= xfs_inode_item_size,
> > >  	.iop_format	= xfs_inode_item_format,
> > >  	.iop_pin	= xfs_inode_item_pin,
> > > diff --git a/fs/xfs/xfs_inode_item.h b/fs/xfs/xfs_inode_item.h
> > > index bbd836a44ff0..377e06007804 100644
> > > --- a/fs/xfs/xfs_inode_item.h
> > > +++ b/fs/xfs/xfs_inode_item.h
> > > @@ -17,6 +17,7 @@ struct xfs_inode_log_item {
> > >  	struct xfs_log_item	ili_item;	   /* common portion */
> > >  	struct xfs_inode	*ili_inode;	   /* inode ptr */
> > >  	unsigned short		ili_lock_flags;	   /* inode lock flags */
> > > +	unsigned int		ili_dirty_flags;   /* dirty in current tx */
> > >  	/*
> > >  	 * The ili_lock protects the interactions between the dirty state and
> > >  	 * the flush state of the inode log item. This allows us to do atomic
> > > -- 
> > > 2.40.1
> > > 
> > 
> 
> -- 
> Dave Chinner
> david@xxxxxxxxxxxxx