Re: [PATCH 4/4] xfs: fix AGF vs inode cluster buffer deadlock

"Darrick J. Wong" <djwong@xxxxxxxxxx> · Tue, 16 May 2023 18:26:29 -0700

On Wed, May 17, 2023 at 10:04:49AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@xxxxxxxxxx>
> 
> Lock order in XFS is AGI -> AGF, hence for operations involving
> inode unlinked list operations we always lock the AGI first. Inode
> unlinked list operations operate on the inode cluster buffer,
> so the lock order there is AGI -> inode cluster buffer.
> 
> For O_TMPFILE operations, this now means the lock order set down in
> xfs_rename and xfs_link is AGI -> inode cluster buffer -> AGF as the
> unlinked ops are done before the directory modifications that may
> allocate space and lock the AGF.
> 
> Unfortunately, we also now lock the inode cluster buffer when
> logging an inode so that we can attach the inode to the cluster
> buffer and pin it in memory. This creates a lock order of AGF ->
> inode cluster buffer in directory operations as we have to log the
> inode after we've allocated new space for it.
> 
> This creates a lock inversion between the AGF and the inode cluster
> buffer. Because the inode cluster buffer is shared across multiple
> inodes, the inversion is not specific to individual inodes but can
> occur when inodes in the same cluster buffer are accessed in
> different orders.
> 
> To fix this we need move all the inode log item cluster buffer
> interactions to the end of the current transaction. Unfortunately,
> xfs_trans_log_inode() calls are littered throughout the transactions
> with no thought to ordering against other items or locking. This
> makes it difficult to do anything that involves changing the call
> sites of xfs_trans_log_inode() to change locking orders.
> 
> However, we do now have a mechanism that allows is to postpone dirty
> item processing to just before we commit the transaction: the
> ->iop_precommit method. This will be called after all the
> modifications are done and high level objects like AGI and AGF
> buffers have been locked and modified, thereby providing a mechanism
> that guarantees we don't lock the inode cluster buffer before those
> high level objects are locked.
> 
> This change is largely moving the guts of xfs_trans_log_inode() to
> xfs_inode_item_precommit() and providing an extra flag context in
> the inode log item to track the dirty state of the inode in the
> current transaction. This also means we do a lot less repeated work
> in xfs_trans_log_inode() by only doing it once per transaction when
> all the work is done.

Aha, and that's why you moved all the "opportunistically tweak inode
metadata while we're already logging it" bits to the precommit hook.

> Fixes: 298f7bec503f ("xfs: pin inode backing buffer to the inode log item")
> Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
> ---
>  fs/xfs/libxfs/xfs_log_format.h  |   9 +-
>  fs/xfs/libxfs/xfs_trans_inode.c | 115 +++---------------------
>  fs/xfs/xfs_inode_item.c         | 152 ++++++++++++++++++++++++++++++++
>  fs/xfs/xfs_inode_item.h         |   1 +
>  4 files changed, 171 insertions(+), 106 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_log_format.h b/fs/xfs/libxfs/xfs_log_format.h
> index f13e0809dc63..269573c82808 100644
> --- a/fs/xfs/libxfs/xfs_log_format.h
> +++ b/fs/xfs/libxfs/xfs_log_format.h
> @@ -324,7 +324,6 @@ struct xfs_inode_log_format_32 {
>  #define XFS_ILOG_DOWNER	0x200	/* change the data fork owner on replay */
>  #define XFS_ILOG_AOWNER	0x400	/* change the attr fork owner on replay */
>  
> -
>  /*
>   * The timestamps are dirty, but not necessarily anything else in the inode
>   * core.  Unlike the other fields above this one must never make it to disk
> @@ -333,6 +332,14 @@ struct xfs_inode_log_format_32 {
>   */
>  #define XFS_ILOG_TIMESTAMP	0x4000
>  
> +/*
> + * The version field has been changed, but not necessarily anything else of
> + * interest. This must never make it to disk - it is used purely to ensure that
> + * the inode item ->precommit operation can update the fsync flag triggers
> + * in the inode item correctly.
> + */
> +#define XFS_ILOG_IVERSION	0x8000
> +
>  #define	XFS_ILOG_NONCORE	(XFS_ILOG_DDATA | XFS_ILOG_DEXT | \
>  				 XFS_ILOG_DBROOT | XFS_ILOG_DEV | \
>  				 XFS_ILOG_ADATA | XFS_ILOG_AEXT | \
> diff --git a/fs/xfs/libxfs/xfs_trans_inode.c b/fs/xfs/libxfs/xfs_trans_inode.c
> index 8b5547073379..2d164d0588b1 100644
> --- a/fs/xfs/libxfs/xfs_trans_inode.c
> +++ b/fs/xfs/libxfs/xfs_trans_inode.c
> @@ -40,9 +40,8 @@ xfs_trans_ijoin(
>  	iip->ili_lock_flags = lock_flags;
>  	ASSERT(!xfs_iflags_test(ip, XFS_ISTALE));
>  
> -	/*
> -	 * Get a log_item_desc to point at the new item.
> -	 */
> +	/* Reset the per-tx dirty context and add the item to the tx. */
> +	iip->ili_dirty_flags = 0;
>  	xfs_trans_add_item(tp, &iip->ili_item);
>  }
>  
> @@ -76,17 +75,10 @@ xfs_trans_ichgtime(
>  /*
>   * This is called to mark the fields indicated in fieldmask as needing to be
>   * logged when the transaction is committed.  The inode must already be
> - * associated with the given transaction.
> - *
> - * The values for fieldmask are defined in xfs_inode_item.h.  We always log all
> - * of the core inode if any of it has changed, and we always log all of the
> - * inline data/extents/b-tree root if any of them has changed.
> - *
> - * Grab and pin the cluster buffer associated with this inode to avoid RMW
> - * cycles at inode writeback time. Avoid the need to add error handling to every
> - * xfs_trans_log_inode() call by shutting down on read error.  This will cause
> - * transactions to fail and everything to error out, just like if we return a
> - * read error in a dirty transaction and cancel it.
> + * associated with the given transaction. All we do here is record where the
> + * inode was dirtied and mark the transaction and inode log item dirty;
> + * everything else is done in the ->precommit log item operation after the
> + * changes in the transaction have been completed.
>   */
>  void
>  xfs_trans_log_inode(
> @@ -96,7 +88,6 @@ xfs_trans_log_inode(
>  {
>  	struct xfs_inode_log_item *iip = ip->i_itemp;
>  	struct inode		*inode = VFS_I(ip);
> -	uint			iversion_flags = 0;
>  
>  	ASSERT(iip);
>  	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
> @@ -104,18 +95,6 @@ xfs_trans_log_inode(
>  
>  	tp->t_flags |= XFS_TRANS_DIRTY;
>  
> -	/*
> -	 * Don't bother with i_lock for the I_DIRTY_TIME check here, as races
> -	 * don't matter - we either will need an extra transaction in 24 hours
> -	 * to log the timestamps, or will clear already cleared fields in the
> -	 * worst case.
> -	 */
> -	if (inode->i_state & I_DIRTY_TIME) {
> -		spin_lock(&inode->i_lock);
> -		inode->i_state &= ~I_DIRTY_TIME;
> -		spin_unlock(&inode->i_lock);
> -	}
> -
>  	/*
>  	 * First time we log the inode in a transaction, bump the inode change
>  	 * counter if it is configured for this to occur. While we have the
> @@ -128,86 +107,12 @@ xfs_trans_log_inode(
>  	if (!test_and_set_bit(XFS_LI_DIRTY, &iip->ili_item.li_flags)) {
>  		if (IS_I_VERSION(inode) &&
>  		    inode_maybe_inc_iversion(inode, flags & XFS_ILOG_CORE))
> -			iversion_flags = XFS_ILOG_CORE;
> +			flags |= XFS_ILOG_IVERSION;
>  	}
>  
> -	/*
> -	 * If we're updating the inode core or the timestamps and it's possible
> -	 * to upgrade this inode to bigtime format, do so now.
> -	 */
> -	if ((flags & (XFS_ILOG_CORE | XFS_ILOG_TIMESTAMP)) &&
> -	    xfs_has_bigtime(ip->i_mount) &&
> -	    !xfs_inode_has_bigtime(ip)) {
> -		ip->i_diflags2 |= XFS_DIFLAG2_BIGTIME;
> -		flags |= XFS_ILOG_CORE;
> -	}
> -
> -	/*
> -	 * Inode verifiers do not check that the extent size hint is an integer
> -	 * multiple of the rt extent size on a directory with both rtinherit
> -	 * and extszinherit flags set.  If we're logging a directory that is
> -	 * misconfigured in this way, clear the hint.
> -	 */
> -	if ((ip->i_diflags & XFS_DIFLAG_RTINHERIT) &&
> -	    (ip->i_diflags & XFS_DIFLAG_EXTSZINHERIT) &&
> -	    (ip->i_extsize % ip->i_mount->m_sb.sb_rextsize) > 0) {
> -		ip->i_diflags &= ~(XFS_DIFLAG_EXTSIZE |
> -				   XFS_DIFLAG_EXTSZINHERIT);
> -		ip->i_extsize = 0;
> -		flags |= XFS_ILOG_CORE;
> -	}
> -
> -	/*
> -	 * Record the specific change for fdatasync optimisation. This allows
> -	 * fdatasync to skip log forces for inodes that are only timestamp
> -	 * dirty.
> -	 */
> -	spin_lock(&iip->ili_lock);
> -	iip->ili_fsync_fields |= flags;
> -
> -	if (!iip->ili_item.li_buf) {
> -		struct xfs_buf	*bp;
> -		int		error;
> -
> -		/*
> -		 * We hold the ILOCK here, so this inode is not going to be
> -		 * flushed while we are here. Further, because there is no
> -		 * buffer attached to the item, we know that there is no IO in
> -		 * progress, so nothing will clear the ili_fields while we read
> -		 * in the buffer. Hence we can safely drop the spin lock and
> -		 * read the buffer knowing that the state will not change from
> -		 * here.
> -		 */
> -		spin_unlock(&iip->ili_lock);
> -		error = xfs_imap_to_bp(ip->i_mount, tp, &ip->i_imap, &bp);
> -		if (error) {
> -			xfs_force_shutdown(ip->i_mount, SHUTDOWN_META_IO_ERROR);
> -			return;
> -		}
> -
> -		/*
> -		 * We need an explicit buffer reference for the log item but
> -		 * don't want the buffer to remain attached to the transaction.
> -		 * Hold the buffer but release the transaction reference once
> -		 * we've attached the inode log item to the buffer log item
> -		 * list.
> -		 */
> -		xfs_buf_hold(bp);
> -		spin_lock(&iip->ili_lock);
> -		iip->ili_item.li_buf = bp;
> -		bp->b_flags |= _XBF_INODES;
> -		list_add_tail(&iip->ili_item.li_bio_list, &bp->b_li_list);
> -		xfs_trans_brelse(tp, bp);
> -	}
> -
> -	/*
> -	 * Always OR in the bits from the ili_last_fields field.  This is to
> -	 * coordinate with the xfs_iflush() and xfs_buf_inode_iodone() routines
> -	 * in the eventual clearing of the ili_fields bits.  See the big comment
> -	 * in xfs_iflush() for an explanation of this coordination mechanism.
> -	 */
> -	iip->ili_fields |= (flags | iip->ili_last_fields | iversion_flags);
> -	spin_unlock(&iip->ili_lock);
> +	iip->ili_dirty_flags |= flags;
> +	trace_printk("ino 0x%llx, flags 0x%x, dflags 0x%x",
> +		ip->i_ino, flags, iip->ili_dirty_flags);

Urk, leftover debugging info?

--D
>  }
>  
>  int
> diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
> index ca2941ab6cbc..586af11b7cd1 100644
> --- a/fs/xfs/xfs_inode_item.c
> +++ b/fs/xfs/xfs_inode_item.c
> @@ -29,6 +29,156 @@ static inline struct xfs_inode_log_item *INODE_ITEM(struct xfs_log_item *lip)
>  	return container_of(lip, struct xfs_inode_log_item, ili_item);
>  }
>  
> +static uint64_t
> +xfs_inode_item_sort(
> +	struct xfs_log_item	*lip)
> +{
> +	return INODE_ITEM(lip)->ili_inode->i_ino;
> +}
> +
> +/*
> + * Prior to finally logging the inode, we have to ensure that all the
> + * per-modification inode state changes are applied. This includes VFS inode
> + * state updates, format conversions, verifier state synchronisation and
> + * ensuring the inode buffer remains in memory whilst the inode is dirty.
> + *
> + * We have to be careful when we grab the inode cluster buffer due to lock
> + * ordering constraints. The unlinked inode modifications (xfs_iunlink_item)
> + * require AGI -> inode cluster buffer lock order. The inode cluster buffer is
> + * not locked until ->precommit, so it happens after everything else has been
> + * modified.
> + *
> + * Further, we have AGI -> AGF lock ordering, and with O_TMPFILE handling we
> + * have AGI -> AGF -> iunlink item -> inode cluster buffer lock order. Hence we
> + * cannot safely lock the inode cluster buffer in xfs_trans_log_inode() because
> + * it can be called on a inode (e.g. via bumplink/droplink) before we take the
> + * AGF lock modifying directory blocks.
> + *
> + * Rather than force a complete rework of all the transactions to call
> + * xfs_trans_log_inode() once and once only at the end of every transaction, we
> + * move the pinning of the inode cluster buffer to a ->precommit operation. This
> + * matches how the xfs_iunlink_item locks the inode cluster buffer, and it
> + * ensures that the inode cluster buffer locking is always done last in a
> + * transaction. i.e. we ensure the lock order is always AGI -> AGF -> inode
> + * cluster buffer.
> + *
> + * If we return the inode number as the precommit sort key then we'll also
> + * guarantee that the order all inode cluster buffer locking is the same all the
> + * inodes and unlink items in the transaction.
> + */
> +static int
> +xfs_inode_item_precommit(
> +	struct xfs_trans	*tp,
> +	struct xfs_log_item	*lip)
> +{
> +	struct xfs_inode_log_item *iip = INODE_ITEM(lip);
> +	struct xfs_inode	*ip = iip->ili_inode;
> +	struct inode		*inode = VFS_I(ip);
> +	unsigned int		flags = iip->ili_dirty_flags;
> +
> +	trace_printk("ino 0x%llx, dflags 0x%x, fields 0x%x lastf 0x%x",
> +		ip->i_ino, flags, iip->ili_fields, iip->ili_last_fields);
> +	/*
> +	 * Don't bother with i_lock for the I_DIRTY_TIME check here, as races
> +	 * don't matter - we either will need an extra transaction in 24 hours
> +	 * to log the timestamps, or will clear already cleared fields in the
> +	 * worst case.
> +	 */
> +	if (inode->i_state & I_DIRTY_TIME) {
> +		spin_lock(&inode->i_lock);
> +		inode->i_state &= ~I_DIRTY_TIME;
> +		spin_unlock(&inode->i_lock);
> +	}
> +
> +
> +	/*
> +	 * If we're updating the inode core or the timestamps and it's possible
> +	 * to upgrade this inode to bigtime format, do so now.
> +	 */
> +	if ((flags & (XFS_ILOG_CORE | XFS_ILOG_TIMESTAMP)) &&
> +	    xfs_has_bigtime(ip->i_mount) &&
> +	    !xfs_inode_has_bigtime(ip)) {
> +		ip->i_diflags2 |= XFS_DIFLAG2_BIGTIME;
> +		flags |= XFS_ILOG_CORE;
> +	}
> +
> +	/*
> +	 * Inode verifiers do not check that the extent size hint is an integer
> +	 * multiple of the rt extent size on a directory with both rtinherit
> +	 * and extszinherit flags set.  If we're logging a directory that is
> +	 * misconfigured in this way, clear the hint.
> +	 */
> +	if ((ip->i_diflags & XFS_DIFLAG_RTINHERIT) &&
> +	    (ip->i_diflags & XFS_DIFLAG_EXTSZINHERIT) &&
> +	    (ip->i_extsize % ip->i_mount->m_sb.sb_rextsize) > 0) {
> +		ip->i_diflags &= ~(XFS_DIFLAG_EXTSIZE |
> +				   XFS_DIFLAG_EXTSZINHERIT);
> +		ip->i_extsize = 0;
> +		flags |= XFS_ILOG_CORE;
> +	}
> +
> +	/*
> +	 * Record the specific change for fdatasync optimisation. This allows
> +	 * fdatasync to skip log forces for inodes that are only timestamp
> +	 * dirty. Once we've processed the XFS_ILOG_IVERSION flag, convert it
> +	 * to XFS_ILOG_CORE so that the actual on-disk dirty tracking
> +	 * (ili_fields) correctly tracks that the version has changed.
> +	 */
> +	spin_lock(&iip->ili_lock);
> +	iip->ili_fsync_fields |= (flags & ~XFS_ILOG_IVERSION);
> +	if (flags & XFS_ILOG_IVERSION)
> +		flags = ((flags & ~XFS_ILOG_IVERSION) | XFS_ILOG_CORE);
> +
> +	if (!iip->ili_item.li_buf) {
> +		struct xfs_buf	*bp;
> +		int		error;
> +
> +		/*
> +		 * We hold the ILOCK here, so this inode is not going to be
> +		 * flushed while we are here. Further, because there is no
> +		 * buffer attached to the item, we know that there is no IO in
> +		 * progress, so nothing will clear the ili_fields while we read
> +		 * in the buffer. Hence we can safely drop the spin lock and
> +		 * read the buffer knowing that the state will not change from
> +		 * here.
> +		 */
> +		spin_unlock(&iip->ili_lock);
> +		error = xfs_imap_to_bp(ip->i_mount, tp, &ip->i_imap, &bp);
> +		if (error)
> +			return error;
> +
> +		/*
> +		 * We need an explicit buffer reference for the log item but
> +		 * don't want the buffer to remain attached to the transaction.
> +		 * Hold the buffer but release the transaction reference once
> +		 * we've attached the inode log item to the buffer log item
> +		 * list.
> +		 */
> +		xfs_buf_hold(bp);
> +		spin_lock(&iip->ili_lock);
> +		iip->ili_item.li_buf = bp;
> +		bp->b_flags |= _XBF_INODES;
> +		list_add_tail(&iip->ili_item.li_bio_list, &bp->b_li_list);
> +		xfs_trans_brelse(tp, bp);
> +	}
> +
> +	/*
> +	 * Always OR in the bits from the ili_last_fields field.  This is to
> +	 * coordinate with the xfs_iflush() and xfs_buf_inode_iodone() routines
> +	 * in the eventual clearing of the ili_fields bits.  See the big comment
> +	 * in xfs_iflush() for an explanation of this coordination mechanism.
> +	 */
> +	iip->ili_fields |= (flags | iip->ili_last_fields);
> +	spin_unlock(&iip->ili_lock);
> +
> +	/*
> +	 * We are done with the log item transaction dirty state, so clear it so
> +	 * that it doesn't pollute future transactions.
> +	 */
> +	iip->ili_dirty_flags = 0;
> +	return 0;
> +}
> +
>  /*
>   * The logged size of an inode fork is always the current size of the inode
>   * fork. This means that when an inode fork is relogged, the size of the logged
> @@ -662,6 +812,8 @@ xfs_inode_item_committing(
>  }
>  
>  static const struct xfs_item_ops xfs_inode_item_ops = {
> +	.iop_sort	= xfs_inode_item_sort,
> +	.iop_precommit	= xfs_inode_item_precommit,
>  	.iop_size	= xfs_inode_item_size,
>  	.iop_format	= xfs_inode_item_format,
>  	.iop_pin	= xfs_inode_item_pin,
> diff --git a/fs/xfs/xfs_inode_item.h b/fs/xfs/xfs_inode_item.h
> index bbd836a44ff0..377e06007804 100644
> --- a/fs/xfs/xfs_inode_item.h
> +++ b/fs/xfs/xfs_inode_item.h
> @@ -17,6 +17,7 @@ struct xfs_inode_log_item {
>  	struct xfs_log_item	ili_item;	   /* common portion */
>  	struct xfs_inode	*ili_inode;	   /* inode ptr */
>  	unsigned short		ili_lock_flags;	   /* inode lock flags */
> +	unsigned int		ili_dirty_flags;   /* dirty in current tx */
>  	/*
>  	 * The ili_lock protects the interactions between the dirty state and
>  	 * the flush state of the inode log item. This allows us to do atomic
> -- 
> 2.40.1
>