Re: [PATCH 4/4] xfs: fix AGF vs inode cluster buffer deadlock

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Tue, May 16, 2023 at 06:26:29PM -0700, Darrick J. Wong wrote:
> On Wed, May 17, 2023 at 10:04:49AM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@xxxxxxxxxx>
> > 
> > Lock order in XFS is AGI -> AGF, hence for operations involving
> > inode unlinked list operations we always lock the AGI first. Inode
> > unlinked list operations operate on the inode cluster buffer,
> > so the lock order there is AGI -> inode cluster buffer.
> > 
> > For O_TMPFILE operations, this now means the lock order set down in
> > xfs_rename and xfs_link is AGI -> inode cluster buffer -> AGF as the
> > unlinked ops are done before the directory modifications that may
> > allocate space and lock the AGF.
> > 
> > Unfortunately, we also now lock the inode cluster buffer when
> > logging an inode so that we can attach the inode to the cluster
> > buffer and pin it in memory. This creates a lock order of AGF ->
> > inode cluster buffer in directory operations as we have to log the
> > inode after we've allocated new space for it.
> > 
> > This creates a lock inversion between the AGF and the inode cluster
> > buffer. Because the inode cluster buffer is shared across multiple
> > inodes, the inversion is not specific to individual inodes but can
> > occur when inodes in the same cluster buffer are accessed in
> > different orders.
> > 
> > To fix this we need move all the inode log item cluster buffer
> > interactions to the end of the current transaction. Unfortunately,
> > xfs_trans_log_inode() calls are littered throughout the transactions
> > with no thought to ordering against other items or locking. This
> > makes it difficult to do anything that involves changing the call
> > sites of xfs_trans_log_inode() to change locking orders.
> > 
> > However, we do now have a mechanism that allows is to postpone dirty
> > item processing to just before we commit the transaction: the
> > ->iop_precommit method. This will be called after all the
> > modifications are done and high level objects like AGI and AGF
> > buffers have been locked and modified, thereby providing a mechanism
> > that guarantees we don't lock the inode cluster buffer before those
> > high level objects are locked.
> > 
> > This change is largely moving the guts of xfs_trans_log_inode() to
> > xfs_inode_item_precommit() and providing an extra flag context in
> > the inode log item to track the dirty state of the inode in the
> > current transaction. This also means we do a lot less repeated work
> > in xfs_trans_log_inode() by only doing it once per transaction when
> > all the work is done.
> 
> Aha, and that's why you moved all the "opportunistically tweak inode
> metadata while we're already logging it" bits to the precommit hook.

Yes. It didn't make sense to move just some of the "only need to do
once per transaction while the inode is locked" stuff from one place
to the other. I figured it's better to have it all in one place and
do it all just once...

....
> > -	/*
> > -	 * Always OR in the bits from the ili_last_fields field.  This is to
> > -	 * coordinate with the xfs_iflush() and xfs_buf_inode_iodone() routines
> > -	 * in the eventual clearing of the ili_fields bits.  See the big comment
> > -	 * in xfs_iflush() for an explanation of this coordination mechanism.
> > -	 */
> > -	iip->ili_fields |= (flags | iip->ili_last_fields | iversion_flags);
> > -	spin_unlock(&iip->ili_lock);
> > +	iip->ili_dirty_flags |= flags;
> > +	trace_printk("ino 0x%llx, flags 0x%x, dflags 0x%x",
> > +		ip->i_ino, flags, iip->ili_dirty_flags);
> 
> Urk, leftover debugging info?

Yes, I just realised I'd left it there when writing the last email.
Ignoring those two trace-printk() statements, the rest of the code
should be ok to review...

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx



[Index of Archives]     [XFS Filesystem Development (older mail)]     [Linux Filesystem Development]     [Linux Audio Users]     [Yosemite Trails]     [Linux Kernel]     [Linux RAID]     [Linux SCSI]


  Powered by Linux