Re: [PATCH 1/1] xfs: fix potential AGI <-> ILOCK ABBA deadlock in xrep_dinode_findmode_walk_directory

"Darrick J. Wong" <djwong@xxxxxxxxxx> · Fri, 29 Mar 2024 11:38:29 -0700

On Wed, Mar 27, 2024 at 09:56:35AM -0700, Christoph Hellwig wrote:
> > Thread 20558 holds an AGI buffer and is trying to grab the ILOCK of the
> > root directory.  Thread 20559 holds the root directory ILOCK and is
> > trying to grab the AGI of an inode that is one of the root directory's
> > children.  The AGI held by 20558 is the same buffer that 20559 is trying
> > to acquire.  In other words, this is an ABBA deadlock.
> > 
> > In general, the lock order is ILOCK and then AGI -- rename does this
> > while preparing for an operation involving whiteouts or renaming files
> > out of existence; and unlink does this when moving an inode to the
> > unlinked list.  The only place where we do it in the opposite order is
> > on the child during an icreate, but at that point the child is marked
> > INEW and is not visible to other threads.
> > 
> > Work around this deadlock by replacing the blocking ilock attempt with a
> > nonblocking loop that aborts after 30 seconds.  Relax for a jiffy after
> > a failed lock attempt.
> 
> Trylock and wait schemes are sketchy as hell.  Why do we need to hold
> the AGI lock when walking the directory?

The short answer is that we're holding the AGI to quiesce inode cache
activity in the AG containing the inode that xrep_dinode* is trying to
fix.  The goal of xrep_dinode* functions is to get the ondisk inode into
good enough shape that we can iget the inode and continue repairs with
the cached inode and all the functionality that you get with a cached
inode.

Longer answer:

When the xchk_setup_inode function fails to iget an inode, it grabs the
AGI buffer, computes the xfs_imap of the affected inode, and hands
things over to repair.  At this point, we've prevented any other threads
from trying to allocate or free an inode in that AG.

Repair uses the xfs_imap to read the inode cluster buffer, so now it
holds the top and the bottom of the inode structure.  One of two things
can happen:

1) If xrep_dinode_mode decides it doesn't need to do anything, we
continue correcting problems in the rest of the xfs_dinode, commit the
cluster buffer, and retry the untrusted iget.  Repair still holds the
AGI and the icluster buffer, so we know that nobody else could have
started a walk.  Therefore, we cannot deadlock with another thread
calling iget.

2) If xrep_dinode_mode does decide to scan the filesystem to try to
recover i_mode from ftypes, now we need to do untrusted igets of
every directory in the filesystem.  However, we still need to hold the
AGI and the cluster buffer.

Oh.  The xchk_iscan_iter in xrep_dinode_find_mode does a blocking
acquisition of every AGI in the filesystem.  If the busted inode is in a
high-numbered AG, the scan will try to take lower-numbered AGIs while
we're still holding the higher one, which is the wrong order.  Ok, I
need to fix that too.
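
To make the ordering problem concrete, here's a minimal userspace
sketch.  It assumes AGIs are supposed to be taken in increasing AG
number order and that the scan walks AGs from 0 upward; every name in
it is made up for illustration and none of it is kernel code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical classification of one AGI acquisition; not kernel API. */
enum agi_order {
	AGI_ALREADY_HELD,	/* scan reached the AG we already hold */
	AGI_IN_ORDER,		/* higher AG number than the held AGI */
	AGI_OUT_OF_ORDER,	/* lower AG number: blocking here can deadlock */
};

/*
 * Classify an attempt to lock the AGI of want_agno while the AGI of
 * held_agno is already locked, assuming an ascending lock order.
 */
static enum agi_order
agi_order_check(uint32_t held_agno, uint32_t want_agno)
{
	if (want_agno == held_agno)
		return AGI_ALREADY_HELD;
	if (want_agno > held_agno)
		return AGI_IN_ORDER;
	return AGI_OUT_OF_ORDER;
}

/*
 * Count how many AGIs a scan over AGs 0..agcount-1 would acquire out
 * of order while the AGI of held_agno stays locked.
 */
static uint32_t
count_out_of_order(uint32_t held_agno, uint32_t agcount)
{
	uint32_t agno;
	uint32_t nr = 0;

	assert(held_agno < agcount);

	for (agno = 0; agno < agcount; agno++) {
		if (agi_order_check(held_agno, agno) == AGI_OUT_OF_ORDER)
			nr++;
	}
	return nr;
}
```

With four AGs and the busted inode in AG 3, count_out_of_order(3, 4)
reports three out-of-order acquisitions (AGs 0, 1 and 2); with the
inode in AG 0 it reports none.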

--D