Re: [PATCH] xfs: Fix agi&agf ABBA deadlock when performing rename with RENAME_WHITEOUT flag

Dave Chinner <david@xxxxxxxxxxxxx> · Fri, 16 Aug 2019 09:10:01 +1000

On Tue, Aug 13, 2019 at 09:36:14AM -0400, Brian Foster wrote:
> On Tue, Aug 13, 2019 at 07:17:33PM +0800, kaixuxia wrote:
> > When performing rename operation with RENAME_WHITEOUT flag, we will
> > hold AGF lock to allocate or free extents in manipulating the dirents
> > firstly, and then doing the xfs_iunlink_remove() call last to hold
> > AGI lock to modify the tmpfile info, so we the lock order AGI->AGF.
> > 
> 
> IIUC, the whiteout use case is that we're renaming a file, but the
> source dentry must be replaced with a magic whiteout inode rather than
> be removed. Therefore, xfs_rename() allocates the whiteout inode as a
> tmpfile first in a separate transaction, updates the target dentry with
> the source inode, replaces the source dentry to point to the whiteout
> inode and finally removes the whiteout inode from the unlinked list
> (since it is a tmpfile). This leads to the problem described below
> because the rename transaction ends up doing directory block allocs
> (locking the AGF) followed by the unlinked list remove (locking the
> AGI).
> 
> My understanding from reading the code is that this is primarly to
> cleanly handle error scenarios. If anything fails after we've allocated
> the whiteout tmpfile, it's simply left on the unlinked list and so the
> filesystem remains in a consistent/recoverable state. Given that, the
> solution here seems like overkill to me. For one, I thought background
> unlinked list removal was already on our roadmap (Darrick might have
> been looking at that and may already have a prototype as well). Also,
> unlinked list removal occurs at log recovery time already. That's
> somewhat of an existing purpose of the list, which makes a deferred
> unlinked list removal operation superfluous in more traditional cases
> where unlinked list removal doesn't require consistency with a directory
> operation.
> 
> Functional discussion aside.. from a complexity standpoint I'm wondering
> if we could do something much more simple like acquire the AGI lock for
> a whiteout inode earlier in xfs_rename(). For example, suppose we did
> something like:
> 
> 	/*
> 	 * Acquire the whiteout agi to preserve locking order in anticipation of
> 	 * unlinked list removal.
> 	 */
> 	if (wip)
> 		xfs_read_agi(mp, tp, XFS_INO_TO_AGNO(mp, wip->i_ino), &agibp);
> 
> ... after we allocate the transaction but before we do any directory ops
> that can result in block allocations. Would that prevent the problem
> you've observed?

I'd prefer that we just do things in an order that doesn't invert
the locking. For a whiteout, we only allocate blocks when modifying
the target directory, and we do a check to see if that will succeed
before actually doing the directory modification. That means the
directory modification will only fail due to an IO error or
corruption, both of which have a high probability of causing the
filesystem to be shut down. Any error after the directory mod will
cause a shutdown because the transaction is dirty.

Further, the operation that will lock the AGF is the target
directory modification if blocks need to be allocated, and the whole
point of the "check before execution" is to abort if ENOSPC would
occur as a result of trying to allocate blocks and we don't have a
space reservation for for those blocks because we are very, very
close to ENOSPC already.

If we fail the xfs_iunlink_remove() operation, we're shutting down
the filesystem. If we fail the xfs_dir_createname(target) call, we
are most likely going to be shutting down the filesystem. So the
premise that locating xfs_iunlink_remove() at the end to make error
handling easy is not really true - transaction cancel will clean
both of them up and shut the filesystem down.

Hence I think the right thing to do is to move the
xfs_iunlink_remove() call to between xfs_dir_canenter() and
xfs_dir_createname(). This means ENOSPC will abort with a clean
transaction and all is good, otherwise a failure is most likely
going to shut down the filesystem and it doesn't matter if we do
xfs_iunlink_remove() or xfs_dir_createname() first.

And by doing xfs_iunlink_remove() first, we remove the AGI/AGF
lock inversion problem....

I think this holds together, but I might have missed something in
the tangle of different rename operation cases. So it's worth
checking, but it looks to me like a better solution than having
a bare AGI lock in the middle of the function to work around error
handling logic we didn't clearly enough about at the time (hindsight
and all that jazz)....

Thoughts?

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx