Re: [PATCH V2] xfs: Fix agi&agf ABBA deadlock when performing rename with RENAME_WHITEOUT flag

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Mon, Aug 19, 2019 at 11:13:35AM -0400, Brian Foster wrote:
> On Mon, Aug 19, 2019 at 09:06:39PM +0800, kaixuxia wrote:
> > When performing rename operation with RENAME_WHITEOUT flag, we will
> > hold AGF lock to allocate or free extents in manipulating the dirents
> > firstly, and then doing the xfs_iunlink_remove() call last to hold
> > AGI lock to modify the tmpfile info, so we the lock order AGI->AGF.
> > 
> > The big problem here is that we have an ordering constraint on AGF
> > and AGI locking - inode allocation locks the AGI, then can allocate
> > a new extent for new inodes, locking the AGF after the AGI. Hence
> > the ordering that is imposed by other parts of the code is AGI before
> > AGF. So we get the ABBA agi&agf deadlock here.
> > 
> ...
> > 
> > In this patch we move the xfs_iunlink_remove() call to between
> > xfs_dir_canenter() and xfs_dir_createname(). By doing xfs_iunlink
> > _remove() firstly, we remove the AGI/AGF lock inversion problem.
> > 
> > Signed-off-by: kaixuxia <kaixuxia@xxxxxxxxxxx>
> > ---
> >  fs/xfs/xfs_inode.c | 20 +++++++++++++++++---
> >  1 file changed, 17 insertions(+), 3 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > index 6467d5e..48691f2 100644
> > --- a/fs/xfs/xfs_inode.c
> > +++ b/fs/xfs/xfs_inode.c
> > @@ -3294,6 +3294,18 @@ struct xfs_iunlink {
> >  			if (error)
> >  				goto out_trans_cancel;
> >  		}
> > +
> > +		/*
> > +		 * Handle the whiteout inode and acquire the AGI lock, so
> > +		 * fix the AGI/AGF lock inversion problem.
> > +		 */
> 
> The comment could be a little more specific. For example:
> 
> "Directory entry creation may acquire the AGF. Remove the whiteout from
> the unlinked list first to preserve correct AGI/AGF locking order."
> 
> > +		if (wip) {
> > +			ASSERT(VFS_I(wip)->i_nlink == 0);
> > +			error = xfs_iunlink_remove(tp, wip);
> > +			if (error)
> > +				goto out_trans_cancel;
> > +		}
> > +
> >  		/*
> >  		 * If target does not exist and the rename crosses
> >  		 * directories, adjust the target directory link count
> > @@ -3428,9 +3440,11 @@ struct xfs_iunlink {
> >  	if (wip) {
> >  		ASSERT(VFS_I(wip)->i_nlink == 0);
> >  		xfs_bumplink(tp, wip);
> > -		error = xfs_iunlink_remove(tp, wip);
> > -		if (error)
> > -			goto out_trans_cancel;
> > +		if (target_ip != NULL) {
> > +			error = xfs_iunlink_remove(tp, wip);
> > +			if (error)
> > +				goto out_trans_cancel;
> > +		}
> 
> The comment above this hunk needs to be updated. I'm also not a big fan
> of this factoring of doing the removal in the if branch above and then
> encoding the else logic down here. It might be cleaner and more
> consistent to have a call in each branch of the if/else above.
> 
> FWIW, I'm also curious if this could be cleaned up further by pulling
> the -ENOSPC/-EEXIST checks out of the earlier branch, following that
> with the whiteout removal, and then doing the dir_create/replace. For
> example, something like:
> 
> 	/* error checks before we dirty the transaction */
> 	if (!target_ip && !spaceres) {
> 		error = xfs_dir_canenter();
> 		...
> 	} else if (S_ISDIR() && !(empty || nlink > 2))
> 		error = -EEXIST;
> 		...
> 	}
> 
> 	if (wip) {
> 		...
> 		xfs_iunlink_remove();
> 	}
> 
> 	if (!target_ip) {
> 		xfs_dir_create();
> 		...
> 	} else {
> 		xfs_dir_replace();
> 		...
> 	}
> 
> ... but that may not be any cleaner..? It could also be done as a
> followup cleanup patch as well.

Makes sense, but the more I look at this, the more I think we should
get rid of on-disk whiteout inodes altogether and finally make use
of XFS_DIR3_FT_WHT in the dirent file type.

The whiteout inode is a nasty VFS level hack to support whiteouts on
filesystems that have no way of storing whiteouts natively. i.e. a
whiteout inode is a chardev with a mode of 0 and a device number of
0. We only need to present that magic indoe to userspace, we don't
need to actually store it on disk.  IOWs, we could get rid of this
whole problem with XFS_DIR3_FT_WHT.

XFS_DIR3_FT_WHT means we no longer need to allocate a tmpfile for
the whiteout, we don't have xfs_iunlink_remove calls in rename, and
all we have to do is a xfs_dir_replace() call to zero out the inode
number and change the ftype in the dirent.

It will need changes to the xfs_lookup code to instantiate the
special chardev inode in memory when a whiteout is found, and
probably checks to ensure we never dirty such an inode. Giving it an
invalid inode number will be sufficient to prevent it from being
written to disk.

It's a bit more work, but I think killing on-disk whiteout inode is
the eventual solution we want here...

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx



[Index of Archives]     [XFS Filesystem Development (older mail)]     [Linux Filesystem Development]     [Linux Audio Users]     [Yosemite Trails]     [Linux Kernel]     [Linux RAID]     [Linux SCSI]


  Powered by Linux