Re: [PATCH v3] xfs: Fix agi&agf ABBA deadlock when performing rename with RENAME_WHITEOUT flag

kaixuxia <xiakaixu1987@xxxxxxxxx> · Thu, 22 Aug 2019 10:07:58 +0800



On 2019/8/21 19:25, Brian Foster wrote:
> On Wed, Aug 21, 2019 at 12:46:18PM +0800, kaixuxia wrote:
>> When performing rename operation with RENAME_WHITEOUT flag, we will
>> hold AGF lock to allocate or free extents in manipulating the dirents
>> firstly, and then doing the xfs_iunlink_remove() call last to hold
>> AGI lock to modify the tmpfile info, so we the lock order AGI->AGF.
>>
>> The big problem here is that we have an ordering constraint on AGF
>> and AGI locking - inode allocation locks the AGI, then can allocate
>> a new extent for new inodes, locking the AGF after the AGI. Hence
>> the ordering that is imposed by other parts of the code is AGI before
>> AGF. So we get the ABBA agi&agf deadlock here.
>>
> ...
>>
>> In this patch we move the xfs_iunlink_remove() call to
>> before acquiring the AGF lock to preserve correct AGI/AGF locking
>> order.
>>
>> Signed-off-by: kaixuxia <kaixuxia@xxxxxxxxxxx>
>> ---
> 
> FYI, I see this when I pull in this patch:
> 
> warning: Patch sent with format=flowed; space at the end of lines might be lost.
> 
> Not sure what it means or if it matters. :P

This is my Thunderbird edit config problem, will fix it. :) 
> 
> Otherwise this looks much better to me generally. Just some nits..
> 
>>  fs/xfs/xfs_inode.c | 61 ++++++++++++++++++++++++++++++++++--------------------
>>  1 file changed, 38 insertions(+), 23 deletions(-)
>>
>> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
>> index 6467d5e..cf06568 100644
>> --- a/fs/xfs/xfs_inode.c
>> +++ b/fs/xfs/xfs_inode.c
>> @@ -3282,7 +3282,8 @@ struct xfs_iunlink {
>>  					spaceres);
>>
>>  	/*
>> -	 * Set up the target.
>> +	 * Error checks before we dirty the transaction, return
>> +	 * the error code if check failed and the filesystem is clean.
> 
> I'm not sure what "filesystem is clean" refers to here. I think you mean
> transaction, but I'm wondering if something like the following is a bit
> more clear:
> 
> "Check for expected errors before we dirty the transaction so we can
> return an error without a transaction abort."
> 
>>  	 */
>>  	if (target_ip == NULL) {
>>  		/*
>> @@ -3294,6 +3295,40 @@ struct xfs_iunlink {
>>  			if (error)
>>  				goto out_trans_cancel;
>>  		}
>> +	} else {
>> +		/*
>> +		 * If target exists and it's a directory, check that both
>> +		 * target and source are directories and that target can be
>> +		 * destroyed, or that neither is a directory.
>> +		 */
> 
> Interesting that the existing comment refers to checking the source
> inode, but that doesn't happen in the code. That's not a bug in this
> patch, but are we missing a check here or is the comment stale?
> 
>> +		if (S_ISDIR(VFS_I(target_ip)->i_mode)) {
>> +			/*
>> +			 * Make sure target dir is empty.
>> +			 */
>> +			if (!(xfs_dir_isempty(target_ip)) ||
>> +			    (VFS_I(target_ip)->i_nlink > 2)) {
>> +				error = -EEXIST;
>> +				goto out_trans_cancel;
>> +			}
>> +		}
>> +	}
> 
> Code seems fine, but I think we could save some lines by condensing the
> logic a bit. For example:
> 
> 	/*
> 	 * ...
> 	 */
> 	if (!target_ip && !spaceres) {
> 		/* check for a no res dentry creation */
> 		error = xfs_dir_canenter();
> 		...
> 	} else if (target_ip && S_ISDIR(VFS_I(target_ip)->i_mode) &&
> 		   (!(xfs_dir_isempty(target_ip)) || 
> 		    (VFS_I(target_ip)->i_nlink > 2)))
> 		/* can't rename over a non-empty directory */
> 		error = -EEXIST;
> 		goto out_trans_cancel;
> 	}
> 
> Hm? Note that we use an 80 column limit, but we also want to expand
> short lines to that limit as much as possible and use alignment to make
> logic easier to read.
> 
>> +
>> +	/*
>> +	 * Directory entry creation below may acquire the AGF. Remove
>> +	 * the whiteout from the unlinked list first to preserve correct
>> +	 * AGI/AGF locking order.
>> +	 */
>> +	if (wip) {
>> +		ASSERT(VFS_I(wip)->i_nlink == 0);
>> +		error = xfs_iunlink_remove(tp, wip);
>> +		if (error)
>> +			goto out_trans_cancel;
>> +	}
>> +
>> +	/*
>> +	 * Set up the target.
>> +	 */
>> +	if (target_ip == NULL) {
>>  		/*
>>  		 * If target does not exist and the rename crosses
>>  		 * directories, adjust the target directory link count
>> @@ -3312,22 +3347,6 @@ struct xfs_iunlink {
>>  		}
>>  	} else { /* target_ip != NULL */
>>  		/*
>> -		 * If target exists and it's a directory, check that both
>> -		 * target and source are directories and that target can be
>> -		 * destroyed, or that neither is a directory.
>> -		 */
>> -		if (S_ISDIR(VFS_I(target_ip)->i_mode)) {
>> -			/*
>> -			 * Make sure target dir is empty.
>> -			 */
>> -			if (!(xfs_dir_isempty(target_ip)) ||
>> -			    (VFS_I(target_ip)->i_nlink > 2)) {
>> -				error = -EEXIST;
>> -				goto out_trans_cancel;
>> -			}
>> -		}
>> -
>> -		/*
>>  		 * Link the source inode under the target name.
>>  		 * If the source inode is a directory and we are moving
>>  		 * it across directories, its ".." entry will be
>> @@ -3421,16 +3440,12 @@ struct xfs_iunlink {
>>  	 * For whiteouts, we need to bump the link count on the whiteout inode.
>>  	 * This means that failures all the way up to this point leave the inode
>>  	 * on the unlinked list and so cleanup is a simple matter of dropping
>> -	 * the remaining reference to it. If we fail here after bumping the link
>> -	 * count, we're shutting down the filesystem so we'll never see the
>> -	 * intermediate state on disk.
>> +	 * the remaining reference to it. Move the xfs_iunlink_remove() call to
>> +	 * before acquiring the AGF lock to preserve correct AGI/AGF locking order.
> 
> With this change, the earlier part of this comment about failures up
> this point leaving the whiteout on the unlinked list is no longer true.
> We've already removed it earlier in the function. Also, the new bit
> about "moving" the call is confusing because it describes more what this
> patch does vs the current code.
> 
> I'd suggest a new comment that combines with the one within this branch
> (not shown in the patch). For example:
> 
>         /*
>          * For whiteouts, we need to bump the link count on the whiteout inode.
>          * This means that failures all the way up to this point leave the inode
>          * on the unlinked list and so cleanup is a simple matter of dropping
>          * the remaining reference to it. If we fail here after bumping the link
>          * count, we're shutting down the filesystem so we'll never see the
>          * intermediate state on disk.
>          */
> 
> And then remove the comment inside the branch. FWIW, you could also add
> a sentence to the earlier comment where the wip is removed like: "This
> dirties the transaction so failures after this point will abort and log
> recovery will clean up the mess."

Thanks a lot for all the comments. I will address them and send
the new patch.
> 
> Brian
> 
>>  	 */
>>  	if (wip) {
>>  		ASSERT(VFS_I(wip)->i_nlink == 0);
>>  		xfs_bumplink(tp, wip);
>> -		error = xfs_iunlink_remove(tp, wip);
>> -		if (error)
>> -			goto out_trans_cancel;
>>  		xfs_trans_log_inode(tp, wip, XFS_ILOG_CORE);
>>
>>  		/*
>> -- 
>> 1.8.3.1
>>
>> -- 
>> kaixuxia

-- 
kaixuxia