Re: [PATCH] xfs: recheck appropriateness of map_shared lock

Dave Chinner <david@xxxxxxxxxxxxx> · Thu, 19 Jan 2023 16:14:11 +1100

On Wed, Jan 18, 2023 at 05:24:58PM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@xxxxxxxxxx>
> 
> While fuzzing the data fork extent count on a btree-format directory
> with xfs/375, I observed the following (excerpted) splat:
> 
> XFS: Assertion failed: xfs_isilocked(ip, XFS_ILOCK_EXCL), file: fs/xfs/libxfs/xfs_bmap.c, line: 1208
> ------------[ cut here ]------------
> WARNING: CPU: 0 PID: 43192 at fs/xfs/xfs_message.c:104 assfail+0x46/0x4a [xfs]
> Call Trace:
>  <TASK>
>  xfs_iread_extents+0x1af/0x210 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
>  xchk_dir_walk+0xb8/0x190 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
>  xchk_parent_count_parent_dentries+0x41/0x80 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
>  xchk_parent_validate+0x199/0x2e0 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
>  xchk_parent+0xdf/0x130 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
>  xfs_scrub_metadata+0x2b8/0x730 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
>  xfs_scrubv_metadata+0x38b/0x4d0 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
>  xfs_ioc_scrubv_metadata+0x111/0x160 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
>  xfs_file_ioctl+0x367/0xf50 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
>  __x64_sys_ioctl+0x82/0xa0
>  do_syscall_64+0x2b/0x80
>  entry_SYSCALL_64_after_hwframe+0x46/0xb0
> 
> The cause of this is a race condition in xfs_ilock_data_map_shared,
> which performs an unlocked access to the data fork to guess which lock
> mode it needs:
> 
> Thread 0                          Thread 1
> 
> xfs_need_iread_extents
> <observe no iext tree>
> xfs_ilock(..., ILOCK_EXCL)
> xfs_iread_extents
> <observe no iext tree>
> <check ILOCK_EXCL>
> <load bmbt extents into iext>
> <notice iext size doesn't
>  match nextents>
>                                   xfs_need_iread_extents
>                                   <observe iext tree>
>                                   xfs_ilock(..., ILOCK_SHARED)
> <tear down iext tree>
> xfs_iunlock(..., ILOCK_EXCL)
>                                   xfs_iread_extents
>                                   <observe no iext tree>
>                                   <check ILOCK_EXCL>
>                                   *BOOM*
> 
> Mitigate this race by having thread 1 recheck xfs_need_iread_extents
> after taking the shared ILOCK.  If the iext tree isn't present, then we
> need to upgrade to the exclusive ILOCK to try to load the bmbt.

Yup, I see the problem - this check is failing:

        if (XFS_IS_CORRUPT(mp, ir.loaded != ifp->if_nextents)) {
                error = -EFSCORRUPTED;
                goto out;
        }

and that results in calling xfs_iext_destroy() to tear down the
extent tree.

But we know the BMBT is corrupted and the extent list cannot be read
until the corruption is fixed. IOWs, we can't access any data in the
inode no matter how we lock it until the corruption is repaired.

> 
> Signed-off-by: Darrick J. Wong <djwong@xxxxxxxxxx>
> ---
>  fs/xfs/xfs_inode.c |   29 +++++++++++++++++++++++++++++
>  1 file changed, 29 insertions(+)
> 
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index d354ea2b74f9..6ce1e0e9f256 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -117,6 +117,20 @@ xfs_ilock_data_map_shared(
>  	if (xfs_need_iread_extents(&ip->i_df))
>  		lock_mode = XFS_ILOCK_EXCL;
>  	xfs_ilock(ip, lock_mode);
> +
> +	/*
> +	 * It's possible that the unlocked access of the data fork to determine
> +	 * the lock mode could have raced with another thread that was failing
> +	 * to load the bmbt but hadn't yet torn down the iext tree.  Recheck
> +	 * the lock mode and upgrade to an exclusive lock if we need to.
> +	 */
> +	if (lock_mode == XFS_ILOCK_SHARED &&
> +	    xfs_need_iread_extents(&ip->i_df)) {
> +		xfs_iunlock(ip, lock_mode);
> +		lock_mode = XFS_ILOCK_EXCL;
> +		xfs_ilock(ip, lock_mode);
> +	}

.... and this makes me cringe. :/

If we hit this race condition, re-reading the extent list from disk
isn't going to fix the corruption, so I don't see much point in
papering over the problem just by changing the locking and failing
to read in the extent list again and returning -EFSCORRUPTED to the
operation.

So.... shouldn't we mark the inode as sick when we detect the extent
list corruption issue? i.e. before destroying the iext tree, call
xfs_inode_mark_sick(ip, XFS_SICK_INO_BMBTD) (or BMBTA, depending on
the fork being read) so that there is a record of the BMBT being
corrupt?
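
As a rough userspace illustration of that ordering (record the sickness
first, then tear down the in-core tree), here is a minimal sketch. The
model_* types, the MODEL_* constants and the "loaded" parameter are all
hypothetical stand-ins invented for this example; they are not the
kernel's structures or the real xfs_iread_extents() code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins, NOT the kernel definitions. */
#define MODEL_SICK_BMBTD	(1u << 0)
#define MODEL_EFSCORRUPTED	117

struct model_fork {
	unsigned int	nextents;	/* extent count recorded in the inode */
	bool		iext_loaded;	/* is the in-core extent tree present? */
};

struct model_inode {
	unsigned int		sick;	/* health flags */
	struct model_fork	df;	/* data fork */
};

/*
 * Model of the tail end of an extent load.  "loaded" is the number of
 * extents actually read from the (possibly corrupt) btree.
 */
static int
model_iread_extents(
	struct model_inode	*ip,
	unsigned int		loaded)
{
	assert(ip != NULL);

	/* The tree is temporarily populated while we load records. */
	ip->df.iext_loaded = true;

	if (loaded != ip->df.nextents) {
		/* Record the corruption before tearing anything down. */
		ip->sick |= MODEL_SICK_BMBTD;
		ip->df.iext_loaded = false;
		return -MODEL_EFSCORRUPTED;
	}
	return 0;
}
```

In this model, any later observer that finds the tree missing also
finds the sick flag set, because the flag is written before the tree is
destroyed.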

That would mean that this path simply becomes:

	if (ip->i_sick & XFS_SICK_INO_BMBTD) {
		xfs_iunlock(ip, lock_mode);
		return -EFSCORRUPTED;
	}

That makes it pretty clear that there's no point continuing
because we can't read in the extent list, and in doing so we've
removed the race condition caused by temporarily filling the in-core
extent list.
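
To make the shape of that lock helper concrete, here is a minimal
single-threaded userspace sketch. Everything named sketch_* or SKETCH_*
is hypothetical and exists only for this example. It also assumes the
helper is changed to return an error and hand the lock mode back
through a pointer, which is not what the current function does (it
returns the lock mode directly).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins, NOT the kernel definitions. */
#define SKETCH_SICK_BMBTD	(1u << 0)
#define SKETCH_EFSCORRUPTED	117

enum sketch_lock_mode {
	SKETCH_UNLOCKED = 0,
	SKETCH_ILOCK_SHARED,
	SKETCH_ILOCK_EXCL,
};

struct sketch_inode {
	unsigned int		sick;		/* health flags */
	bool			iext_loaded;	/* in-core extent tree present? */
	enum sketch_lock_mode	held;		/* models the ILOCK state */
};

/*
 * Pick a lock mode from an unlocked peek at the fork, "take" the lock,
 * then bail out if the fork's bmbt is already known to be corrupt.
 */
static int
sketch_ilock_data_map_shared(
	struct sketch_inode	*ip,
	enum sketch_lock_mode	*mode_out)
{
	enum sketch_lock_mode	mode = SKETCH_ILOCK_SHARED;

	assert(ip != NULL && mode_out != NULL);

	if (!ip->iext_loaded)
		mode = SKETCH_ILOCK_EXCL;
	ip->held = mode;

	if (ip->sick & SKETCH_SICK_BMBTD) {
		/* Extent list can't be read until repair; drop the lock. */
		ip->held = SKETCH_UNLOCKED;
		return -SKETCH_EFSCORRUPTED;
	}

	*mode_out = mode;
	return 0;
}
```

A caller in this model checks the return value first and only uses
*mode_out (and later unlocks with it) when the call returned 0.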

> +
>  	return lock_mode;
>  }
>  
> @@ -129,6 +143,21 @@ xfs_ilock_attr_map_shared(
>  	if (xfs_inode_has_attr_fork(ip) && xfs_need_iread_extents(&ip->i_af))
>  		lock_mode = XFS_ILOCK_EXCL;
>  	xfs_ilock(ip, lock_mode);
> +
> +	/*
> +	 * It's possible that the unlocked access of the attr fork to determine
> +	 * the lock mode could have raced with another thread that was failing
> +	 * to load the bmbt but hadn't yet torn down the iext tree.  Recheck
> +	 * the lock mode and upgrade to an exclusive lock if we need to.
> +	 */
> +	if (lock_mode == XFS_ILOCK_SHARED &&
> +	    xfs_inode_has_attr_fork(ip) &&
> +	    xfs_need_iread_extents(&ip->i_af)) {
> +		xfs_iunlock(ip, lock_mode);
> +		lock_mode = XFS_ILOCK_EXCL;
> +		xfs_ilock(ip, lock_mode);
> +	}

And this can just check for XFS_SICK_INO_BMBTA instead...

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx