Re: [PATCH] xfs: don't take a spinlock unconditionally in the DIO fastpath


 



On Wed, May 19, 2021 at 11:19:20AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@xxxxxxxxxx>
> 
> Because this happens at high thread counts on high IOPS devices
> doing mixed read/write AIO-DIO to a single file at about a million
> iops:
> 
>    64.09%     0.21%  [kernel]            [k] io_submit_one
>    - 63.87% io_submit_one
>       - 44.33% aio_write
>          - 42.70% xfs_file_write_iter
>             - 41.32% xfs_file_dio_write_aligned
>                - 25.51% xfs_file_write_checks
>                   - 21.60% _raw_spin_lock
>                      - 21.59% do_raw_spin_lock
>                         - 19.70% __pv_queued_spin_lock_slowpath
> 
> This also happens on the IO completion path:
> 
>    22.89%     0.69%  [kernel]            [k] xfs_dio_write_end_io
>    - 22.49% xfs_dio_write_end_io
>       - 21.79% _raw_spin_lock
>          - 20.97% do_raw_spin_lock
>             - 20.10% __pv_queued_spin_lock_slowpath
> 
> IOWs, fio is burning ~14 whole CPUs on this spin lock.
> 
> So, do an unlocked check against inode size first, then if we are
> at/beyond EOF, take the spinlock and recheck. This makes the
> spinlock disappear from the overwrite fastpath.
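
To illustrate the pattern being described - this is only a standalone
userspace sketch with made-up names (isize_sketch, needs_zeroing), not
the kernel code in the patch below - the unlocked check handles the
common in-EOF case and the lock is only taken for the recheck:

	#include <pthread.h>
	#include <stdatomic.h>
	#include <stdbool.h>

	/* Hypothetical stand-ins for the inode size and ip->i_flags_lock. */
	struct isize_sketch {
		_Atomic long long	size;	/* only ever grows while writes run */
		pthread_spinlock_t	lock;	/* serialises size updates */
	};

	/* Does a write starting at @pos need EOF zeroing? */
	static bool needs_zeroing(struct isize_sketch *s, long long pos)
	{
		/* Unlocked fast path: writes inside EOF never take the lock. */
		if (pos <= atomic_load(&s->size))
			return false;

		/* Slow path: EOF may have moved, so recheck under the lock. */
		pthread_spin_lock(&s->lock);
		if (pos <= atomic_load(&s->size)) {
			pthread_spin_unlock(&s->lock);
			return false;
		}
		pthread_spin_unlock(&s->lock);
		return true;
	}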
> 
> I'd like to report that fixing this makes things go faster.

Maybe you meant that this does not make things go faster?

> It
> doesn't - it just exposes the XFS_ILOCK as the next severe
> contention point doing extent mapping lookups, and that now burns
> all 14 CPUs this spinlock was burning.
> 
> Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>

The patch looks good, and the comments about why it's safe to not take the
spinlock (especially why the EOF can't move backwards) are very welcome.

Feel free to add:
Reviewed-by: Carlos Maiolino <cmaiolino@xxxxxxxxxx>
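
To make the "EOF can only move forwards" argument concrete, here is the
completion-side variant of the same sketch (again hypothetical userspace
code reusing the isize_sketch type from above, not the real
xfs_dio_write_end_io()). A racing completion can only extend the size,
so it can only make the unlocked test pass where it would otherwise have
taken the lock - it can never cause a needed update to be skipped:

	/* Called when a write of @size bytes at @offset has completed. */
	static void dio_end_io_sketch(struct isize_sketch *s,
			long long offset, long long size)
	{
		long long end = offset + size;

		/* Fast path: the IO landed entirely inside the current EOF. */
		if (end <= atomic_load(&s->size))
			return;

		/* Slow path: extend EOF under the lock, rechecking the size first. */
		pthread_spin_lock(&s->lock);
		if (end > atomic_load(&s->size))
			atomic_store(&s->size, end);
		pthread_spin_unlock(&s->lock);
	}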

> ---
>  fs/xfs/xfs_file.c | 42 +++++++++++++++++++++++++++++++-----------
>  1 file changed, 31 insertions(+), 11 deletions(-)
> 
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 396ef36dcd0a..c068dcd414f4 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -384,21 +384,30 @@ xfs_file_write_checks(
>  		}
>  		goto restart;
>  	}
> +
>  	/*
>  	 * If the offset is beyond the size of the file, we need to zero any
>  	 * blocks that fall between the existing EOF and the start of this
> -	 * write.  If zeroing is needed and we are currently holding the
> -	 * iolock shared, we need to update it to exclusive which implies
> -	 * having to redo all checks before.
> +	 * write.  If zeroing is needed and we are currently holding the iolock
> +	 * shared, we need to update it to exclusive which implies having to
> +	 * redo all checks before.
> +	 *
> +	 * We need to serialise against EOF updates that occur in IO completions
> +	 * here. We want to make sure that nobody is changing the size while we
> +	 * do this check until we have placed an IO barrier (i.e.  hold the
> +	 * XFS_IOLOCK_EXCL) that prevents new IO from being dispatched.  The
> +	 * spinlock effectively forms a memory barrier once we have the
> +	 * XFS_IOLOCK_EXCL so we are guaranteed to see the latest EOF value and
> +	 * hence be able to correctly determine if we need to run zeroing.
>  	 *
> -	 * We need to serialise against EOF updates that occur in IO
> -	 * completions here. We want to make sure that nobody is changing the
> -	 * size while we do this check until we have placed an IO barrier (i.e.
> -	 * hold the XFS_IOLOCK_EXCL) that prevents new IO from being dispatched.
> -	 * The spinlock effectively forms a memory barrier once we have the
> -	 * XFS_IOLOCK_EXCL so we are guaranteed to see the latest EOF value
> -	 * and hence be able to correctly determine if we need to run zeroing.
> +	 * We can do an unlocked check here safely as IO completion can only
> +	 * extend EOF. Truncate is locked out at this point, so the EOF can
> +	 * not move backwards, only forwards. Hence we only need to take the
> +	 * slow path and spin locks when we are at or beyond the current EOF.
>  	 */
> +	if (iocb->ki_pos <= i_size_read(inode))
> +		goto out;
> +
>  	spin_lock(&ip->i_flags_lock);
>  	isize = i_size_read(inode);
>  	if (iocb->ki_pos > isize) {
> @@ -426,7 +435,7 @@ xfs_file_write_checks(
>  			drained_dio = true;
>  			goto restart;
>  		}
> -	
> +
>  		trace_xfs_zero_eof(ip, isize, iocb->ki_pos - isize);
>  		error = iomap_zero_range(inode, isize, iocb->ki_pos - isize,
>  				NULL, &xfs_buffered_write_iomap_ops);
> @@ -435,6 +444,7 @@ xfs_file_write_checks(
>  	} else
>  		spin_unlock(&ip->i_flags_lock);
>  
> +out:
>  	return file_modified(file);
>  }
>  
> @@ -500,7 +510,17 @@ xfs_dio_write_end_io(
>  	 * other IO completions here to update the EOF. Failing to serialise
>  	 * here can result in EOF moving backwards and Bad Things Happen when
>  	 * that occurs.
> +	 *
> +	 * As IO completion only ever extends EOF, we can do an unlocked check
> +	 * here to avoid taking the spinlock. If we land within the current EOF,
> +	 * then we do not need to do an extending update at all, and we don't
> +	 * need to take the lock to check this. If we race with an update moving
> +	 * EOF, then we'll either still be beyond EOF and need to take the lock,
> +	 * or we'll be within EOF and we don't need to take it at all.
>  	 */
> +	if (offset + size <= i_size_read(inode))
> +		goto out;
> +
>  	spin_lock(&ip->i_flags_lock);
>  	if (offset + size > i_size_read(inode)) {
>  		i_size_write(inode, offset + size);
> -- 
> 2.31.1
> 

-- 
Carlos



