Re: [PATCH] xfs: don't take a spinlock unconditionally in the DIO fastpath

Dave Chinner <david@xxxxxxxxxxxxx> · Wed, 2 Jun 2021 09:15:20 +1000

On Mon, May 31, 2021 at 11:28:25PM +0530, riteshh wrote:
> On 21/05/19 11:19AM, Dave Chinner wrote:
> > --- a/fs/xfs/xfs_file.c
> > +++ b/fs/xfs/xfs_file.c
> > @@ -384,21 +384,30 @@ xfs_file_write_checks(
> >  		}
> >  		goto restart;
> >  	}
> > +
> >  	/*
> >  	 * If the offset is beyond the size of the file, we need to zero any
> >  	 * blocks that fall between the existing EOF and the start of this
> > -	 * write.  If zeroing is needed and we are currently holding the
> > -	 * iolock shared, we need to update it to exclusive which implies
> > -	 * having to redo all checks before.
> > +	 * write.  If zeroing is needed and we are currently holding the iolock
> > +	 * shared, we need to update it to exclusive which implies having to
> > +	 * redo all checks before.
> > +	 *
> > +	 * We need to serialise against EOF updates that occur in IO completions
> > +	 * here. We want to make sure that nobody is changing the size while we
> > +	 * do this check until we have placed an IO barrier (i.e.  hold the
> > +	 * XFS_IOLOCK_EXCL) that prevents new IO from being dispatched.  The
> > +	 * spinlock effectively forms a memory barrier once we have the
> > +	 * XFS_IOLOCK_EXCL so we are guaranteed to see the latest EOF value and
> > +	 * hence be able to correctly determine if we need to run zeroing.
> >  	 *
> > -	 * We need to serialise against EOF updates that occur in IO
> > -	 * completions here. We want to make sure that nobody is changing the
> > -	 * size while we do this check until we have placed an IO barrier (i.e.
> > -	 * hold the XFS_IOLOCK_EXCL) that prevents new IO from being dispatched.
> > -	 * The spinlock effectively forms a memory barrier once we have the
> > -	 * XFS_IOLOCK_EXCL so we are guaranteed to see the latest EOF value
> > -	 * and hence be able to correctly determine if we need to run zeroing.
> > +	 * We can do an unlocked check here safely as IO completion can only
> > +	 * extend EOF. Truncate is locked out at this point, so the EOF can
> > +	 * not move backwards, only forwards. Hence we only need to take the
> > +	 * slow path and spin locks when we are at or beyond the current EOF.
> >  	 */
> > +	if (iocb->ki_pos <= i_size_read(inode))
> > +		goto out;
> > +
> >  	spin_lock(&ip->i_flags_lock);
> >  	isize = i_size_read(inode);
> >  	if (iocb->ki_pos > isize) {
> 
> Hello Dave/Jan,
> 
> Sorry about some silly queries here. But locking sometimes can get confusing and
> needs a background context/history.
> 
> So,
> I was going through the XFS DIO path and I couldn't completely get this below
> difference between xfs_file_dio_write_unaligned() v/s
> xfs_file_dio_write_aligned() checks for taking xfs iolock (inode rwsem)
> with different exclusivity(exclusive v/s shared).
> 
> I in xfs_**_unaligned() function, we also check if (ki_pos + count >= isize()).
> If yes, then we go for an exclusive iolock.
> While in xfs_**_aligned() function, we always take shared iolock.
> 
> Can you please help me understand why is that? In case of an extending aligned
> write, won't we need an exclusive iolock for XFS?

No. Extending the file is a slowpath operation which requires
exclusive locking. We always take the shared lock first if we can
because that's the normal fast path operation and so we optimise for
that case.

In the aligned DIO case, we check for sub-block EOF zeroing in
xfs_file_write_checks(). If needed, we upgrade the lock to exclusive
while the EOF zeroing is done. Once we return back to the aligned IO
code, we'll demote that exclusive lock back to shared for the block
aligned IO that we are issuing.

> Or IIUC, this exclusive lock is mostly needed to prevent two sub-bock zeroing
> from running in parallel (which if this happens could cause corruption)
> and this can only happen with unaligned writes.

The exclusive lock is needed for serialising zeroing operations,
whether it be zeroing for EOF extension or sub-block zeroing for
unaligned writes.

The reason for the EOF checks in the unaligned case is right there
in the comment above the EOF checks:

        /*
         * Extending writes need exclusivity because of the sub-block zeroing
         * that the DIO code always does for partial tail blocks beyond EOF, so
         * don't even bother trying the fast path in this case.
         */

IOWs, there is no possible "fast path" shared locking case for
unaligned extending DIOs, so we just take the exlusive lock right
from the start.

> Whereas, I guess ext4, still does exclusive lock even with aligned extending
> writes, possibly because of updation of inode->i_size and orphan inode
> handling requires it to take exclusive inode rwsem.
> 
> While for XFS inode->i_size updation happens with a different spinlock which is
> ip->i_flags_lock.

XFS is serialising DIO completion against DIO submission here rather
than anything else to do with inode size updates which are,
generally speaking, serialised at a higher level by various
combinations of i_rwsem, MMAPLOCK, ILOCK and inode_dio_wait().

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx