Re: [PATCH] xfs: don't take a spinlock unconditionally in the DIO fastpath

riteshh <riteshh@xxxxxxxxxxxxx> · Thu, 3 Jun 2021 20:24:13 +0530

On 21/06/02 09:15AM, Dave Chinner wrote:
> On Mon, May 31, 2021 at 11:28:25PM +0530, riteshh wrote:
> > On 21/05/19 11:19AM, Dave Chinner wrote:
> > > --- a/fs/xfs/xfs_file.c
> > > +++ b/fs/xfs/xfs_file.c
> > > @@ -384,21 +384,30 @@ xfs_file_write_checks(
> > >  		}
> > >  		goto restart;
> > >  	}
> > > +
> > >  	/*
> > >  	 * If the offset is beyond the size of the file, we need to zero any
> > >  	 * blocks that fall between the existing EOF and the start of this
> > > -	 * write.  If zeroing is needed and we are currently holding the
> > > -	 * iolock shared, we need to update it to exclusive which implies
> > > -	 * having to redo all checks before.
> > > +	 * write.  If zeroing is needed and we are currently holding the iolock
> > > +	 * shared, we need to update it to exclusive which implies having to
> > > +	 * redo all checks before.
> > > +	 *
> > > +	 * We need to serialise against EOF updates that occur in IO completions
> > > +	 * here. We want to make sure that nobody is changing the size while we
> > > +	 * do this check until we have placed an IO barrier (i.e.  hold the
> > > +	 * XFS_IOLOCK_EXCL) that prevents new IO from being dispatched.  The
> > > +	 * spinlock effectively forms a memory barrier once we have the
> > > +	 * XFS_IOLOCK_EXCL so we are guaranteed to see the latest EOF value and
> > > +	 * hence be able to correctly determine if we need to run zeroing.
> > >  	 *
> > > -	 * We need to serialise against EOF updates that occur in IO
> > > -	 * completions here. We want to make sure that nobody is changing the
> > > -	 * size while we do this check until we have placed an IO barrier (i.e.
> > > -	 * hold the XFS_IOLOCK_EXCL) that prevents new IO from being dispatched.
> > > -	 * The spinlock effectively forms a memory barrier once we have the
> > > -	 * XFS_IOLOCK_EXCL so we are guaranteed to see the latest EOF value
> > > -	 * and hence be able to correctly determine if we need to run zeroing.
> > > +	 * We can do an unlocked check here safely as IO completion can only
> > > +	 * extend EOF. Truncate is locked out at this point, so the EOF can
> > > +	 * not move backwards, only forwards. Hence we only need to take the
> > > +	 * slow path and spin locks when we are at or beyond the current EOF.
> > >  	 */
> > > +	if (iocb->ki_pos <= i_size_read(inode))
> > > +		goto out;
> > > +
> > >  	spin_lock(&ip->i_flags_lock);
> > >  	isize = i_size_read(inode);
> > >  	if (iocb->ki_pos > isize) {
> >
> > Hello Dave/Jan,
> >
> > Sorry about some silly queries here. But locking sometimes can get confusing and
> > needs a background context/history.
> >
> > So,
> > I was going through the XFS DIO path and I couldn't completely get this below
> > difference between xfs_file_dio_write_unaligned() v/s
> > xfs_file_dio_write_aligned() checks for taking xfs iolock (inode rwsem)
> > with different exclusivity(exclusive v/s shared).
> >
> > I in xfs_**_unaligned() function, we also check if (ki_pos + count >= isize()).
> > If yes, then we go for an exclusive iolock.
> > While in xfs_**_aligned() function, we always take shared iolock.
> >
> > Can you please help me understand why is that? In case of an extending aligned
> > write, won't we need an exclusive iolock for XFS?
>
> No. Extending the file is a slowpath operation which requires
> exclusive locking. We always take the shared lock first if we can
> because that's the normal fast path operation and so we optimise for
> that case.
>
> In the aligned DIO case, we check for sub-block EOF zeroing in
> xfs_file_write_checks(). If needed, we upgrade the lock to exclusive
> while the EOF zeroing is done. Once we return back to the aligned IO
> code, we'll demote that exclusive lock back to shared for the block
> aligned IO that we are issuing.
>
> > Or IIUC, this exclusive lock is mostly needed to prevent two sub-bock zeroing
> > from running in parallel (which if this happens could cause corruption)
> > and this can only happen with unaligned writes.
>
> The exclusive lock is needed for serialising zeroing operations,
> whether it be zeroing for EOF extension or sub-block zeroing for
> unaligned writes.
>
> The reason for the EOF checks in the unaligned case is right there
> in the comment above the EOF checks:
>
>         /*
>          * Extending writes need exclusivity because of the sub-block zeroing
>          * that the DIO code always does for partial tail blocks beyond EOF, so
>          * don't even bother trying the fast path in this case.
>          */
>
> IOWs, there is no possible "fast path" shared locking case for
> unaligned extending DIOs, so we just take the exlusive lock right
> from the start.
>
> > Whereas, I guess ext4, still does exclusive lock even with aligned extending
> > writes, possibly because of updation of inode->i_size and orphan inode
> > handling requires it to take exclusive inode rwsem.
> >
> > While for XFS inode->i_size updation happens with a different spinlock which is
> > ip->i_flags_lock.
>
> XFS is serialising DIO completion against DIO submission here rather
> than anything else to do with inode size updates which are,
> generally speaking, serialised at a higher level by various
> combinations of i_rwsem, MMAPLOCK, ILOCK and inode_dio_wait().

Thanks a lot Dave for detailed explaination about this.
This makes things quite clear now from XFS side.

-ritesh

>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@xxxxxxxxxxxxx