On Mon, Mar 25, 2019 at 11:47:22AM +0800, Zorro Lang wrote:
> On Sat, Mar 23, 2019 at 06:29:30PM +0800, Zorro Lang wrote:
> > On Fri, Mar 22, 2019 at 12:52:42PM -0400, Brian Foster wrote:
> > > XFS applies more strict serialization constraints to unaligned
> > > direct writes to accommodate things like direct I/O layer zeroing,
> > > unwritten extent conversion, etc. Unaligned submissions acquire the
> > > exclusive iolock and wait for in-flight dio to complete to ensure
> > > multiple submissions do not race on the same block and cause data
> > > corruption.
> > > 
> > > This generally works in the case of an aligned dio followed by an
> > > unaligned dio, but the serialization is lost if I/Os occur in the
> > > opposite order. If an unaligned write is submitted first and
> > > immediately followed by an overlapping, aligned write, the latter
> > > submits without the typical unaligned serialization barriers because
> > > there is no indication of an unaligned dio still in-flight. This can
> > > lead to unpredictable results.
> > > 
> > > To provide proper unaligned dio serialization, require that such
> > > direct writes are always the only dio allowed in-flight at one time
> > > for a particular inode. We already acquire the exclusive iolock and
> > > drain pending dio before submitting the unaligned dio. Wait once
> > > more after the dio submission to hold the iolock across the I/O and
> > > prevent further submissions until the unaligned I/O completes. This
> > > is heavy handed, but consistent with the current pre-submission
> > > serialization for unaligned direct writes.
> > > 
> > > Signed-off-by: Brian Foster <bfoster@xxxxxxxxxx>
> > > ---
> > > 
> > > I was originally going to deal with this problem by hacking in an inode
> > > flag to track unaligned dio writes in-flight and use that to block any
> > > follow on dio writes until cleared. Dave suggested we could use the
> > > iolock to serialize by converting unaligned async dio writes to sync dio
> > > writes and just letting the unaligned dio itself always block. That
> > > seemed reasonable to me, but I morphed the approach slightly to just use
> > > inode_dio_wait() because it seemed a bit cleaner. Thoughts?
> > > 
> > > Zorro,
> > > 
> > > You reproduced this problem originally. It addresses the problem in the
> > > test case that reproduced for me. Care to confirm whether this patch
> > > fixes the problem for you? Thanks.
> > 
> > Hi Brian,
> > 
> > Sure, but I can't reproduce this bug on upstream kernel. I have to merge
> > it into an older kernel (you know that :), to verify if it works.
> 
> I merged your patch into an older kernel which I can reproduce this bug.
> Then test passed.
> 

Excellent, thanks.

> BTW, Eryu said he hit this bug on upstream v5.0, on a KVM machine with LVM
> devices, when he ran the case which I sent to fstests@. So I think it's
> reproducible on upstream kernel. Just we need some conditions to trigger
> that. So if you know how to make this 'condition', please tell me, I'll
> think about if I can write another case to cover this bug specially.
> 

Ok. I could only reproduce with your custom reproducer (over loop) on
kernels between v4.14 and v4.20 (inclusive). More specifically, I could
reproduce between commits 546e7be824 ("iomap_dio_rw: Allocate AIO
completion queue before submitting dio") and a79d40e9b0 ("aio: only use
blk plugs for > 2 depth submissions"). As discussed, these commits mostly
just alter timing and thus affect the issue indirectly.
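(For reference, the submission pattern needed to hit the window is roughly
the libaio sketch below. This is an illustration only, not the actual
reproducer; the path, offsets and sizes are made up, and it assumes a 4k fs
block size so the 512 byte write is the sub-block "unaligned" one.)

/*
 * Illustrative sketch: submit a sub-block ("unaligned") AIO dio write and
 * immediately follow it with an overlapping block-aligned AIO dio write,
 * without waiting for the first to complete. Build with -laio. Path and
 * sizes are arbitrary and assume a 4k block filesystem.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <libaio.h>

int main(void)
{
	io_context_t ctx = 0;
	struct iocb cb1, cb2, *iocbs[2] = { &cb1, &cb2 };
	struct io_event events[2];
	void *ubuf, *abuf;
	int fd;

	fd = open("/mnt/scratch/file", O_RDWR | O_CREAT | O_DIRECT, 0644);
	if (fd < 0 || ftruncate(fd, 1 << 20) || io_setup(16, &ctx))
		return 1;
	if (posix_memalign(&ubuf, 4096, 4096) ||
	    posix_memalign(&abuf, 4096, 4096))
		return 1;
	memset(ubuf, 0xab, 4096);
	memset(abuf, 0xcd, 4096);

	/* sub-block (unaligned) write into the middle of block 1 */
	io_prep_pwrite(&cb1, fd, ubuf, 512, 4096 + 512);
	/* overlapping block-aligned write covering all of block 1 */
	io_prep_pwrite(&cb2, fd, abuf, 4096, 4096);

	/* submit both back to back with no wait in between */
	if (io_submit(ctx, 2, iocbs) != 2)
		return 1;
	io_getevents(ctx, 2, 2, events, NULL);

	io_destroy(ctx);
	close(fd);
	return 0;
}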
I haven't taken a closer look at the fstest yet beyond the custom variant
you provided. I wonder if a test could reproduce this more effectively by
increasing the load of the test..? For example, can you reproduce by running
many iterations of the I/O? What about running a deeper queue and submitting
many such overlapping aligned/unaligned I/Os all at once (to the same or
different offsets of the file)? Just a thought.. (I've appended a rough
sketch of what I mean after the quoted patch below.)

Brian

> Thanks,
> Zorro
> 
> > If upstream kernel has this issue too, do you have a better idea to reproduce
> > it on upstream? Maybe I can improve my case to cover more.
> > 
> > Thanks,
> > Zorro
> > 
> > > Brian
> > > 
> > >  fs/xfs/xfs_file.c | 21 ++++++++++++---------
> > >  1 file changed, 12 insertions(+), 9 deletions(-)
> > > 
> > > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> > > index 770cc2edf777..8b2aaed82343 100644
> > > --- a/fs/xfs/xfs_file.c
> > > +++ b/fs/xfs/xfs_file.c
> > > @@ -529,18 +529,19 @@ xfs_file_dio_aio_write(
> > >  	count = iov_iter_count(from);
> > >  
> > >  	/*
> > > -	 * If we are doing unaligned IO, wait for all other IO to drain,
> > > -	 * otherwise demote the lock if we had to take the exclusive lock
> > > -	 * for other reasons in xfs_file_aio_write_checks.
> > > +	 * If we are doing unaligned IO, we can't allow any other IO in-flight
> > > +	 * at the same time or we risk data corruption. Wait for all other IO to
> > > +	 * drain, submit and wait for completion before we release the iolock.
> > > +	 *
> > > +	 * If the IO is aligned, demote the iolock if we had to take the
> > > +	 * exclusive lock in xfs_file_aio_write_checks() for other reasons.
> > >  	 */
> > >  	if (unaligned_io) {
> > > -		/* If we are going to wait for other DIO to finish, bail */
> > > -		if (iocb->ki_flags & IOCB_NOWAIT) {
> > > -			if (atomic_read(&inode->i_dio_count))
> > > -				return -EAGAIN;
> > > -		} else {
> > > +		/* unaligned dio always waits, bail */
> > > +		if (iocb->ki_flags & IOCB_NOWAIT)
> > > +			return -EAGAIN;
> > > +		else
> > >  			inode_dio_wait(inode);
> > > -		}
> > >  	} else if (iolock == XFS_IOLOCK_EXCL) {
> > >  		xfs_ilock_demote(ip, XFS_IOLOCK_EXCL);
> > >  		iolock = XFS_IOLOCK_SHARED;
> > > @@ -548,6 +549,8 @@ xfs_file_dio_aio_write(
> > >  
> > >  	trace_xfs_file_direct_write(ip, count, iocb->ki_pos);
> > >  	ret = iomap_dio_rw(iocb, from, &xfs_iomap_ops, xfs_dio_write_end_io);
> > > +	if (unaligned_io && !is_sync_kiocb(iocb))
> > > +		inode_dio_wait(inode);
> > >  out:
> > >  	xfs_iunlock(ip, iolock);
> > > 
> > > -- 
> > > 2.17.2
> > > 
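For the deeper queue idea mentioned above, something along these lines is
what I had in mind. It's only a rough, untested sketch; the path, depth,
iteration count and sizes are arbitrary, and it assumes a 4k block
filesystem (build with -laio):

/*
 * Rough sketch: repeatedly batch-submit a deep queue of interleaved
 * block-aligned (4k) and sub-block (512b) AIO dio writes that overlap
 * the same file blocks, then reap and repeat.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <libaio.h>

#define DEPTH	64
#define BLKSZ	4096

int main(void)
{
	io_context_t ctx = 0;
	struct iocb cbs[DEPTH], *iocbs[DEPTH];
	struct io_event events[DEPTH];
	void *buf;
	int fd, i, iter;

	fd = open("/mnt/scratch/file", O_RDWR | O_CREAT | O_DIRECT, 0644);
	if (fd < 0 || ftruncate(fd, 1 << 20) || io_setup(DEPTH, &ctx))
		return 1;
	if (posix_memalign(&buf, BLKSZ, BLKSZ))
		return 1;
	memset(buf, 0xa5, BLKSZ);

	for (iter = 0; iter < 1000; iter++) {
		for (i = 0; i < DEPTH; i++) {
			long long off = (long long)(i / 2) * BLKSZ;

			/*
			 * Even slots cover a whole block with an aligned 4k
			 * write; odd slots drop a 512b sub-block write into
			 * the middle of the same block.
			 */
			if (i & 1)
				io_prep_pwrite(&cbs[i], fd, buf, 512, off + 512);
			else
				io_prep_pwrite(&cbs[i], fd, buf, BLKSZ, off);
			iocbs[i] = &cbs[i];
		}
		if (io_submit(ctx, DEPTH, iocbs) != DEPTH)
			break;
		io_getevents(ctx, DEPTH, DEPTH, events, NULL);
	}

	io_destroy(ctx);
	close(fd);
	return 0;
}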