Re: [PATCH 04/15] iomap: Call inode_dio_end() before generic_write_sync()

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Tue, Sep 22, 2020 at 09:31:56AM -0700, Darrick J. Wong wrote:
> On Tue, Sep 22, 2020 at 10:20:11AM -0400, Josef Bacik wrote:
> > On 9/21/20 10:43 AM, Goldwyn Rodrigues wrote:
> > > From: Goldwyn Rodrigues <rgoldwyn@xxxxxxxx>
> > > 
> > > iomap complete routine can deadlock with btrfs_fallocate because of the
> > > call to generic_write_sync().
> > > 
> > > P0                      P1
> > > inode_lock()            fallocate(FALLOC_FL_ZERO_RANGE)
> > > __iomap_dio_rw()        inode_lock()
> > >                          <block>
> > > <submits IO>
> > > <completes IO>
> > > inode_unlock()
> > >                          <gets inode_lock()>
> > >                          inode_dio_wait()
> > > iomap_dio_complete()
> > >    generic_write_sync()
> > >      btrfs_file_fsync()
> > >        inode_lock()
> > >        <deadlock>
> > > 
> > > inode_dio_end() is used to notify the end of DIO data in order
> > > to synchronize with truncate. Call inode_dio_end() before calling
> > > generic_write_sync(), so filesystems can lock i_rwsem during a sync.
> > > 
> > > ---
> > >   fs/iomap/direct-io.c | 2 +-
> > >   1 file changed, 1 insertion(+), 1 deletion(-)
> > > 
> > > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> > > index d970c6bbbe11..e01f81e7b76f 100644
> > > --- a/fs/iomap/direct-io.c
> > > +++ b/fs/iomap/direct-io.c
> > > @@ -118,6 +118,7 @@ ssize_t iomap_dio_complete(struct iomap_dio *dio)
> > >   			dio_warn_stale_pagecache(iocb->ki_filp);
> > >   	}
> > > +	inode_dio_end(file_inode(iocb->ki_filp));
> > >   	/*
> > >   	 * If this is a DSYNC write, make sure we push it to stable storage now
> > >   	 * that we've written data.
> > > @@ -125,7 +126,6 @@ ssize_t iomap_dio_complete(struct iomap_dio *dio)
> > >   	if (ret > 0 && (dio->flags & IOMAP_DIO_NEED_SYNC))
> > >   		ret = generic_write_sync(iocb, ret);
> > > -	inode_dio_end(file_inode(iocb->ki_filp));
> > >   	kfree(dio);
> > >   	return ret;
> > > 
> > 
> > Did you verify that xfs or ext4 don't rely on the inode_dio_end() happening
> > before the generic_write_sync()?  I wouldn't expect that they would, but
> > we've already run into problems making those kind of assumptions.  If it's
> > fine you can add
> 
> I was gonna ask the same question, but as there's no SoB on this patch I
> hadn't really looked at it yet. ;)
> 
> Operations that rely on inode_dio_wait to have blocked until all the
> directios are complete could get tripped up by iomap not having done the
> generic_write_sync to stabilise the metadata, but I /think/ most
> operations that do that also themselves flush the file.  But I don't
> really know if there's a subtlety there if the inode_dio_wait thread
> manages to grab the ILOCK before the generic_write_sync thread does.

I did point out in the previous thread that this actually means that
inode_dio_wait() now has inconsistent wait semantics for O_DSYNC
writes. If it's a pure overwrite and we hit the FUA path, the
O_DSYNC write will be complete and guaranteed to be on stable storage
before the IO completes. If the inode is metadata dirty, then the IO
will now be signalled complete *before* the data and metadata are
flushed to stable storage.

Hence, from the perspective of writes to *stable* storage, this
makes the ordering of O_DSYNC DIO against anything waiting for it to
complete to be potentially inconsistent at the stable storage level.

That's an extremely subtle change of behaviour, and something that
would be largely impossible to test or reproduce. And, really, I
don't like having this sort of "oh, it should be fine" handwavy
justification when we are talking about data integrity operations...

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx



[Index of Archives]     [Linux Ext4 Filesystem]     [Union Filesystem]     [Filesystem Testing]     [Ceph Users]     [Ecryptfs]     [AutoFS]     [Kernel Newbies]     [Share Photos]     [Security]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux Cachefs]     [Reiser Filesystem]     [Linux RAID]     [Samba]     [Device Mapper]     [CEPH Development]

  Powered by Linux