Re: [RFC PATCH] btrfs: don't call btrfs_sync_file from iomap context

On Thu, Sep 17, 2020 at 04:29:23PM +1000, Dave Chinner wrote:
> > inode_dio_wait really just waits for active I/O that writes to or reads
> > from the file.  It does not imply that the I/O is stable, just like
> > i_rwsem itself doesn't.
> 
> No, but iomap_dio_rw() considers a O_DSYNC write to be incomplete
> until it is stable so that it presents consistent behaviour to
> anything calling inode_dio_wait().

But the point is that inode_dio_wait does not care about that
"consistency".  It cares about when the I/O is done.  I know because I
wrote it (and I regret that, as we should have stuck with the non-owner
release of the rwsem, which makes a whole lot more sense).

> 
> > Various file systems have historically called
> > the syncing outside i_rwsem and inode_dio_wait (in fact that is what the
> > fs/direct-io.c code does, so XFS did as well until a few years ago), and
> > that isn't a problem at all - we just can't return to userspace (or call
> > ki_complete for in-kernel users) before the data is stable on disk.
> 
> I'm really not caring about userspace here - we use inode_dio_wait()
> as an IO completion notification for the purposes of synchronising
> internal filesystem state before modifying user data via direct
> metadata manipulation. Hence I want sane, consistent, predictable IO
> completion notification behaviour regardless of the implementation
> path it goes through.

And none of that consistency matters.  Think of it:

 - an O_(D)SYNC write is nothing but a write plus a ranged fsync,
   even if we do some optimizations to speed up the fsync by e.g.
   using the FUA flag
 - another fsync can come up at any time after we completed a write
   (with or without O_SYNC)
 - so any synchronization using inode_dio_wait (or i_rwsem for that
   matter) must not care if an fsync runs in parallel.
 - take a look at where we call inode_dio_wait to verify this - the
   prime original use case was truncate, as we can't have I/O in
   progress while truncating.  We then later extended it to all the
   more complicated truncate-like operations like hole punches, extent
   insert and collapse, etc.  But in all those cases what matters is
   the actual I/O, not the sync.  With direct I/O the page cache
   side of the sync doesn't matter to start with (but the callers
   all invalidate it anyway), so what matters is the metadata
   flush, aka the log force in the XFS case.  And that absolutely
   does not need to be done before inode_dio_wait returns.
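
The first point can be illustrated from userspace.  This is only a
sketch, not kernel code: the function names write_dsync() and
write_then_sync() are made up for the example, and because there is no
portable ranged fdatasync() the second variant syncs the whole file
rather than just the written range.  Both variants return only once the
data is stable, which is the guarantee the caller actually gets.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Variant A (hypothetical helper): open with O_DSYNC so that pwrite()
 * itself does not return before the data is stable.
 */
int write_dsync(const char *path, const void *buf, size_t len)
{
	int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC | O_DSYNC, 0644);
	ssize_t ret;

	if (fd < 0)
		return -1;
	ret = pwrite(fd, buf, len, 0);
	close(fd);
	return ret == (ssize_t)len ? 0 : -1;
}

/*
 * Variant B (hypothetical helper): plain write followed by an explicit
 * fdatasync().  The sync is a separate step after the I/O is done, and
 * nothing stops another fsync from running in parallel with it.
 */
int write_then_sync(const char *path, const void *buf, size_t len)
{
	int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	int err = -1;

	if (fd < 0)
		return -1;
	if (pwrite(fd, buf, len, 0) == (ssize_t)len && fdatasync(fd) == 0)
		err = 0;
	close(fd);
	return err;
}
```

Calling either helper with, say, a path under /tmp and a small buffer
returns 0 on success and -1 on failure; the difference is only in
where the sync happens, not in what the caller can rely on afterwards.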

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@xxxxxxxxxxxxx
---end quoted text---


