Re: [man-pages RFC PATCH v4] statx, inode: document the new STATX_INO_VERSION field

Dave Chinner <david@xxxxxxxxxxxxx> · Tue, 13 Sep 2022 11:15:18 +1000

On Tue, Sep 13, 2022 at 09:29:48AM +1000, NeilBrown wrote:
> On Mon, 12 Sep 2022, Jeff Layton wrote:
> > On Sun, 2022-09-11 at 08:53 +1000, NeilBrown wrote:
> > > This could be expensive.
> > > 
> > > There is not currently any locking around O_DIRECT writes.  You cannot
> > > synchronise with them.
> > > 
> > 
> > AFAICT, DIO write() implementations in btrfs, ext4, and xfs all hold
> > inode_lock_shared across the I/O. That was why patch #8 takes the
> > inode_lock (exclusive) across the getattr.
> 
> Looking at ext4_dio_write_iter() it certain does take
> inode_lock_shared() before starting the write and in some cases it
> requests, using IOMAP_DIO_FORCE_WAIT, that imap_dio_rw() should wait for
> the write to complete.  But not in all cases.
> So I don't think it always holds the shared lock across all direct IO.

To serialise against dio writes, one must:

	// Lock the inode exclusively to block new DIO submissions
	inode_lock(inode);

	// Wait for all in flight DIO reads and writes to complete
	inode_dio_wait(inode);

This is how truncate, fallocate, etc serialise against AIO+DIO which
do not hold the inode lock across the entire IO. These have to
serialise aginst DIO reads, too, because we can't have IO in
progress over a range of the file that we are about to free....

> > Taking inode_lock_shared is sufficient to block out buffered and DAX
> > writes. DIO writes sometimes only take the shared lock (e.g. when the
> > data is already properly aligned). If we want to ensure the getattr
> > doesn't run while _any_ writes are running, we'd need the exclusive
> > lock.
> 
> But the exclusive lock is bad for scalability.

Serilisation against DIO is -expensive- and -slow-. It's not a
solution for what is supposed to be a fast unlocked read-only
operation like statx().

> > Maybe that's overkill, though it seems like we could have a race like
> > this without taking inode_lock across the getattr:
> > 
> > reader				writer
> > -----------------------------------------------------------------
> > 				i_version++
> > getattr
> > read
> > 				DIO write to backing store
> > 
> 
> This is why I keep saying that the i_version increment must be after the
> write, not before it.

Sure, but that ignores the reason why we actually need to bump
i_version *before* we submit a DIO write. DIO write invalidates the
page cache over the range of the write, so any racing read will
re-populate the page cache during the DIO write.

Hence buffered reads can return before the DIO write has completed,
and the contents of the read can contain, none, some or all of the
contents of the DIO write. Hence i_version has to be incremented
before the DIO write is submitted so that racing getattrs will
indicate that the local caches have been invalidated and that data
needs to be refetched.

But, yes, to actually be safe here, we *also* should be bumping
i_version on DIO write on DIO write completion so that racing
i_version and data reads that occur *after* the initial i_version
bump are invalidated immediately.

IOWs, to avoid getattr/read races missing stale data invalidations
during DIO writes, we really need to bump i_version both _before and
after_ DIO write submission.

It's corner cases like this where "i_version should only be bumped
when ctime changes" fails completely. i.e. there are concurrent IO
situations which can only really be handled correctly by bumping
i_version whenever either in-memory and/or on-disk persistent data/
metadata state changes occur.....

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx