Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO

Kent Overstreet <kent.overstreet@xxxxxxxxx> · Wed, 28 Feb 2024 19:57:38 -0500

On Thu, Feb 29, 2024 at 11:25:33AM +1100, Dave Chinner wrote:
> On Wed, Feb 28, 2024 at 09:48:46AM +0200, Amir Goldstein wrote:
> > On Wed, Feb 28, 2024 at 12:42 AM Dave Chinner via Lsf-pc
> > <lsf-pc@xxxxxxxxxxxxxxxxxxxxxxxxxx> wrote:
> > >
> > > On Tue, Feb 27, 2024 at 05:21:20PM -0500, Kent Overstreet wrote:
> > > > On Wed, Feb 28, 2024 at 09:13:05AM +1100, Dave Chinner wrote:
> > > > > On Tue, Feb 27, 2024 at 05:07:30AM -0500, Kent Overstreet wrote:
> > > > > > AFAIK every filesystem allows concurrent direct writes, not just xfs,
> > > > > > it's _buffered_ writes that we care about here.
> > > > >
> > > > > We could do concurrent buffered writes in XFS - we would just use
> > > > > the same locking strategy as direct IO and fall back on folio locks
> > > > > for copy-in exclusion like ext4 does.
> > > >
> > > > ext4 code doesn't do that. it takes the inode lock in exclusive mode,
> > > > just like everyone else.
> > >
> > > Uhuh. ext4 does allow concurrent DIO writes. It's just much more
> > > constrained than XFS. See ext4_dio_write_checks().
> > >
> > > > > The real question is how much of userspace will that break, because
> > > > > of implicit assumptions that the kernel has always serialised
> > > > > buffered writes?
> > > >
> > > > What would break?
> > >
> > > Good question. If you don't know the answer, then you've got the
> > > same problem as I have. i.e. we don't know if concurrent
> > > applications that use buffered IO extensively (eg. postgres) assume
> > > data coherency because of the implicit serialisation occurring
> > > during buffered IO writes?
> > >
> > > > > > If we do a short write because of a page fault (despite previously
> > > > > > faulting in the userspace buffer), there is no way to completely prevent
> > > > > > torn writes an atomicity breakage; we could at least try a trylock on
> > > > > > the inode lock, I didn't do that here.
> > > > >
> > > > > As soon as we go for concurrent writes, we give up on any concept of
> > > > > atomicity of buffered writes (esp. w.r.t reads), so this really
> > > > > doesn't matter at all.
> > > >
> > > > We've already given up buffered write vs. read atomicity, have for a
> > > > long time - buffered read path takes no locks.
> > >
> > > We still have explicit buffered read() vs buffered write() atomicity
> > > in XFS via buffered reads taking the inode lock shared (see
> > > xfs_file_buffered_read()) because that's what POSIX says we should
> > > have.
> > >
> > > Essentially, we need to explicitly give POSIX the big finger and
> > > state that there are no atomicity guarantees given for write() calls
> > > of any size, nor are there any guarantees for data coherency for
> > > any overlapping concurrent buffered IO operations.
> > >
> > 
> > I have disabled read vs. write atomicity (out-of-tree) to make xfs behave
> > as the other fs ever since Jan has added the invalidate_lock and I believe
> > that Meta kernel has done that way before.
> > 
> > > Those are things we haven't completely given up yet w.r.t. buffered
> > > IO, and enabling concurrent buffered writes will expose to users.
> > > So we need to have explicit policies for this and document them
> > > clearly in all the places that application developers might look
> > > for behavioural hints.
> > 
> > That's doable - I can try to do that.
> > What is your take regarding opt-in/opt-out of legacy behavior?
> 
> Screw the legacy code, don't even make it an option. No-one should
> be relying on large buffered writes being atomic anymore, and with
> high order folios in the page cache most small buffered writes are
> going to be atomic w.r.t. both reads and writes anyway.

That's a new take...

> 
> > At the time, I have proposed POSIX_FADV_TORN_RW API [1]
> > to opt-out of the legacy POSIX behavior, but I guess that an xfs mount
> > option would make more sense for consistent and clear semantics across
> > the fs - it is easier if all buffered IO to inode behaved the same way.
> 
> No mount options, just change the behaviour. Applications already
> have to avoid concurrent overlapping buffered reads and writes if
> they care about data integrity and coherency, so making buffered
> writes concurrent doesn't change anything.

Honestly - no.

Userspace would really like to see some sort of definition for this kind
of behaviour, and if we just change things underneath them without
telling anyone, _that's a dick move_.

POSIX_FADV_TORN_RW is a terrible name, though.

And fadvise() is the wrong API for this because it applies to ranges,
this should be an open flag or an fcntl.