Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO

Dave Chinner <david@xxxxxxxxxxxxx> · Thu, 29 Feb 2024 11:25:33 +1100

On Wed, Feb 28, 2024 at 09:48:46AM +0200, Amir Goldstein wrote:
> On Wed, Feb 28, 2024 at 12:42 AM Dave Chinner via Lsf-pc
> <lsf-pc@xxxxxxxxxxxxxxxxxxxxxxxxxx> wrote:
> >
> > On Tue, Feb 27, 2024 at 05:21:20PM -0500, Kent Overstreet wrote:
> > > On Wed, Feb 28, 2024 at 09:13:05AM +1100, Dave Chinner wrote:
> > > > On Tue, Feb 27, 2024 at 05:07:30AM -0500, Kent Overstreet wrote:
> > > > > AFAIK every filesystem allows concurrent direct writes, not just xfs,
> > > > > it's _buffered_ writes that we care about here.
> > > >
> > > > We could do concurrent buffered writes in XFS - we would just use
> > > > the same locking strategy as direct IO and fall back on folio locks
> > > > for copy-in exclusion like ext4 does.
> > >
> > > ext4 code doesn't do that. it takes the inode lock in exclusive mode,
> > > just like everyone else.
> >
> > Uhuh. ext4 does allow concurrent DIO writes. It's just much more
> > constrained than XFS. See ext4_dio_write_checks().
> >
> > > > The real question is how much of userspace will that break, because
> > > > of implicit assumptions that the kernel has always serialised
> > > > buffered writes?
> > >
> > > What would break?
> >
> > Good question. If you don't know the answer, then you've got the
> > same problem as I have. i.e. we don't know if concurrent
> > applications that use buffered IO extensively (eg. postgres) assume
> > data coherency because of the implicit serialisation occurring
> > during buffered IO writes?
> >
> > > > > If we do a short write because of a page fault (despite previously
> > > > > faulting in the userspace buffer), there is no way to completely prevent
> > > > > torn writes an atomicity breakage; we could at least try a trylock on
> > > > > the inode lock, I didn't do that here.
> > > >
> > > > As soon as we go for concurrent writes, we give up on any concept of
> > > > atomicity of buffered writes (esp. w.r.t reads), so this really
> > > > doesn't matter at all.
> > >
> > > We've already given up buffered write vs. read atomicity, have for a
> > > long time - buffered read path takes no locks.
> >
> > We still have explicit buffered read() vs buffered write() atomicity
> > in XFS via buffered reads taking the inode lock shared (see
> > xfs_file_buffered_read()) because that's what POSIX says we should
> > have.
> >
> > Essentially, we need to explicitly give POSIX the big finger and
> > state that there are no atomicity guarantees given for write() calls
> > of any size, nor are there any guarantees for data coherency for
> > any overlapping concurrent buffered IO operations.
> >
> 
> I have disabled read vs. write atomicity (out-of-tree) to make xfs behave
> as the other fs ever since Jan has added the invalidate_lock and I believe
> that Meta kernel has done that way before.
> 
> > Those are things we haven't completely given up yet w.r.t. buffered
> > IO, and enabling concurrent buffered writes will expose to users.
> > So we need to have explicit policies for this and document them
> > clearly in all the places that application developers might look
> > for behavioural hints.
> 
> That's doable - I can try to do that.
> What is your take regarding opt-in/opt-out of legacy behavior?

Screw the legacy code, don't even make it an option. No-one should
be relying on large buffered writes being atomic anymore, and with
high order folios in the page cache most small buffered writes are
going to be atomic w.r.t. both reads and writes anyway.

> At the time, I have proposed POSIX_FADV_TORN_RW API [1]
> to opt-out of the legacy POSIX behavior, but I guess that an xfs mount
> option would make more sense for consistent and clear semantics across
> the fs - it is easier if all buffered IO to inode behaved the same way.

No mount options, just change the behaviour. Applications already
have to avoid concurrent overlapping buffered reads and writes if
they care about data integrity and coherency, so making buffered
writes concurrent doesn't change anything.

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx