Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO

Dave Chinner <david@xxxxxxxxxxxxx> · Mon, 4 Mar 2024 11:46:52 +1100

On Wed, Feb 28, 2024 at 07:57:38PM -0500, Kent Overstreet wrote:
> On Thu, Feb 29, 2024 at 11:25:33AM +1100, Dave Chinner wrote:
> > > That's doable - I can try to do that.
> > > What is your take regarding opt-in/opt-out of legacy behavior?
> > 
> > Screw the legacy code, don't even make it an option. No-one should
> > be relying on large buffered writes being atomic anymore, and with
> > high order folios in the page cache most small buffered writes are
> > going to be atomic w.r.t. both reads and writes anyway.
> 
> That's a new take...
> 
> > 
> > > At the time, I have proposed POSIX_FADV_TORN_RW API [1]
> > > to opt-out of the legacy POSIX behavior, but I guess that an xfs mount
> > > option would make more sense for consistent and clear semantics across
> > > the fs - it is easier if all buffered IO to inode behaved the same way.
> > 
> > No mount options, just change the behaviour. Applications already
> > have to avoid concurrent overlapping buffered reads and writes if
> > they care about data integrity and coherency, so making buffered
> > writes concurrent doesn't change anything.
> 
> Honestly - no.
> 
> Userspace would really like to see some sort of definition for this kind
> of behaviour, and if we just change things underneath them without
> telling anyone, _that's a dick move_.

I don't think you understand the full picture here, Kent.

> POSIX_FADV_TORN_RW is a terrible name, though.

The described behaviour for this advice is the standard behaviour
for ext4, btrfs and most Linux filesystems other than XFS. It has
been for a -long- time.

The only filesystem that gives anything resembling POSIX atomic
write behaviour is XFS. No other filesystem in Linux actually
provides the POSIX "buffered reads won't see partial data from
buffered writes in progress" behaviour that XFS does via the IOLOCK
behaviour it uses.

So when I say "screw the legacy apps" I'm talking about the ancient
enterprise applications that still behave as if this POSIX behaviour
is reliable on modern Linux systems. It simply isn't, and these apps
are *already implicitly broken* on most Linux filesystems and they
need fixing.
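
For anyone who wants to see the difference on their own filesystem, a
rough probe could look like the sketch below. It is only an
illustration: the names is_torn(), count_torn_reads() and BUF_LEN are
made up for this example, and the 1 MiB size is an arbitrary choice. A
child process keeps overwriting the same range with a single repeated
byte value while the parent reads the range back and counts reads that
contain a mix of values.

```c
#include <fcntl.h>
#include <signal.h>
#include <stdbool.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

enum { BUF_LEN = 1 << 20 };	/* 1 MiB per write; arbitrary for this sketch */

/* A read is "torn" if the buffer mixes bytes from two different writes. */
bool is_torn(const unsigned char *buf, size_t len)
{
	for (size_t i = 1; i < len; i++)
		if (buf[i] != buf[0])
			return true;
	return false;
}

/*
 * Hypothetical probe: returns how many of `iterations` full-length
 * buffered reads saw mixed data, or -1 if setup failed.
 */
long count_torn_reads(const char *path, long iterations)
{
	static unsigned char buf[BUF_LEN];
	int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);

	if (fd < 0)
		return -1;
	memset(buf, 'A', BUF_LEN);
	if (pwrite(fd, buf, BUF_LEN, 0) != BUF_LEN) {
		close(fd);
		return -1;
	}

	pid_t writer = fork();
	if (writer < 0) {
		close(fd);
		return -1;
	}
	if (writer == 0) {
		/* Child: alternate all-'A' and all-'B' overwrites until killed. */
		for (unsigned char c = 'B';; c = (c == 'A') ? 'B' : 'A') {
			memset(buf, c, BUF_LEN);
			if (pwrite(fd, buf, BUF_LEN, 0) < 0)
				_exit(1);
		}
	}

	long torn = 0;
	for (long i = 0; i < iterations; i++) {
		ssize_t n = pread(fd, buf, BUF_LEN, 0);

		/* Short reads are ignored; only full-length reads are judged. */
		if (n == BUF_LEN && is_torn(buf, (size_t)n))
			torn++;
	}

	kill(writer, SIGKILL);
	waitpid(writer, NULL, 0);
	close(fd);
	return torn;
}
```

A non-zero count means buffered reads on that filesystem can observe a
buffered write in progress. A zero count proves nothing on its own,
since the race may simply not have been hit in that run.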

> And fadvise() is the wrong API for this because it applies to ranges,
> this should be an open flag or an fcntl.

Not only is it the wrong API, it's also the wrong approach to take.
We have a new API coming through for atomic writes: RWF_ATOMIC.

If an application needs an actual atomic IO guarantee, it will
soon be able to state that requirement explicitly, rather than
depending on whether the filesystem it happens to run on provides
implicit atomic write behaviour.
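
As a rough sketch of what "being explicit" could look like from
userspace, assuming RWF_ATOMIC is exposed as a per-IO flag to
pwritev2() the way the other RWF_* flags are: the helper name
atomic_pwrite() is made up for this example, and the fallback flag
value is an assumption that only matters if your headers don't define
RWF_ATOMIC yet. Which files and IO modes accept the flag depends on
the kernel and filesystem, so the caller has to be prepared for an
error.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/types.h>
#include <sys/uio.h>

#ifndef RWF_ATOMIC
/* Assumed value, used only when the installed headers lack the flag. */
#define RWF_ATOMIC 0x00000040
#endif

/*
 * Hypothetical helper: issue one write that must be atomic or fail.
 * Returns 0 on success and a negative errno value otherwise. There is
 * deliberately no silent fallback to a plain write, because that would
 * hide the loss of the guarantee from the caller.
 */
int atomic_pwrite(int fd, const void *buf, size_t len, off_t off)
{
	struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
	ssize_t n = pwritev2(fd, &iov, 1, off, RWF_ATOMIC);

	if (n < 0)
		return -errno;		/* e.g. flag not supported for this file */
	if ((size_t)n != len)
		return -EIO;		/* a short write is not an atomic write */
	return 0;
}
```

The point of the sketch is that the requirement travels with the IO
itself, so an application finds out immediately when the guarantee is
not available instead of assuming it based on the filesystem type.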

Indeed, we don't actually say that XFS will always guarantee POSIX
atomic buffered IO semantics - we've just never decided that the
time is right to change the behaviour.

In making such a change to XFS, normal buffered writes will get
mostly the same behaviour as they do now because we now use high
order folios in the page cache and serialisation will be done
against high-order ranges rather than individual pages. And
applications that actually need atomic IO semantics can use
RWF_ATOMIC and in that case we can do explicitly serialised buffered
writes that lock out concurrent buffered reads as we do right now.

IOWs, there is no better time to convert XFS behaviour to match all
the other Linux filesystems than right now. Applications that need
atomic IO guarantees can use RWF_ATOMIC, and everyone else can get
the performance benefits that come from no longer trying to make
buffered IO implicitly "atomic".

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx