On Wed, Feb 28, 2024 at 07:57:38PM -0500, Kent Overstreet wrote:
> On Thu, Feb 29, 2024 at 11:25:33AM +1100, Dave Chinner wrote:
> > > That's doable - I can try to do that.
> > > What is your take regarding opt-in/opt-out of legacy behavior?
> >
> > Screw the legacy code, don't even make it an option. No-one should
> > be relying on large buffered writes being atomic anymore, and with
> > high order folios in the page cache most small buffered writes are
> > going to be atomic w.r.t. both reads and writes anyway.
>
> That's a new take...
>
> >
> > > At the time, I have proposed POSIX_FADV_TORN_RW API [1]
> > > to opt-out of the legacy POSIX behavior, but I guess that an xfs mount
> > > option would make more sense for consistent and clear semantics across
> > > the fs - it is easier if all buffered IO to inode behaved the same way.
> >
> > No mount options, just change the behaviour. Applications already
> > have to avoid concurrent overlapping buffered reads and writes if
> > they care about data integrity and coherency, so making buffered
> > writes concurrent doesn't change anything.
>
> Honestly - no.
>
> Userspace would really like to see some sort of definition for this kind
> of behaviour, and if we just change things underneath them without
> telling anyone, _that's a dick move_.

I don't think you understand the full picture here, Kent.

> POSIX_FADV_TORN_RW is a terrible name, though.

The described behaviour for this advice is the standard behaviour
for ext4, btrfs and most linux filesystems other than XFS. It has
been for a -long- time.

The only filesystem that gives anything resembling POSIX atomic
write behaviour is XFS. No other filesystem in Linux actually
provides the POSIX "buffered reads won't see partial data from
buffered writes in progress" behaviour that XFS does via the IOLOCK
behaviour it uses.
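
To illustrate what "avoid concurrent overlapping buffered reads and
writes" can look like from userspace, here is a minimal sketch in
Python. It is not taken from this thread. It serialises access to a
byte range with advisory fcntl() record locks. The helper names
locked_write() and locked_read() are hypothetical, and the locks only
exclude other cooperating processes that take the same locks, not
threads inside one process.

```python
import fcntl
import os


def locked_write(fd, data, offset):
    """Write data at offset while holding an exclusive lock on that range."""
    if not data:
        return  # a zero length would mean "lock to end of file"
    fcntl.lockf(fd, fcntl.LOCK_EX, len(data), offset, os.SEEK_SET)
    try:
        done = 0
        while done < len(data):  # pwrite() may write fewer bytes than asked
            done += os.pwrite(fd, data[done:], offset + done)
    finally:
        fcntl.lockf(fd, fcntl.LOCK_UN, len(data), offset, os.SEEK_SET)


def locked_read(fd, length, offset):
    """Read a range under a shared lock so no locked writer overlaps it."""
    fcntl.lockf(fd, fcntl.LOCK_SH, length, offset, os.SEEK_SET)
    try:
        return os.pread(fd, length, offset)
    finally:
        fcntl.lockf(fd, fcntl.LOCK_UN, length, offset, os.SEEK_SET)
```

A reader that goes through locked_read() cannot overlap a writer that
goes through locked_write() on the same range, whichever filesystem
backs the file, so it does not depend on the filesystem's own
serialisation.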

So when I say "screw the legacy apps" I'm talking about the ancient
enterprise applications that still behave as if this POSIX behaviour
is reliable on modern Linux systems. It simply isn't, and these apps
are *already implicitly broken* on most Linux filesystems and they
need fixing.

> And fadvise() is the wrong API for this because it applies to ranges,
> this should be an open flag or an fcntl.

Not only is it the wrong API, it's also the wrong approach to take.
We have a new API coming through for atomic writes: RWF_ATOMIC. If
an application needs an actual atomic IO guarantee, it will soon be
able to be explicit in its requirements, and it will not end up in
the situation where the filesystem it uses might determine whether
there is an implicit atomic write behaviour provided.

Indeed, we don't actually say that XFS will always guarantee POSIX
atomic buffered IO semantics - we've just never decided that the
time is right to change the behaviour.

In making such a change to XFS, normal buffered writes will get
mostly the same behaviour as they do now because we now use high
order folios in the page cache and serialisation will be done
against high-order ranges rather than individual pages.

And applications that actually need atomic IO semantics can use
RWF_ATOMIC and in that case we can do explicitly serialised buffered
writes that lock out concurrent buffered reads as we do right now.

IOWs, there is no better time to convert XFS behaviour to match all
the other Linux filesystems than right now. Applications that need
atomic IO guarantees can use RWF_ATOMIC, and everyone else can get
the performance benefits that come from no longer trying to make
buffered IO implicitly "atomic".

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx