On Wed, Feb 28, 2024 at 12:42 AM Dave Chinner via Lsf-pc <lsf-pc@xxxxxxxxxxxxxxxxxxxxxxxxxx> wrote: > > On Tue, Feb 27, 2024 at 05:21:20PM -0500, Kent Overstreet wrote: > > On Wed, Feb 28, 2024 at 09:13:05AM +1100, Dave Chinner wrote: > > > On Tue, Feb 27, 2024 at 05:07:30AM -0500, Kent Overstreet wrote: > > > > AFAIK every filesystem allows concurrent direct writes, not just xfs, > > > > it's _buffered_ writes that we care about here. > > > > > > We could do concurrent buffered writes in XFS - we would just use > > > the same locking strategy as direct IO and fall back on folio locks > > > for copy-in exclusion like ext4 does. > > > > ext4 code doesn't do that. it takes the inode lock in exclusive mode, > > just like everyone else. > > Uhuh. ext4 does allow concurrent DIO writes. It's just much more > constrained than XFS. See ext4_dio_write_checks(). > > > > The real question is how much of userspace will that break, because > > > of implicit assumptions that the kernel has always serialised > > > buffered writes? > > > > What would break? > > Good question. If you don't know the answer, then you've got the > same problem as I have. i.e. we don't know if concurrent > applications that use buffered IO extensively (eg. postgres) assume > data coherency because of the implicit serialisation occurring > during buffered IO writes? > > > > > If we do a short write because of a page fault (despite previously > > > > faulting in the userspace buffer), there is no way to completely prevent > > > > torn writes an atomicity breakage; we could at least try a trylock on > > > > the inode lock, I didn't do that here. > > > > > > As soon as we go for concurrent writes, we give up on any concept of > > > atomicity of buffered writes (esp. w.r.t reads), so this really > > > doesn't matter at all. > > > > We've already given up buffered write vs. read atomicity, have for a > > long time - buffered read path takes no locks. > > We still have explicit buffered read() vs buffered write() atomicity > in XFS via buffered reads taking the inode lock shared (see > xfs_file_buffered_read()) because that's what POSIX says we should > have. > > Essentially, we need to explicitly give POSIX the big finger and > state that there are no atomicity guarantees given for write() calls > of any size, nor are there any guarantees for data coherency for > any overlapping concurrent buffered IO operations. > I have disabled read vs. write atomicity (out-of-tree) to make xfs behave as the other fs ever since Jan has added the invalidate_lock and I believe that Meta kernel has done that way before. > Those are things we haven't completely given up yet w.r.t. buffered > IO, and enabling concurrent buffered writes will expose to users. > So we need to have explicit policies for this and document them > clearly in all the places that application developers might look > for behavioural hints. That's doable - I can try to do that. What is your take regarding opt-in/opt-out of legacy behavior? At the time, I have proposed POSIX_FADV_TORN_RW API [1] to opt-out of the legacy POSIX behavior, but I guess that an xfs mount option would make more sense for consistent and clear semantics across the fs - it is easier if all buffered IO to inode behaved the same way. Thanks, Amir. [1] https://lore.kernel.org/linux-xfs/CAOQ4uxguwnx4AxXqp_zjg39ZUaTGJEM2wNUPnNdtiqV2Q9woqA@xxxxxxxxxxxxxx/