On Tue, Jan 28, 2025 at 07:49:17AM +1100, Dave Chinner wrote:
> > As for why an exclusive lock is needed for append writes, it's because
> > we don't want the EOF to be modified during the append write.
>
> We don't care if the EOF moves during the append write at the
> filesystem level. We set kiocb->ki_pos = i_size_read() from
> generic_write_checks() under shared locking, and if we then race
> with another extending append write there are two cases:
>
> 1. the other task has already extended i_size; or
> 2. we have two IOs at the same offset (i.e. at i_size).
>
> In either case, we don't need exclusive locking for the IO because
> the worst thing that happens is that two IOs hit the same file
> offset. IOWs, it has always been left up to the application to
> serialise RWF_APPEND writes on XFS, not the filesystem.

I disagree. O_APPEND (RWF_APPEND is just the Linux-specific
per-I/O version of that) is extensively used for things like
multi-threaded loggers, where multiple threads do O_APPEND writes
to a single log file and expect not to lose data that way.

The fact that we currently don't guarantee that for O_DIRECT is a
bug, which is only papered over by the fact that barely anyone uses
O_DIRECT | O_APPEND, as that's not a very natural use case for most
applications (in fact NFS got away with never allowing it at all).
But extending racy O_APPEND to buffered writes would break a lot of
applications.
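
For illustration, the multi-threaded logger pattern referred to above
might look roughly like the sketch below. It is only a userspace
example, not kernel code: the helper names append_lines() and
run_logger_demo() are made up for this mail, it uses buffered I/O (no
O_DIRECT), and it assumes a local filesystem that serialises buffered
O_APPEND writes. Each thread opens its own descriptor with O_APPEND and
emits one record per write() call, with no locking in the application.

```python
import os
import tempfile
import threading


def append_lines(path, tag, count):
    """Append `count` records to `path` with one write() per record.

    Hypothetical helper: there is no application-level locking here; the
    code relies on O_APPEND to place each record at the current EOF.
    """
    fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
    try:
        for i in range(count):
            record = f"{tag}:{i:06d}\n".encode()
            written = os.write(fd, record)
            if written != len(record):
                # A short write would split the record, so treat it as fatal.
                raise OSError("short write while appending a log record")
    finally:
        os.close(fd)


def run_logger_demo(num_threads=4, lines_per_thread=500):
    """Run several appending threads against one file; return its lines."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.log")
        threads = [
            threading.Thread(
                target=append_lines, args=(path, f"t{n}", lines_per_thread)
            )
            for n in range(num_threads)
        ]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        with open(path, "rb") as f:
            return f.read().splitlines()


if __name__ == "__main__":
    lines = run_logger_demo()
    # If appends raced and landed at the same offset, records would be
    # overwritten and the unique count would fall short of the total.
    print(len(lines), "lines,", len(set(lines)), "unique")
```

An application written this way sees every record exactly once only if
the filesystem keeps "seek to EOF and write" atomic with respect to
other appenders. If two appends could land on the same offset, one
record would silently overwrite the other, which is the data loss such
loggers do not expect.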