On Mon, Jan 27, 2025 at 09:15:41PM -0800, Christoph Hellwig wrote:
> On Tue, Jan 28, 2025 at 07:49:17AM +1100, Dave Chinner wrote:
> > > As for why an exclusive lock is needed for append writes, it's
> > > because we don't want the EOF to be modified during the append
> > > write.
> > 
> > We don't care if the EOF moves during the append write at the
> > filesystem level. We set kiocb->ki_pos = i_size_read() from
> > generic_write_checks() under shared locking, and if we then race
> > with another extending append write there are two cases:
> > 
> > 1. the other task has already extended i_size; or
> > 2. we have two IOs at the same offset (i.e. at i_size).
> > 
> > In either case, we don't need exclusive locking for the IO because
> > the worst thing that happens is that two IOs hit the same file
> > offset. IOWs, it has always been left up to the application to
> > serialise RWF_APPEND writes on XFS, not the filesystem.
> 
> I disagree. O_APPEND (RWF_APPEND is just the Linux-specific
> per-I/O version of that) is extensively used for things like
> multi-threaded loggers where you have multiple threads doing
> O_APPEND writes to a single log file, and they expect to not lose
> data that way.

Sure, but I don't think we need full file-offset-scope IO exclusion
to solve that problem. We can still safely allow concurrent writes
within EOF to occur whilst another buffered append write is doing
file extension work.

IOWs, where we really need to get to is a model that allows
concurrent buffered IO at all times, except for the case where IO
operations that change the inode size need to serialise against other
similar operations (e.g. other EOF-extending IOs, truncate, etc).

Hence I think we can largely ignore O_APPEND for the purposes of
prototyping shared buffered IO and getting rid of the IOLOCK from the
XFS IO path. I may end up re-using the i_rwsem as an "EOF
modification" serialisation mechanism for O_APPEND and extending
writes in general, but I don't think we need a general write IO
exclusion mechanism for this...

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
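
[For illustration, a minimal userspace sketch of the serialisation
model described above. All names here are invented for the example; a
pthread mutex stands in for whatever kernel lock ends up serialising
EOF changes, and none of this is actual XFS/VFS code. The point it
demonstrates: within-EOF writes take no EOF-scope lock at all, while
append/extending writes serialise only the i_size update among
themselves, which also keeps concurrent O_APPEND loggers from
overwriting each other's records.]

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

struct inode_model {
	pthread_mutex_t eof_lock;  /* serialises i_size changes only */
	_Atomic long    i_size;    /* models i_size_read()/i_size_write() */
};

/*
 * Write entirely below EOF: needs no EOF-scope serialisation at all,
 * so it can run concurrently with an in-progress append.
 */
static void write_within_eof(struct inode_model *ip, long pos, long len)
{
	/* ... copy data into the page cache for [pos, pos + len) ... */
	(void)ip; (void)pos; (void)len;
}

/*
 * O_APPEND/extending write: serialise only the EOF modification, so
 * concurrent appenders get distinct offsets and lose no data.
 */
static long write_append(struct inode_model *ip, long len)
{
	pthread_mutex_lock(&ip->eof_lock);
	long pos = atomic_load(&ip->i_size);   /* our write offset */
	/* ... copy data into the page cache for [pos, pos + len) ... */
	atomic_store(&ip->i_size, pos + len);  /* move EOF */
	pthread_mutex_unlock(&ip->eof_lock);
	return pos;
}

int main(void)
{
	struct inode_model ip = {
		.eof_lock = PTHREAD_MUTEX_INITIALIZER,
		.i_size   = 0,
	};

	write_append(&ip, 4096);        /* extends EOF to 4096 */
	write_within_eof(&ip, 0, 512);  /* safe concurrently with appends */
	printf("i_size = %ld\n", atomic_load(&ip.i_size));
	return 0;
}

[Truncate and other size-changing operations would take eof_lock the
same way; the design point is that the exclusion is scoped to inode
size changes, not to all write IO on the file.]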