Re: [PATCH] xfs: Remove i_rwsem lock in buffered read

Dave Chinner <david@xxxxxxxxxxxxx> · Tue, 14 Jan 2025 11:09:36 +1100

On Fri, Jan 10, 2025 at 06:07:48PM +0100, Amir Goldstein wrote:
> On Fri, Jan 10, 2025 at 12:28 AM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> > That said, I just had a left-field idea for a quasi-range lock
> > that may allow random writes to run concurrently and atomically
> > with reads.
> >
> > Essentially, we add an unsigned long to the inode, and use it as a
> > lock bitmap. That gives up to 64 "lock segments" for the buffered
> > write. We may also need a "segment size" variable....
> >
> > The existing i_rwsem gets taken shared unless it is an extending
> > write.
> >
> > For a non-extending write, we then do an offset->segment translation
> > and lock that bit in the bit mask. If it's already locked, we wait
> > on the lock bit. i.e. shared IOLOCK, exclusive write bit lock.
> >
> > The segments are evenly sized - say a minimum of 64kB each, but when
> > EOF is extended or truncated (which is done with the i_rwsem held
> > exclusive) the segment size is rescaled. As nothing can hold bit
> > locks while the i_rwsem is held exclusive, this will not race with
> > anything.
> >
> > If we are doing an extending write, we take the i_rwsem shared
> > first, then check if the extension will rescale the locks. If lock
> > rescaling is needed, we have to take the i_rwsem exclusive to do the
> > EOF extension. Otherwise, the bit lock that covers EOF will
> > serialise file extensions so it can be done under a shared i_rwsem
> > safely.
> >
> > This will allow buffered writes to remain atomic w.r.t. each other,
> > and potentially allow buffered reads to wait on writes to the same
> > segment and so potentially provide buffered read vs buffered write
> > atomicity as well.
> >
> > If we need more concurrency than an unsigned long worth of bits for
> > buffered writes, then maybe we can enlarge the bitmap further.
> >
> > I suspect this can be extended to direct IO in a similar way to
> > buffered reads, and that then opens up the possibility of truncate
> > and fallocate() being able to use the bitmap for range exclusion,
> > too.
> >
> > The overhead is likely minimal - setting and clearing bits in a
> > bitmap, as opposed to tracking ranges in a tree structure....
> >
> > Thoughts?
> 
> I think that's a very neat idea, but it will not address the reference
> benchmark.
> The reference benchmark I started the original report with which is similar
> to my understanding to the benchmark that Chi is running simulates the
> workload of a database writing with buffered IO.
> 
> That means a very large file and small IO size ~64K.
> Leaving the probability of intersecting writes in the same segment quite high.

Likely - I recognised this granularity problem, though:

| If we need more concurrency than an unsigned long worth of bits for
| buffered writes, then maybe we can enlarge the bitmap further.

We could also hash offsets into the bitmap so that offset-local IO
hit different locks - the bit lock doesn't necessarily need to be
range based. i.e. this is a "finer grained" lock that will typically
increase concurrency. If we keep striving for perfect (i.e. scalable
range locks) we're not going to improve the situation any time
soon...

> Can we do this opportunistically based on available large folios?
> If IO size is within an existing folio, use the folio lock and IOLOCK_SHARED
> if it is not, use IOLOCK_EXCL?

The biggest problem with this is that direct IO will -not- do
folio-by folio locking, and so folio based locking does not work for
direct IO exclusion. Currently we get coherency against buffered
writes by writing back or invalidation dirty folios before doing
the DIO read/write. Because we hold the IOLOCK shared, a buffered
write will not redirty the page cache until the DIO write has been
submitted (and completed for non-async DIO).

Hence moving to shared i_rwsem, folio-based range locking for
buffered writes will lose all serialisation against DIO operations.
We will lose what coherency we currently have between buffered write
ops and DIO, and I don't think that's an acceptible trade-off.

i.e. The problem with using the mapping tree for DIO coherency is,
once again, locking overhead. If we have to insert exceptional
entries to lock the range in the mapping tree because there is no
folio present (for DIO to serialise against new buffered IOs), then
we are simply back to the same exclusive tree update scalability
problem that tree based range lock algorithms have....

That's why I suggested a bitmap lock external to both buffered and
direct IO...

> for a benchmark that does all buffered IO 64K aligned, wouldn't large folios
> naturally align to IO size and above?

Maybe, but I don't think folio/mapping tree based locking is a
workable solution. Hence the external bitmap idea...

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx