On Fri, Oct 20, 2023 at 08:34:48AM -0700, Darrick J. Wong wrote:
> On Thu, Oct 19, 2023 at 11:06:42PM -0700, Christoph Hellwig wrote:
> > On Thu, Oct 19, 2023 at 01:04:11PM -0700, Darrick J. Wong wrote:
> > > Well... the stupid answer is that I augmented generic/176 to try to race
> > > buffered and direct reads with cloning a million extents and print out
> > > when the racing reads completed.  On an unpatched kernel, the reads
> > > don't complete until the reflink does:
> > >
> > > So as you can see, reads from the reflink source file no longer
> > > experience a giant latency spike.  I also wrote an fstest to check this
> > > behavior; I'll attach it as a separate reply.
> >
> > Nice.  I guess write latency doesn't really matter for this use
> > case?
>
> Nope -- they've gotten libvirt to tell qemu to redirect vm disk writes
> to a new sidecar file.  Then they reflink the original source file to
> the backup file, but they want qemu to be able to service reads from
> that original source file while the reflink is ongoing.  When the backup
> is done, they commit the sidecar contents back into the original image.
>
> It would be kinda neat if we had file range locks.  Regular progress
> could shorten the range as it makes progress.  If the thread doing the
> reflink could find out that another thread has blocked on part of the
> file range, it could even hurry up and clone that part so that neither
> reads nor writes would see enormous latency spikes.
>
> Even better, we could actually support concurrent reads and writes to
> the page cache as long as the ranges don't overlap.  But that's all
> speculative until Dave dumps his old ranged lock patchset on the list.

The unfortunate reality is that range locks as I was trying to implement
them didn't scale - it was a failed experiment.

The issue is the internal tracking structure of a range lock.  It has to
be concurrency safe itself, and even with lockless tree structures using
per-node seqlocks for internal sequencing, they still rely on atomic ops
for safe concurrent access and updates.

Hence the best I could get out of an uncontended range lock (i.e. locking
different exclusive ranges concurrently) was about 400,000 lock/unlock
operations per second before the internal tracking structure broke down
under concurrent modification pressure.

That was a whole lot better than previous attempts that topped out at
~150,000 lock/unlock ops/s, but it's still far short of the ~3 million
concurrent shared lock/unlock ops/s that a rwsem could do on that same
machine.

Worse for range locks was that, once past peak performance, internal
contention within the range lock caused performance to fall off a cliff
and end up much worse than just using pure exclusive locking with a
mutex.

Hence without some novel new internal lockless and memory-allocation-free
tracking structure and algorithm, range locks will suck for the one thing
we want them for: high performance, highly concurrent access to discrete
ranges of a single file.

-Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx
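A rough userspace sketch of the structural problem described above: every
lock and unlock of any range has to mutate one shared tracking structure,
so even threads locking completely disjoint ranges serialise on it.  The
names, list-based tracking, and single-mutex locking here are illustrative
assumptions only, not the experimental patchset being discussed (which
used lockless trees with per-node seqlocks); the bottleneck it shows is
the same either way.

/*
 * Naive range lock: one shared list of held ranges guarded by a single
 * mutex.  Disjoint lockers still contend on tree->lock and walk the same
 * list, so uncontended range locking is never as cheap as an uncontended
 * rwsem.  Illustrative sketch, not the actual implementation.
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>

struct held_range {
	unsigned long		start;
	unsigned long		end;	/* inclusive */
	struct held_range	*next;
};

struct range_lock_tree {
	pthread_mutex_t		lock;		/* guards the list below */
	pthread_cond_t		unlocked;	/* signalled on every unlock */
	struct held_range	*held;		/* all currently held ranges */
};

static bool ranges_overlap(const struct held_range *r,
			   unsigned long start, unsigned long end)
{
	return r->start <= end && start <= r->end;
}

static bool range_is_free(struct range_lock_tree *tree,
			  unsigned long start, unsigned long end)
{
	for (struct held_range *r = tree->held; r; r = r->next)
		if (ranges_overlap(r, start, end))
			return false;
	return true;
}

/* Block until [start, end] overlaps no held range, then record it. */
void range_lock(struct range_lock_tree *tree,
		unsigned long start, unsigned long end)
{
	struct held_range *r = malloc(sizeof(*r));

	r->start = start;
	r->end = end;

	pthread_mutex_lock(&tree->lock);
	while (!range_is_free(tree, start, end))
		pthread_cond_wait(&tree->unlocked, &tree->lock);
	r->next = tree->held;
	tree->held = r;
	pthread_mutex_unlock(&tree->lock);
}

void range_unlock(struct range_lock_tree *tree,
		  unsigned long start, unsigned long end)
{
	pthread_mutex_lock(&tree->lock);
	for (struct held_range **p = &tree->held; *p; p = &(*p)->next) {
		if ((*p)->start == start && (*p)->end == end) {
			struct held_range *dead = *p;

			*p = dead->next;
			free(dead);
			break;
		}
	}
	pthread_cond_broadcast(&tree->unlocked);
	pthread_mutex_unlock(&tree->lock);
}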