On Fri, Oct 20, 2023 at 08:34:48AM -0700, Darrick J. Wong wrote:
> On Thu, Oct 19, 2023 at 11:06:42PM -0700, Christoph Hellwig wrote:
> > On Thu, Oct 19, 2023 at 01:04:11PM -0700, Darrick J. Wong wrote:
> > > Well... the stupid answer is that I augmented generic/176 to try to race
> > > buffered and direct reads with cloning a million extents and print out
> > > when the racing reads completed.  On an unpatched kernel, the reads
> > > don't complete until the reflink does:
> > >
> > > So as you can see, reads from the reflink source file no longer
> > > experience a giant latency spike.  I also wrote an fstest to check this
> > > behavior; I'll attach it as a separate reply.
> >
> > Nice.  I guess write latency doesn't really matter for this use
> > case?
>
> Nope -- they've gotten libvirt to tell qemu to redirect vm disk writes
> to a new sidecar file.  Then they reflink the original source file to
> the backup file, but they want qemu to be able to service reads from
> that original source file while the reflink is ongoing.  When the backup
> is done, they commit the sidecar contents back into the original image.
>
> It would be kinda neat if we had file range locks.  Regular progress
> could shorten the range as it makes progress.  If the thread doing the
> reflink could find out that another thread has blocked on part of the
> file range, it could even hurry up and clone that part so that neither
> reads nor writes would see enormous latency spikes.
>
> Even better, we could actually support concurrent reads and writes to
> the page cache as long as the ranges don't overlap.  But that's all
> speculative until Dave dumps his old ranged lock patchset on the list.

The unfortunate reality is that range locks as I was trying to implement
them didn't scale - it was a failed experiment.

The issue is the internal tracking structure of a range lock.  It has to
be concurrency safe itself, and even with lockless tree structures using
per-node seqlocks for internal sequencing, they still rely on atomic ops
for safe concurrent access and updates.

Hence the best I could get out of an uncontended range lock (i.e. locking
different exclusive ranges concurrently) was about 400,000 lock/unlock
operations per second before the internal tracking structure broke down
under concurrent modification pressure.

That was a whole lot better than previous attempts that topped out at
~150,000 lock/unlock ops/s, but it's still far short of the ~3 million
concurrent shared lock/unlock ops/s that a rwsem could do on that same
machine.

Worse for range locks was that, once past peak performance, internal
contention within the range lock caused performance to fall off a cliff
and end up much worse than just using pure exclusive locking with a
mutex.

Hence without some novel new internal lockless and memory-allocation-free
tracking structure and algorithm, range locks will suck for the one thing
we want them for: high performance, highly concurrent access to discrete
ranges of a single file.

-Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx
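A rough userspace sketch of the structural problem described above: every
lock and unlock of any range has to mutate one shared tracking structure,
so even threads locking completely disjoint ranges serialise on it.  The
names, list-based tracking, and single-mutex locking here are illustrative
assumptions only, not the experimental patchset being discussed (which
used lockless trees with per-node seqlocks); the bottleneck it shows is
the same either way.

/*
 * Naive range lock: one shared list of held ranges guarded by a single
 * mutex.  Disjoint lockers still contend on tree->lock and walk the same
 * list, so uncontended range locking is never as cheap as an uncontended
 * rwsem.  Illustrative sketch, not the actual implementation.
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>

struct held_range {
	unsigned long		start;
	unsigned long		end;	/* inclusive */
	struct held_range	*next;
};

struct range_lock_tree {
	pthread_mutex_t		lock;		/* guards the list below */
	pthread_cond_t		unlocked;	/* signalled on every unlock */
	struct held_range	*held;		/* all currently held ranges */
};

static bool ranges_overlap(const struct held_range *r,
			   unsigned long start, unsigned long end)
{
	return r->start <= end && start <= r->end;
}

static bool range_is_free(struct range_lock_tree *tree,
			  unsigned long start, unsigned long end)
{
	for (struct held_range *r = tree->held; r; r = r->next)
		if (ranges_overlap(r, start, end))
			return false;
	return true;
}

/* Block until [start, end] overlaps no held range, then record it. */
void range_lock(struct range_lock_tree *tree,
		unsigned long start, unsigned long end)
{
	struct held_range *r = malloc(sizeof(*r));

	r->start = start;
	r->end = end;

	pthread_mutex_lock(&tree->lock);
	while (!range_is_free(tree, start, end))
		pthread_cond_wait(&tree->unlocked, &tree->lock);
	r->next = tree->held;
	tree->held = r;
	pthread_mutex_unlock(&tree->lock);
}

void range_unlock(struct range_lock_tree *tree,
		  unsigned long start, unsigned long end)
{
	pthread_mutex_lock(&tree->lock);
	for (struct held_range **p = &tree->held; *p; p = &(*p)->next) {
		if ((*p)->start == start && (*p)->end == end) {
			struct held_range *dead = *p;

			*p = dead->next;
			free(dead);
			break;
		}
	}
	pthread_cond_broadcast(&tree->unlocked);
	pthread_mutex_unlock(&tree->lock);
}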