On Wed, Aug 07, 2024 at 11:01:36AM +0800, Yafang Shao wrote:
> On Wed, Aug 7, 2024 at 5:52 AM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> > On Tue, Aug 06, 2024 at 10:05:50PM +0800, Yafang Shao wrote:
> > > On Tue, Aug 6, 2024 at 9:24 PM Jan Kara <jack@xxxxxxx> wrote:
> > > > On Tue 06-08-24 19:54:58, Yafang Shao wrote:
> > > > > Its guarantee is clear:
> > > > >
> > > > > : I/O is intended to be atomic to ordinary files and pipes and FIFOs.
> > > > > : Atomic means that all the bytes from a single operation that started
> > > > > : out together end up together, without interleaving from other I/O
> > > > > : operations.
> > > >
> > > > Oh, I understand why XFS does locking this way and I'm well aware this is a requirement in POSIX. However, as you have experienced, it has a significant performance cost for certain workloads (at least with the simple locking protocol we have now), and history shows users would rather have the extra performance at the cost of being a bit more careful in userspace. So I don't see any filesystem switching to the XFS behavior until we have a performant range locking primitive.
> > > >
> > > > > What this flag does is avoid waiting for this type of lock if it exists. Maybe we should consider a more descriptive name like RWF_NOATOMICWAIT, RWF_NOFSLOCK, or RWF_NOPOSIXWAIT? Naming is always challenging.
> > > >
> > > > Aha, OK. So you want the flag to mean "I don't care about POSIX read-write exclusion". I'm still not convinced the flag is a great idea, but RWF_NOWRITEEXCLUSION could perhaps better describe the intent of the flag.
> > >
> > > That's better. Should we proceed with implementing this new flag? It provides users with an option to avoid this type of issue.
> >
> > No. If we are going to add a flag like that, the fix to XFS isn't to use IOCB_NOWAIT on reads, it's to use shared locking for buffered writes just like we do for direct IO.
> >
> > IOWs, this flag would be needed on -writes-, not reads, and at that point we may as well just change XFS to do shared buffered writes for -everyone- so it is consistent with all other Linux filesystems.
> >
> > Indeed, last time Amir brought this up, I suggested that shared buffered write locking in XFS was the simplest way forward. Given that we use large folios now, small IOs get mapped to a single folio and so will still have the same write vs overlapping write exclusion behaviour almost all of the time.
> >
> > However, since then we've moved to using shared IO locking for cloning files. A clone does not modify data, so read IO is allowed during the clone. If we move writes to use shared locking, this breaks file cloning. We would have to move cloning back to using exclusive locking, and that's going to cause performance and IO latency regressions for applications using clones with concurrent IO (e.g. VM image snapshots in cloud infrastructure).
> >
> > Hence the only viable solution to all these different competing "we need exclusive access to a range of the file whilst allowing other concurrent IO" issues is to move to range locking for IO exclusion....
>
> The initial post you mentioned about range locking dates back to 2019, five years ago. Now, five years have passed, and nothing has happened.
>
> In 2029, five years later, someone else might encounter this issue again, and the response will be the same: "let's try range locking."
>
> And then another five years will pass...
>
> So, "range locking == Do nothing."

How long do you think it would take you to understand the entire serialisation model of a complex subsystem, work out where it is deficient, and then design a novel, scalable serialisation technique that addresses that problem with only limited IO performance regressions?

Some context:

It took me 6 years to work out how to do delayed logging in XFS once I first learnt of the idea back in 2004. It took me 4 -from scratch- design and implementation efforts before I found a solution that didn't have a subtle, fatal architectural issue in it. Once I solved the deadlock issues in early 2010, it took about 4 months from first code to being merged in 2.6.35.

It took me 7 years to go from my initial "self describing metadata" idea (2006) to actually having it implemented and fully merged in 2013. I described reverse mapping at the same time, and that took another couple of years to realise.

It took me close to 10 years and 6 or 7 separate attempts to solve the "XFS needs blocking RMW IO in the inode reclaim shrinker to prevent OOM" problem. This caused memory allocation latency problems for many production environments over the course of a decade, and the problem was around that long because it took me an awfully long time to work out how to pin inode cluster buffers in memory without deadlocking inode modification or inode cluster writeback.

These sorts of complex problems are *hard to solve*, and it often takes several failed attempts to learn what -doesn't work- before a successful architecture, design and implementation is realised. Often there is no prior art out there that we can use to help solve the problem.

Sure, for range locking there are heaps of academic papers out there about scaling concurrency in database key operations (SIX, MGL, ARIES/KVL, multi-dimension key-range lock separation, intention locking, speculative lock inheritance, lightweight intent locks, etc), but none of the underlying algorithms have any significant relevance to the problem we need to solve.

There are also papers out there about how to scale concurrent btree modifications: latching, lock coupling, optimistic lock coupling, etc. The problem with these papers is that they often gloss over or ignore important details, such as how they deal with node contention, how concurrent unlocked traversal and modification of the same node are safely handled (i.e. NX concurrency algorithms), how a top-down traversal algorithm that doesn't guarantee temporal path stability is used for bottom-up key update propagation (OLC-based algorithms), etc.

They also tend to focus on huge static data sets where concurrent random operations are guaranteed to have unique paths and minimal contention. Hence the researchers are able to demonstrate how much better their new algorithm scales than the previous state of the art. However, they rarely demonstrate how the algorithm scales down, and that's something we really care about for range locks. A couple of the scalable range indexing algorithms I prototyped simply didn't work for small data sets - they performed far worse than just using a single tree-wide mutex.

Hence we are in a situation where I can't find an algorithm in the existing computer science literature that will work for our problem case, so we need to come up with a novel algorithm ourselves. This is an iterative process where we learn by failing and then understanding why what we did failed.
Davidlohr Bueso attempted to solve the mmap_sem issues with an interval tree based range lock. From that implementation, I learnt that the cost of per-range cacheline misses walking the rb-tree under the tree-wide spin lock was the major performance limitation of that range lock.

I took that observation and attempted to adapt that code to a btree (based on the XFS iext btree). That performed a little better, but removing the cacheline miss penalty from the search algorithm simply moved the bottleneck to the spin lock. i.e. I learnt that we can't use a global scope spin lock to protect the range lock tree - the total number of range lock/unlock operations is bound entirely by how many we can process on a single CPU, because they are all done under a single spinlock. (There's a toy sketch of that shape further down.)

Further, per-inode IO concurrency is unbound, but a single inode-wide serialisation primitive will not scale beyond 3-4 CPUs doing IO at the same time. This taught me that per-IO use of per-inode exclusive cachelines in the IO path needs to be avoided if at all possible.

I made another attempt a year or so later, after doing a lot of reading about scalable btree structures. The second attempt expanded the original btree I used with an OLC-based architecture. I called it an RCU-pathwalk btree because it used RCU and sequence numbers in a very similar way to the dentry cache RCU pathwalk algorithm. This provided range locks with a high-concurrency range tracking structure. In some situations it scaled out to millions of lock/unlock operations per second (far exceeding perf/scalability requirements), but in others performance was miserable - 10x slower than plain exclusive locking. IOWs, another failure. Again, I learnt a lot about what not to do from that attempt.

I've spent a good deal of time in the past two years reading through decades of database key range locking optimisation papers in an effort to understand how I might address the issues I came across. I have also had time to understand why several implementation issues existed and how to avoid/mitigate them in the new design I'm currently working on.

So, yeah, I haven't written any rangelock code in the past couple of years, but:

$ wc -l Documentation/filesystems/xfs-IO-rangelocks.rst
749 Documentation/filesystems/xfs-IO-rangelocks.rst
$

I've been writing up a design doc as I've been doing this analysis and research to document the problems and the solutions. I think I'm close to the point where I can start implementing this new design.

Clearly, I've been busy doing a lot of nothing on rangelocks, thank you very much.

> I'm not saying it's your responsibility to implement range locking, but it seems no one else is capable of implementing this complex feature except you.

*cough*

There are lots of people better qualified than me to solve a problem like this. It's a computer science problem, and I'm not a computer scientist. I'm an engineer - I take stuff that scientists have discovered and documented and build tools, infrastructure and objects based on that knowledge. What I'm not good at is coming up with novel new algorithms to solve mathematical problems. A range lock is the latter, not the former, and there are plenty of people who would be better suited to this work than me. i.e. I'm good at putting knowledge to practical use, not creating new knowledge.

However, I'm spending time on it because nobody else is going to solve the problem for me.
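As an aside, the "everything serialised under one tree-wide lock" shape I mentioned above looks roughly like the toy userspace sketch below. It is purely illustrative - the names are made up, and pthreads plus a linked list stand in for the kernel spin lock and the rb-tree/btree - but it shows why every range lock/unlock operation funnels through the same lock and cacheline regardless of whether the ranges themselves overlap:

/*
 * Toy illustration only, not any of the real prototypes: a byte-range
 * lock where every lock/unlock operation serialises on one tree-wide
 * mutex. Hypothetical names; pthreads and a linked list stand in for
 * kernel primitives and the range tree. The point is that rl->lock is
 * taken (and its cacheline bounced) on every operation, so total
 * lock/unlock throughput is bound by a single CPU no matter how many
 * of the ranges are disjoint.
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>

struct toy_range {
	unsigned long		start, end;	/* inclusive byte range */
	struct toy_range	*next;
};

struct toy_rangelock {
	pthread_mutex_t		lock;		/* the global serialisation point */
	pthread_cond_t		wait;
	struct toy_range	*held;		/* list of granted ranges */
};

static void toy_rangelock_init(struct toy_rangelock *rl)
{
	pthread_mutex_init(&rl->lock, NULL);
	pthread_cond_init(&rl->wait, NULL);
	rl->held = NULL;
}

/* Does [start, end] overlap any currently granted range? Called under rl->lock. */
static bool toy_range_conflicts(struct toy_rangelock *rl,
				unsigned long start, unsigned long end)
{
	struct toy_range *r;

	for (r = rl->held; r; r = r->next)
		if (start <= r->end && end >= r->start)
			return true;
	return false;
}

/* Block until [start, end] does not conflict, then record it as granted. */
static struct toy_range *toy_range_lock(struct toy_rangelock *rl,
					unsigned long start, unsigned long end)
{
	struct toy_range *r = malloc(sizeof(*r));

	if (!r)
		return NULL;
	r->start = start;
	r->end = end;

	pthread_mutex_lock(&rl->lock);
	while (toy_range_conflicts(rl, start, end))
		pthread_cond_wait(&rl->wait, &rl->lock);
	r->next = rl->held;
	rl->held = r;
	pthread_mutex_unlock(&rl->lock);
	return r;
}

static void toy_range_unlock(struct toy_rangelock *rl, struct toy_range *victim)
{
	struct toy_range **p;

	pthread_mutex_lock(&rl->lock);
	for (p = &rl->held; *p; p = &(*p)->next) {
		if (*p == victim) {
			*p = victim->next;
			break;
		}
	}
	/* Wake all waiters; each rechecks for conflicts before proceeding. */
	pthread_cond_broadcast(&rl->wait);
	pthread_mutex_unlock(&rl->lock);
	free(victim);
}

Swapping the list walk for an rb-tree or btree search reduces the time spent under that lock, but the lock itself remains the single point every CPU has to pass through on every IO - which is exactly where the bottleneck moved to in the btree-based attempt described above.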
CPU and IO concurrency is only going up, and shared/exclusive IO locking behaviour only gets more restrictive as concurrency requirements increase and we use extent sharing more extensively to avoid data copying. This problem isn't going away...

-Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx