On Sun, Apr 21, 2019 at 09:54:12AM +1000, Dave Chinner wrote:
> On Thu, Apr 18, 2019 at 11:21:34AM -0700, Davidlohr Bueso wrote:
> > On Thu, 2019-04-18 at 13:10 +1000, Dave Chinner wrote:
> > > Now the stuff I've been working on has the same interface as
> > > Davidlohr's patch, so I can swap and change them without
> > > thinking about it. It's still completely unoptimised, but:
> > >
> > >			IOPS read/write (direct IO)
> > > processes	rwsem		DB rangelock	XFS rangelock
> > >  1		 78k /  78k	 75k /  75k	 72k /  72k
> > >  2		131k / 131k	123k / 123k	133k / 133k
> > >  4		267k / 267k	183k / 183k	237k / 237k
> > >  8		372k / 372k	177k / 177k	265k / 265k
> > >  16		315k / 315k	135k / 135k	228k / 228k
> > >
> > > It's still substantially faster than the interval tree code.
....
> > > /me goes off and thinks more about adding optimistic lock
> > > coupling to the XFS iext btree to get rid of the need for
> > > tree-wide locking altogether
> >
> > I was not aware of this code.
>
> It's relatively new, and directly tailored to the needs of caching
> the XFS extent tree - it's not really a generic btree in that its
> record store format is the XFS on-disk extent record. i.e. it only
> stores 54 bits of start offset and 21 bits of length in its 16 byte
> records, and the rest of the space is for the record data.

So now I have a mostly working OLC btree based on this tree which is
plumbed into xfsprogs userspace and some testing code. I think I can
say now that the code will actually work, and it /should/ scale
better than a rwsem.

The userspace test harness I have ran a "thread profile" to indicate
scalability. Basically, it ran each thread over a different offset
range; each thread locked a hundred ranges, unlocked them all, and
then looped over this. The btree is a single level for the first 14
locks, 2-level for up to 210 locks, and 3-level for up to 3150
locks. Hence most of this testing results in the btree being 2-3
levels deep, which largely removes the global root node lock as a
point of contention. It's the "best case" for concurrency for an OLC
btree. On a 16p machine:

		Range lock/unlock ops/s
threads		mutex btree	OLC btree
 1		5239442		 949487
 2		1014466		1398539
 4		 985940		2405275
 8		 733195		3288435
 16		 653429		2809225

When looking at these numbers, remember that the mutex btree kernel
range lock performed a lot better than the interval tree range lock,
and it was only ~30% down on an rwsem. The mutex btree code shows
cache residency effects for the single threaded load, hence it looks
much faster than it is for occasional and multithreaded access.
However, at 2 threads (where hot CPU caches don't affect the
performance), the OLC btree is 40% faster, and at 8 threads it is
4.5x faster than the mutex btree.

The OLC btree starts slowing down at 16 threads, largely because the
tree itself doesn't have enough depth to provide the interior nodes
needed to scale to higher concurrency levels without contention, but
it's still running 4.5x faster than the mutex btree....
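For anyone who hasn't seen optimistic lock coupling before, the read
side that produces this scaling looks roughly like the sketch below.
To be clear, this is a hand-rolled illustration of the general OLC
technique, not the actual tree code - all the olc_* names are made
up for the example, and olc_find_child() is a hypothetical key
search helper. Readers sample a seqlock-style per-node version, read
the node, then revalidate the version; they only restart when a
writer has locked or changed a node under them, so lookups in
disjoint parts of the tree never write to shared cachelines:

struct olc_node {
	unsigned long	version; /* low bit set == write locked */
	/* keys, child pointers, records, etc */
};

/* Spin until the node is not write locked, return the sampled version. */
static unsigned long olc_read_begin(struct olc_node *node)
{
	unsigned long v;

	while ((v = smp_load_acquire(&node->version)) & 1)
		cpu_relax();
	return v;
}

/* Has the node been locked or changed since olc_read_begin()? */
static bool olc_read_retry(struct olc_node *node, unsigned long v)
{
	smp_rmb();	/* order node reads before the version recheck */
	return READ_ONCE(node->version) != v;
}

/*
 * Descend one level, revalidating the parent before trusting the
 * child pointer we read from it.
 */
static struct olc_node *olc_descend(struct olc_node *parent,
		unsigned long pv, uint64_t key, unsigned long *cv)
{
	struct olc_node *child = olc_find_child(parent, key);

	*cv = olc_read_begin(child);
	if (olc_read_retry(parent, pv))
		return NULL;	/* raced with a writer - restart walk */
	return child;
}

Writers do the same optimistic walk, then write-lock only the nodes
they modify (setting the low version bit) and bump the version on
unlock. That's why interior node contention only shows up when the
tree is too shallow for the thread count.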
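And to make the "thread profile" above concrete, each harness thread
runs something like the loop below. Again, this is a sketch, not the
real harness code - range_lock(), struct range_tree and friends are
placeholder names:

#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define LOCKS_PER_LOOP	100

struct thread_args {
	struct range_tree	*tree;	/* shared range lock tree */
	uint64_t		base;	/* this thread's offset window */
	uint64_t		ops;	/* lock/unlock pairs completed */
};

static volatile bool stop;

static void *lock_loop(void *arg)
{
	struct thread_args *ta = arg;
	int i;

	while (!stop) {
		/* take 100 disjoint range locks in our own window... */
		for (i = 0; i < LOCKS_PER_LOOP; i++)
			range_lock(ta->tree, ta->base + 2 * i, 1);
		/* ...then drop them all and go around again */
		for (i = 0; i < LOCKS_PER_LOOP; i++)
			range_unlock(ta->tree, ta->base + 2 * i, 1);
		ta->ops += LOCKS_PER_LOOP;
	}
	return NULL;
}

The best case/worst case difference below is purely in how the lock
offsets are computed: per-thread windows for the best case, versus
interleaving every thread's locks across the same leaves for the
worst case.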
The best part is when I run worst case threaded workloads on the OLC
btree. If I run the same 100-lock loops, but this time change the
offsets of each thread so they interleave into adjacent records in
the btree (i.e. every thread touches every leaf), then the
performance is still pretty damn good:

		Range lock/unlock ops/s
threads		Worst Case	Best Case
 1		1045991		 949487
 2		1530212		1398539
 4		1147099		2405275
 8		1602114		3288435
 16		1731890		2809225

IOWs, performance is down and somewhat variable around tree height
changes (4 threads straddles the 2-3 level tree height threshold),
but it's still a massive improvement on the mutex btree and it's not
going backwards as threads are added.

Concept proven.

Next steps are:

- separate the OLC btree from the XFS iext btree implementation. It
  will still have a similar interface (i.e. you can't manipulate the
  btree records directly), but there's sufficient difference in
  structure for them to be separate implementations.

- expand records out to full 64 bit extents. The iext tree memory
  usage constraints no longer apply, so the record size can go up a
  little bit.

- work out whether RCU read locking and kfree_rcu() will work with
  the requirement to do memory allocation while holding
  rcu_read_lock(). The alternative is an internal garbage collector
  mechanism, kinda like the one I've hacked up to simulate
  kfree_rcu() in userspace.

- fix all the little bugs that still exist in the code.

- think about structural optimisations like parent pointers to avoid
  costly path walks to find parents for modifications.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx