On Sun, Apr 21, 2019 at 09:54:12AM +1000, Dave Chinner wrote:
> On Thu, Apr 18, 2019 at 11:21:34AM -0700, Davidlohr Bueso wrote:
> > On Thu, 2019-04-18 at 13:10 +1000, Dave Chinner wrote:
> > > Now the stuff I've been working on has the same interface as
> > > Davidlohr's patch, so I can swap and change them without
> > > thinking about it. It's still completely unoptimised, but:
> > >
> > >			IOPS read/write (direct IO)
> > > processes	rwsem		DB rangelock	XFS rangelock
> > >  1		 78k /  78k	 75k /  75k	 72k /  72k
> > >  2		131k / 131k	123k / 123k	133k / 133k
> > >  4		267k / 267k	183k / 183k	237k / 237k
> > >  8		372k / 372k	177k / 177k	265k / 265k
> > >  16		315k / 315k	135k / 135k	228k / 228k
> > >
> > > It's still substantially faster than the interval tree code.
....
> > > /me goes off and thinks more about adding optimistic lock
> > > coupling to the XFS iext btree to get rid of the need for
> > > tree-wide locking altogether
> >
> > I was not aware of this code.
>
> It's relatively new, and directly tailored to the needs of caching
> the XFS extent tree - it's not really a generic btree in that its
> record store format is the XFS on-disk extent record. i.e. it only
> stores 54 bits of start offset and 21 bits of length in its 16 byte
> records, and the rest of the space is for the record data.

So now I have a mostly working OLC btree based on this tree which is
plumbed into xfsprogs userspace and some testing code. I think I can
say now that the code will actually work, and it /should/ scale
better than a rwsem.

The userspace test harness I have ran a "thread profile" to indicate
scalability. Basically, it ran each thread over a different offset
range; each thread locked a hundred ranges, unlocked them all, and
then looped over this. The btree is a single level for the first 14
locks, 2-level for up to 210 locks, and 3-level for up to 3150
locks. Hence most of this testing results in the btree being 2-3
levels deep, which largely removes the global root node lock as a
point of contention. It's the "best case" for concurrency for an OLC
btree. On a 16p machine:

		Range lock/unlock ops/s
threads		mutex btree	OLC btree
 1		5239442		 949487
 2		1014466		1398539
 4		 985940		2405275
 8		 733195		3288435
 16		 653429		2809225

When looking at these numbers, remember that the mutex btree kernel
range lock performed a lot better than the interval tree range lock,
and it was only ~30% down on an rwsem. The mutex btree code shows
cache residency effects for the single threaded load, hence it looks
much faster than it is for occasional and multithreaded access.
However, at 2 threads (where hot CPU caches don't affect the
performance), the OLC btree is 40% faster, and at 8 threads it is
4.5x faster than the mutex btree.

The OLC btree starts slowing down at 16 threads, largely because the
tree itself doesn't have enough depth to provide the interior nodes
needed to scale to higher concurrency levels without contention, but
it's still running 4.5x faster than the mutex btree....
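For anyone who hasn't seen optimistic lock coupling before, the read
side that produces this scaling looks roughly like the sketch below.
To be clear, this is a hand-rolled illustration of the general OLC
technique, not the actual tree code - all the olc_* names are made
up for the example, and olc_find_child() is a hypothetical key
search helper. Readers sample a seqlock-style per-node version, read
the node, then revalidate the version; they only restart when a
writer has locked or changed a node under them, so lookups in
disjoint parts of the tree never write to shared cachelines:

struct olc_node {
	unsigned long	version; /* low bit set == write locked */
	/* keys, child pointers, records, etc */
};

/* Spin until the node is not write locked, return the sampled version. */
static unsigned long olc_read_begin(struct olc_node *node)
{
	unsigned long v;

	while ((v = smp_load_acquire(&node->version)) & 1)
		cpu_relax();
	return v;
}

/* Has the node been locked or changed since olc_read_begin()? */
static bool olc_read_retry(struct olc_node *node, unsigned long v)
{
	smp_rmb();	/* order node reads before the version recheck */
	return READ_ONCE(node->version) != v;
}

/*
 * Descend one level, revalidating the parent before trusting the
 * child pointer we read from it.
 */
static struct olc_node *olc_descend(struct olc_node *parent,
		unsigned long pv, uint64_t key, unsigned long *cv)
{
	struct olc_node *child = olc_find_child(parent, key);

	*cv = olc_read_begin(child);
	if (olc_read_retry(parent, pv))
		return NULL;	/* raced with a writer - restart walk */
	return child;
}

Writers do the same optimistic walk, then write-lock only the nodes
they modify (setting the low version bit) and bump the version on
unlock. That's why interior node contention only shows up when the
tree is too shallow for the thread count.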
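And to make the "thread profile" above concrete, each harness thread
runs something like the loop below. Again, this is a sketch, not the
real harness code - range_lock(), struct range_tree and friends are
placeholder names:

#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define LOCKS_PER_LOOP	100

struct thread_args {
	struct range_tree	*tree;	/* shared range lock tree */
	uint64_t		base;	/* this thread's offset window */
	uint64_t		ops;	/* lock/unlock pairs completed */
};

static volatile bool stop;

static void *lock_loop(void *arg)
{
	struct thread_args *ta = arg;
	int i;

	while (!stop) {
		/* take 100 disjoint range locks in our own window... */
		for (i = 0; i < LOCKS_PER_LOOP; i++)
			range_lock(ta->tree, ta->base + 2 * i, 1);
		/* ...then drop them all and go around again */
		for (i = 0; i < LOCKS_PER_LOOP; i++)
			range_unlock(ta->tree, ta->base + 2 * i, 1);
		ta->ops += LOCKS_PER_LOOP;
	}
	return NULL;
}

The best case/worst case difference below is purely in how the lock
offsets are computed: per-thread windows for the best case, versus
interleaving every thread's locks across the same leaves for the
worst case.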
The best part is when I run worst case threaded workloads on the OLC
btree. If I run the same 100-lock loops, but this time change the
offsets of each thread so they interleave into adjacent records in
the btree (i.e. every thread touches every leaf), then the
performance is still pretty damn good:

		Range lock/unlock ops/s
threads		Worst Case	Best Case
 1		1045991		 949487
 2		1530212		1398539
 4		1147099		2405275
 8		1602114		3288435
 16		1731890		2809225

IOWs, performance is down and somewhat variable around tree height
changes (4 threads straddles the 2-3 level tree height threshold),
but it's still a massive improvement on the mutex btree and it's not
going backwards as threads are added.

Concept proven.

Next steps are:

- separate the OLC btree from the XFS iext btree implementation. It
  will still have a similar interface (i.e. you can't manipulate the
  btree records directly), but there's sufficient difference in
  structure for them to be separate implementations.

- expand records out to full 64 bit extents. The iext tree memory
  usage constraints no longer apply, so the record size can go up a
  little bit.

- work out whether RCU read locking and kfree_rcu() will work with
  the requirement to do memory allocation while holding
  rcu_read_lock(). The alternative is an internal garbage collector
  mechanism, kinda like the one I've hacked up to simulate
  kfree_rcu() in userspace.

- fix all the little bugs that still exist in the code.

- think about structural optimisations like parent pointers to avoid
  costly path walks to find parents for modifications.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx