Re: [PATCH RFC 0/2] rcu skiplists v2

* Chris Mason (clmason@xxxxxxxxxxxx) wrote:
> Quoting Mathieu Desnoyers (2013-06-26 19:02:18)
> > > The IOMMU part of the patch is updated slightly, but still not using all
> > > the new bits in the API.  This is mostly because the IOMMU part is going
> > > to change later on and I'm really only using it for testing now.
> > > 
> > > For benchmarking, the IOMMU numbers are slightly higher than last time,
> > > but not more than 5% or so.  This is because the bottleneck is still
> > > skiplist_insert_hole(), which I haven't really changed in this round.
> > > 
> > > Most of my benchmarking now is with the skiplist_test module, which has
> > > expanded to a few new cases.  For random operations, my performance is
> > > slightly slower than rbtree when single threaded.
> > > 
> > > Random lookups on 10 million items: (inserts are similar)
> > > 
> > > rbtree       random lookup time 4 s 327 ms
> > > skiplist-rcu random lookup time 5 s 175 ms 
> > 
> > These benchmarks are only considering averages.
> > 
> > I'm worried that the skiplist approach has chances to create long chains
> > (probabilistically non-zero). Clearly, the lack of bound is something
> > the RT guys might frown upon. I'd be curious to see how long the worst
> > chains encountered can be on, say, 1000 machines stress-testing this for
> > a week.
> 
> Correct, the probability part may run into problems for RT.  I haven't
> put in instrumentation for worst case.  I think my retries (goto again)
> for concurrent update are a bigger RT problem, but they can be cut down
> significantly.

My RCU Judy array implementation uses a similar retry trick: for a
delete operation, I first do an RCU lookup of the node I need to delete,
and then take the mutexes protecting each internal node that needs to
be modified. Once each mutex is taken, I validate that the nodes just
locked did not change between lookup and mutex acquisition. If any of
them has been updated concurrently, every lock is released and we retry
the entire operation. Indeed, this would be bad for RT. One approach RT
could take is to grab a global mutex around each insert/delete operation
on the data structure. That would improve RT behavior at the expense of
update scalability.
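
For reference, here is a minimal sketch of that lookup/lock/validate/
retry pattern, written against liburcu primitives. The judy_* helpers,
the struct names and the "removed" flag are made-up names for
illustration, not the actual rcuja API:

#include <urcu.h>
#include <pthread.h>
#include <errno.h>

struct judy_head;			/* opaque; hypothetical */

struct judy_node {
	pthread_mutex_t lock;
	int removed;			/* set under lock once unlinked */
	/* key, children, ... */
};

/* Hypothetical helpers standing in for the real rcuja internals. */
struct judy_node *judy_lookup(struct judy_head *head, unsigned long key);
void judy_unlink(struct judy_head *head, struct judy_node *node);

static int judy_delete(struct judy_head *head, unsigned long key)
{
	struct judy_node *node;

again:
	rcu_read_lock();
	node = judy_lookup(head, key);	/* RCU lookup of the node to delete */
	if (!node) {
		rcu_read_unlock();
		return -ENOENT;
	}
	/* Blocking on a mutex inside a read-side critical section is
	 * allowed with liburcu (it only delays grace periods). */
	pthread_mutex_lock(&node->lock);
	rcu_read_unlock();

	/* Validation: did a concurrent updater remove this node between
	 * the lookup and the mutex acquisition? */
	if (node->removed) {
		pthread_mutex_unlock(&node->lock);
		goto again;		/* drop everything and retry */
	}

	/* The real code also locks and validates each internal node it
	 * needs to modify, unlinks the node, then frees it after a grace
	 * period with call_rcu(). */
	judy_unlink(head, node);
	node->removed = 1;
	pthread_mutex_unlock(&node->lock);
	return 0;
}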

The problem with long chains is that they hurt RCU lookups too, and I
don't see any workaround for that, given it is inherent to the skip
list design.
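
To make the "probabilistically non-zero" point concrete: in a textbook
skip list with promotion probability 1/2 (I have not checked which
parameters the skiplist-rcu code uses), a given node reaches level k
with probability 2^-(k-1), so with n nodes the chance that some tower
grows beyond log2(n) + c levels is roughly 2^-c. That tail shrinks
fast, but it never reaches zero, which is why a hard worst-case bound
is difficult to state.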

> 
> > 
> > FWIW, my benchmark of RCU Judy array with ranges in similar conditions:
> > 
> > - Using 32-bit key length
> > - I first populated 10M ranges of len = 1, sequentially.
> > - Then, I run a reader threads for 10s, which perform random lookups
> >   (all successful) in keys from 0 to 10M.
> 
> Similar, I had 64 bit keys and the lookups were totally random (not all
> successful).  I doubt it matters too much for these numbers.

I'd have to try with 64-bit keys, since key size matters for RCU Judy:
a successful lookup needs to read twice as many cache lines as with
32-bit keys. For my range implementation (on top of Judy), every lookup
ends up succeeding, because it finds either an "empty" range or a
populated range, so match versus non-match does not matter much for
range lookups.
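
To put rough numbers on the cache line argument: assuming, say, 8 bits
of key consumed per Judy level (purely for illustration; the actual
rcuja node compression changes the exact counts), a 32-bit key is
resolved in about 4 internal-node traversals and a 64-bit key in about
8, hence roughly twice as many distinct cache lines touched per
successful lookup.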

> 
> Also, my benchmarks were not just inserting keys but keys pointing to
> things.  So a lookup walked the tree and found an object and then
> returned the object.  radix can just return a key/value without
> dereferencing the value, but that wasn't the case in my runs.

In the specific test I ran, I'm looking up the "range" object, which is
the dereferenced "value" pointer in terms of Judy lookup. My Judy array
implementation represents items as a linked list of structures matching
a given key. This linked list is embedded within the structures,
similarly to the linux/list.h API. Then, if the lookup succeeds, I take
a mutex on the range and check whether it has been concurrently removed.
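
As a rough illustration of that layout (hypothetical field and type
names, not the actual rcuja structures), the range object with its
embedded list node looks something like this:

#include <stdint.h>
#include <pthread.h>
#include <urcu/list.h>	/* cds_list_head: the linux/list.h-style list in liburcu */

struct range {
	struct cds_list_head list;	/* embedded: chains ranges sharing a Judy key */
	uint64_t start;
	uint64_t len;
	pthread_mutex_t lock;		/* taken after a successful RCU lookup */
	int removed;			/* set under lock when the range is deleted */
	void *priv;			/* the "private data" pointer mentioned below */
};

A reader finds the range under rcu_read_lock(), takes range->lock, and
retries (or fails) if "removed" is set.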

The only thing I'm not currently doing is dereferencing the "private
data" pointer I have in the range. Eventually, I could even embed the
range structure within another structure to use inheritance to save a
pointer indirection. But I wanted to get the basic things to work first
before going too far into optimisation land.

> 
> > > Single threaded rbtree does consistently win across all sizes of lists:
> > > 
> > > skiplist-rcu thread 0 time 0 s 287 ms
> > > rbtree thread       0 time 0 s 215 ms
> > 
> > Are the 287ms and 215ms numbers you get for skiplist-rcu and rbtree the
> > total for 10M add/delete, or is it per operation, or is it a different
> > number of operations altogether ? Also, how many items are present in
> > the data structures in average ?
> 
> The 287 and 215ms were total run time for the mixed bag of operations.
> It had lookups, single delete then re-insert and bulk (128 keys) delete
> then re-insert.  There were always the same number of items present
> (10M).

When I find time, I'll try to have a closer look at this sequence of
operations, and see if I can reproduce it.

> 
> It looks like you have some strong numbers here, especially considering
> that my test was all in kernel vs userland rcu.

I'm glad to see that Userspace RCU seems to closely match the kernel RCU
performance (well, at least up to 8 cores) ;-) I'm pretty sure we'll
eventually run into the same scalability issues that led to the design
of the Linux kernel Tree RCU implementation, though.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com