Re: [PATCH RFC 0/2] rcu skiplists v2

Hi Chris,

* Chris Mason (clmason@xxxxxxxxxxxx) wrote:
> Hi everyone,
> 
> This is another iteration of my skiplists patch, with some bugs fixed
> and a few missing parts of the API done.  The biggest change is in the
> insertion code, where I now link from the bottom up once I find the
> proper insertion point.  This makes it possible to write custom
> insertion routines that allow duplicate keys.
> 
> It also means insertion doesn't have to lock and track all the nodes
> from the top as it searches.  In the likely event that we're able to
> insert into a free spot in an existing leaf, it only needs to take one
> lock.

Nice, I did pretty much the same for RCU Judy arrays: insertion and
deletion take only the localized locks required, nesting from bottom
to top.

> The IOMMU part of the patch is updated slightly, but still not using all
> the new bits in the API.  This is mostly because the IOMMU part is going
> to change later on and I'm really only using it for testing now.
> 
> For benchmarking, the IOMMU numbers are slightly higher than last time,
> but not more than 5% or so.  This is because the bottleneck is still
> skiplist_insert_hole(), which I haven't really changed in this round.
> 
> Most of my benchmarking now is with the skiplist_test module, which has
> expanded to a few new cases.  For random operations, my performance is
> slightly slower than rbtree when single threaded.
> 
> Random lookups on 10 million items: (inserts are similar)
> 
> rbtree       random lookup time 4 s 327 ms
> skiplist-rcu random lookup time 5 s 175 ms 

These benchmarks only consider averages.

I'm worried that the skiplist approach can create long chains (with
probabilistically non-zero chance). Clearly, the lack of a bound is
something the RT guys might frown upon. I'd be curious to see how long
the worst chains encountered can get on, say, 1000 machines
stress-testing this for a week.

FWIW, my benchmark of RCU Judy array with ranges in similar conditions:

- Using 32-bit key length
- I first populated 10M ranges of len = 1, sequentially.
- Then, I run a reader threads for 10s, which perform random lookups
  (all successful) in keys from 0 to 10M.

The average lookup time is (on a per-core basis):

- single-thread: 522 ns / lookup (  19122393 lookups in 10s)
- 2 threads:     553 ns / lookup ( ~36134994 lookups in 10s on 2 cores)
- 8 threads:     696 ns / lookup (~114854104 lookups in 10s on 8 cores)

We get an efficiency of 0.79 when going from 2 to 8 threads (efficiency
of 1 being linear scalability).

Performed on this HW:
8-core
model name      : Intel(R) Xeon(R) CPU           E5405  @ 2.00GHz

From your skiplist-rcu numbers, I gather it takes about 517 ns/lookup
(avg), and rbtree about 432 ns/lookup. I'd be curious to test all these
alternatives in the same test bench.

> 
> The new API makes it easier to do sequential operations.  Here is a walk
> through the 10 million item list:
> 
> skiplist-rcu check time 0 s 79 ms
> rbtree check time      0 s 295 ms
> 
> And sequential insert:
> 
> skiplist-rcu fill time 1 s 599 ms
> rbtree fill time      1 s 875 ms
> 
> The benchmark does random lookup/deletion/insertion rounds.  Most of the
> operations done are either deletion or insertion, so I'm not skewing the
> results with the easy rcu lookup operation.

We could do helpers for sequential traversal of ranges; they are not
implemented in the RCU Judy array yet, though.

> 
> Single threaded rbtree does consistently win across all sizes of lists:
> 
> skiplist-rcu thread 0 time 0 s 287 ms
> rbtree thread       0 time 0 s 215 ms

Are the 287ms and 215ms numbers you get for skiplist-rcu and rbtree the
total for 10M add/delete, or is it per operation, or is it a different
number of operations altogether? Also, how many items are present in
the data structures on average?


> But once we go up to two threads, skiplists win:
> 
> skiplist-rcu thread 1 time 0 s 299 ms
> rbtree thread 1 time       0 s 499 ms
> 
> At 8 threads, the skiplists don't get linear scaling, but it's really
> pretty good.  Since I'm doing random operations on a wide key space,
> the locking skiplist variant is also fast:
> 
> skiplist-rcu     thread 7 time 0 s 379 ms
> skiplist-locking thread 7 time 0 s 425 ms
> rbtree           thread 7 time 3 s 711 ms

Hopefully my setup is not too different from yours. Here is my
benchmark of the RCU Judy array insert operations. Please note that I
have not focused on optimizing the update-side, I've made design choices
aimed towards making lookups as fast as possible, sometimes at the
expense of update overhead.

I used a keyspace of 400M, so that the number of inserts that end up
being simple RCU lookups is insignificant. I'm still using a 32-bit
key length. I'm adding ranges of length 1.

Here is the average time taken per insertion (on a per-core basis):

- single-thread: 15000 ns/insert (  645474 inserts in 10s)
- 2 threads:     17000 ns/insert (~1188232 inserts in 10s on 2 cores)
- 8 threads:     25000 ns/insert (~3161024 inserts in 10s on 8 cores)

We get an efficiency of 0.66 when going from 2 to 8 threads (efficiency
of 1 being linear scalability).

> 
> My test machine is 8 cores, so at 16 threads we're into HT:
> 
> skiplist-rcu thread     15 time 0 s 583 ms
> skiplist-locking thread 15 time 0 s 559 ms
> rbtree thread           15 time 7 s 423 ms

Skipping this test, since my test machine does not have hyperthreading.

> 
> It's not all sunshine though.  If I shrink the keyspace down to 1000
> keys or less, there is more contention on the node locks and we're tied
> (or slightly worse) than rbtree.  In that kind of workload it makes
> sense to add a big fat lock around the skiplist, or just stick with
> rbtrees.

I'd expect a similar behavior with RCU Judy arrays with respect to a
small keyspace (a single lock around updates might perform better than
locks distributed within the data structure). However, even with a
small keyspace (and few nodes), I expect that RCU Judy Array and
skiplist-rcu will win over non-RCU rbtree for lookup speed when we have
concurrent updates and lookups from many cores.

Feedback is welcome,

Thanks!

Mathieu

> 
> The skiplists do have locking and rcu variants of lookup operations.
> The locking ones are built on top of the rcu ones, and you can use them
> both at the same time.
> 
> Patches follow.
> 
> -chris
> 

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com



