* Dave Chinner (david@xxxxxxxxxxxxx) wrote:
> On Wed, Jun 26, 2013 at 10:29:36PM -0400, Mathieu Desnoyers wrote:
> > * Chris Mason (clmason@xxxxxxxxxxxx) wrote:
> > > Quoting Mathieu Desnoyers (2013-06-26 19:02:18)
> > > > FWIW, my benchmark of RCU Judy array with ranges in similar conditions:
> > > >
> > > > - Using 32-bit key length
> > > > - I first populated 10M ranges of len = 1, sequentially.
> > > > - Then, I ran reader threads for 10s, which perform random lookups
> > > >   (all successful) in keys from 0 to 10M.
> > >
> > > Similar, I had 64 bit keys and the lookups were totally random (not all
> > > successful). I doubt it matters too much for these numbers.
> >
> > I'd have to try with 64-bit keys, since it matters for RCU Judy. It
> > means a successful lookup will need to read twice as many cache lines
> > as for 32-bit keys. For my range implementation (on top of Judy),
> > every lookup ends up succeeding, because it finds either an "empty"
> > range or a populated range, so match vs. non-match does not matter
> > much for range lookups.
>
> Yeah, I only care about performance with 64 bit keys, sparse
> keyspace population and random extent lengths. Random lookup
> performance is more important than insert and delete, though I do
> have cases where bulk sequential insert and removal are important,
> too.

One thing I noticed about 64-bit keys in Judy, though: suppose we only
use part of the key space (e.g. the lower 32 bits). Even with a 64-bit
key lookup, we always touch the same cache lines for the top-level
nodes, so the number of cache lines we need to bring in from memory
will be quite close to that of a Judy array with 32-bit keys. I'll
have to confirm this with benchmarks, though.

> > > Also, my benchmarks were not just inserting keys but keys pointing to
> > > things. So a lookup walked the tree and found an object and then
> > > returned the object.
> > > radix can just return a key/value without
> > > dereferencing the value, but that wasn't the case in my runs.
> >
> > In the specific test I ran, I'm looking up the "range" object, which
> > is the dereferenced "value" pointer in terms of the Judy lookup. My
> > Judy array implementation represents items as a linked list of
> > structures matching a given key. This linked list is embedded within
> > the structures, similarly to the linux/list.h API. Then, if the
> > lookup succeeds, I take a mutex on the range and check whether it has
> > been concurrently removed.
>
> Does that mean that each "extent" that is indexed has a list head
> embedded in it? That blows the size of the index out when all I
> might want to store in the tree is a 64 bit value for a block
> mapping...

My implementation currently chains duplicates for genericity. I'm
keeping it simple (no special cases) on purpose until we find out
whether the Judy approach is interesting at all.

We could quite easily create a variant of the RCU Judy array augmented
with range support that has this as its node:

	/* The 64-bit start-of-range value is implicit within the Judy array */
	struct {
		uint64_t end;		/* end of range */
		spinlock_t lock;	/* lock updates on range */
		struct rcu_head head;	/* if call_rcu() is required for reclaim */
		unsigned int flags:2;	/* range is free, allocated, or removed */
	};

Depending on what you are willing to give up in terms of scalability
and RCU reclaim batching, the lock and rcu_head could be removed. This
ends up being a trade-off between update scalability and memory
footprint.
So if you go all the way for low memory footprint, with a single lock
covering all updates, and use synchronize_rcu() to perform reclaim,
you end up with:

	/* The 64-bit start-of-range value is implicit within the Judy array */
	struct {
		uint64_t end;		/* end of range */
		unsigned int flags:2;	/* range is free, allocated, or removed */
	};

We could probably encode the flags into unused low-order bits of the
pointer to the range if needed.

> FWIW, when a bunch of scalability work was done on xfs_repair years
> ago, judy arrays were benchmarked for storing extent lists that
> tracked free/used space. We ended up using a btree, because while it
> was slower than the original bitmap code, it was actually faster
> than the highly optimised judy array library and at the scale we
> needed there was no memory usage advantage to using a judy array,
> either...
>
> So I'm really starting to wonder if it'd be simpler for me just to
> resurrect the old RCU friendly btree code Peter Z wrote years ago
> (http://programming.kicks-ass.net/kernel-patches/vma_lookup/) and
> customise it for the couple of uses I have in XFS....

Balanced tree structures end up with contention near the root if you
want to perform concurrent updates on them. The advantage of RCU skip
lists and Judy arrays over trees is that no rebalancing is required,
so updates can proceed concurrently when they touch different areas of
the key space.

Thanks,

Mathieu

> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@xxxxxxxxxxxxx

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel"
in the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html