Hi everyone,

I've been working on skiplists for indexing off and on for a little while now.  For btrfs, I really need a fast way to manage extents that doesn't bog down in lock contention for reads or writes.  Dave Chinner has mentioned this as well, so I've cc'd him.

Liu Bo started this skiplist code, but at the time it didn't make a huge difference.  I've reworked it a bit and used the IOMMU code to test things.  I ran some benchmarks on a single socket server with two SSD cards to get a baseline of how much it is helping.

I tried to keep the basic structure of the IOMMU code, and there are probably many better ways to fix the IOMMU contention.  Really I'm just trying to compare with rbtrees, not fix the IOMMU.

Basic numbers (all aio/dio):

Streaming writes
    IOMMU off   2,575MB/s
    skiplist    1,715MB/s
    rbtree      1,659MB/s

    Not a huge improvement, but the CPU time was lower.

16 threads, iodepth 10, 20MB random reads
    IOMMU off   2,861MB/s (mostly idle)
    skiplist    2,484MB/s (100% system time)
    rbtree         99MB/s (100% system time)

16 threads, iodepth 10, 20MB random writes
    IOMMU off   2,548MB/s
    skiplist    1,649MB/s
    rbtree         33MB/s

I ran this a bunch of times to make sure I got the rbtree numbers right.  Lowering the thread count did improve the rbtree performance, but the best I could do was around 300MB/s.  For skiplists, all of the CPU time is being spent in skiplist_insert_hole, which has a lot of room for improvement.

From here the code needs a lot of testing, and I need to fill out the API to make it a little easier to use.  But I wanted to send this around for comments and to see how others might want to use things.  More details on all the internals are below.

It starts with a basic skiplist implementation where:

* skiplist nodes are dynamically sized.  There is a slab cache for each node height, so every node is actively using all of the pointers allocated for it.

* Each node has a back pointer as well as a forward pointer.

* Each skiplist node stores an array of pointers to items instead of a single item.  This is a huge speed improvement because the matching array of keys (described below) is much more cache friendly.

Since the array of item pointers is something of an extension to the basic skiplist, I've separated things out into two structs: sl_node, which is just the indexing pointers, and sl_leaf, which has an sl_node plus the array of item pointers.  There's no cache benefit to the leaves if I'm just storing an array of pointers though, so each leaf also has an array of keys:

	unsigned long keys[SKIP_KEYS_PER_NODE];
	struct sl_slot *ptrs[SKIP_KEYS_PER_NODE];

The keys array is used for the first search pass, and based on that result I narrow things down and only have to follow one or two pointers from the corresponding ptrs array.

The sl_slot is very simple:

	struct sl_slot {
		unsigned long key;
		unsigned long size;
	};

The idea is to embed the sl_slot in your own struct, giving us something like this:

              sl_leaf
         ________________
        | node ptr N     |
        |   ....         |
        | node ptr 0     |
        |________________|
        |   |        |   |
        |   |  keys  |   |
        |___|________|___|
        |   |        |   |
        | . |  ptrs  | . |
        |___|________|___|
          /            \
         /              \
     -------------   -------------
     | sl_slot 0 |   | sl_slot N |
     |           |   |           |
     |   data    |   |   data    |
     -------------   -------------
              your struct

This basic setup gives us similar performance to rbtrees, but uses less memory.  The performance varies (a lot, really) with the size of the keys array.

Locking is a mixture of RCU and per-node locking.  For searches, it can be pure RCU, or it can use the per-node lock to synchronize the final search through the last leaf.  But everything from the skiplist head to that final leaf is RCU either way.
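To show what I mean by embedding, here is a rough sketch of what a caller might look like.  The list head type and the search function below are just placeholders for whatever the final API ends up being:

	/* hypothetical caller-side struct, with the sl_slot embedded */
	struct my_extent {
		unsigned long start;
		unsigned long len;
		struct sl_slot slot;	/* slot.key/slot.size mirror start/len */
	};

	/*
	 * Placeholder lookup: an RCU walk from the head down to the right
	 * leaf, then a scan of that leaf's keys[] array.
	 */
	static struct my_extent *lookup_extent(struct sl_list *list,
					       unsigned long key)
	{
		struct sl_slot *slot;
		struct my_extent *found = NULL;

		rcu_read_lock();
		slot = skiplist_search(list, key);	/* placeholder name */
		if (slot)
			found = container_of(slot, struct my_extent, slot);
		rcu_read_unlock();

		/*
		 * Keeping 'found' alive past this point is up to the
		 * caller (refcount, outer lock, etc).
		 */
		return found;
	}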
Insert locking is slightly different.  Before the insert is started, a new node is preallocated.  The height of this node is used to pick the level where we start locking nodes during the insert.  Much of the time we're able to get through huge portions of the list without any locks.

For deletes, we do an RCU search for the leaf and hold the per-node lock while we remove a specific slot.  If the leaf ends up empty, it is marked as dead and then unlinked from the skiplist level by level.

All of the locking and retries do slow things down a bit, but the benefits for concurrent access make a big difference.
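Going back to the insert path, the caller side would look roughly like this (again with placeholder function names, and reusing the my_extent struct from the sketch above):

	static int insert_extent(struct sl_list *list,
				 struct my_extent *extent, gfp_t gfp)
	{
		struct sl_leaf *leaf;
		int ret;

		extent->slot.key = extent->start;
		extent->slot.size = extent->len;

		/*
		 * Preallocate the leaf before starting.  Its (randomly
		 * chosen) height picks the level where the insert starts
		 * taking per-node locks; everything above that level is
		 * walked with RCU only.
		 */
		leaf = skiplist_alloc_leaf(gfp);	/* placeholder name */
		if (!leaf)
			return -ENOMEM;

		ret = skiplist_insert(list, &extent->slot, leaf);
		if (ret)
			skiplist_free_leaf(leaf);	/* placeholder name */
		return ret;
	}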