Hi everyone,

I've been working on skiplists for indexing off and on for a little while now.  For btrfs, I really need a fast way to manage extents that doesn't bog down in lock contention for reads or writes.  Dave Chinner has mentioned this as well, so I've cc'd him.

Liu Bo started this skiplist code, but at the time it didn't make a huge difference.  I've reworked it a bit and used the IOMMU code to test things.  I ran some benchmarks on a single socket server with two SSD cards to get a baseline of how much it is helping.

I tried to keep the basic structure of the IOMMU code, and there are probably many better ways to fix the IOMMU contention.  Really I'm just trying to compare with rbtrees, not fix the IOMMU.

Basic numbers (all aio/dio):

Streaming writes
    IOMMU off   2,575MB/s
    skiplist    1,715MB/s
    rbtree      1,659MB/s

    Not a huge improvement, but the CPU time was lower.

16 threads, iodepth 10, 20MB random reads
    IOMMU off   2,861MB/s (mostly idle)
    skiplist    2,484MB/s (100% system time)
    rbtree         99MB/s (100% system time)

16 threads, iodepth 10, 20MB random writes
    IOMMU off   2,548MB/s
    skiplist    1,649MB/s
    rbtree         33MB/s

I ran this a bunch of times to make sure I got the rbtree numbers right.  Lowering the thread count did improve the rbtree performance, but the best I could do was around 300MB/s.  For skiplists, all of the CPU time is being spent in skiplist_insert_hole, which has a lot of room for improvement.

From here the code needs a lot of testing, and I need to fill out the API to make it a little easier to use.  But I wanted to send this around for comments and to see how others might want to use things.  More details on all the internals are below.

It starts with a basic skiplist implementation where:

* skiplist nodes are dynamically sized.  There is a slab cache for each node height, so every node is actively using all of the pointers allocated for it.

* Each node has a back pointer as well as a forward pointer.

* Each skiplist node stores an array of pointers to items instead of a single item.  This is a huge speed improvement because the matching array of keys (described below) is much more cache friendly.

Since the array of item pointers is something of an extension to the basic skiplist, I've separated things out into two structs: sl_node, which is just the indexing pointers, and sl_leaf, which has an sl_node plus the array of item pointers.  There's no cache benefit to the leaves if I'm just storing an array of pointers though, so each leaf also has an array of keys:

	unsigned long keys[SKIP_KEYS_PER_NODE];
	struct sl_slot *ptrs[SKIP_KEYS_PER_NODE];

The keys array is used for the first search pass, and based on that result I narrow things down and only have to follow one or two pointers from the corresponding ptrs array.

The sl_slot is very simple:

	struct sl_slot {
		unsigned long key;
		unsigned long size;
	};

The idea is to embed the sl_slot in your own struct, giving us something like this:

              sl_leaf
         ________________
        | node ptr N     |
        |   ....         |
        | node ptr 0     |
        |________________|
        |   |        |   |
        |   |  keys  |   |
        |___|________|___|
        |   |        |   |
        | . |  ptrs  | . |
        |___|________|___|
          /            \
         /              \
     -------------   -------------
     | sl_slot 0 |   | sl_slot N |
     |           |   |           |
     |   data    |   |   data    |
     -------------   -------------
              your struct

This basic setup gives us similar performance to rbtrees, but uses less memory.  The performance varies (a lot, really) with the size of the keys array.

Locking is a mixture of RCU and per-node locking.  For searches, it can be pure RCU, or it can use the per-node lock to synchronize the final search through the last leaf.  But everything from the skiplist head to that final leaf is RCU either way.
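To show what I mean by embedding, here is a rough sketch of what a caller might look like.  The list head type and the search function below are just placeholders for whatever the final API ends up being:

	/* hypothetical caller-side struct, with the sl_slot embedded */
	struct my_extent {
		unsigned long start;
		unsigned long len;
		struct sl_slot slot;	/* slot.key/slot.size mirror start/len */
	};

	/*
	 * Placeholder lookup: an RCU walk from the head down to the right
	 * leaf, then a scan of that leaf's keys[] array.
	 */
	static struct my_extent *lookup_extent(struct sl_list *list,
					       unsigned long key)
	{
		struct sl_slot *slot;
		struct my_extent *found = NULL;

		rcu_read_lock();
		slot = skiplist_search(list, key);	/* placeholder name */
		if (slot)
			found = container_of(slot, struct my_extent, slot);
		rcu_read_unlock();

		/*
		 * Keeping 'found' alive past this point is up to the
		 * caller (refcount, outer lock, etc).
		 */
		return found;
	}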
Insert locking is slightly different.  Before the insert is started, a new node is preallocated.  The height of this node is used to pick the level where we start locking nodes during the insert.  Much of the time we're able to get through huge portions of the list without any locks.

For deletes, we do an RCU search for the leaf and hold the per-node lock while we remove a specific slot.  If the leaf ends up empty, it is marked as dead and then unlinked from the skiplist level by level.

All of the locking and retries do slow things down a bit, but the benefits for concurrent access make a big difference.
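Going back to the insert path, the caller side would look roughly like this (again with placeholder function names, and reusing the my_extent struct from the sketch above):

	static int insert_extent(struct sl_list *list,
				 struct my_extent *extent, gfp_t gfp)
	{
		struct sl_leaf *leaf;
		int ret;

		extent->slot.key = extent->start;
		extent->slot.size = extent->len;

		/*
		 * Preallocate the leaf before starting.  Its (randomly
		 * chosen) height picks the level where the insert starts
		 * taking per-node locks; everything above that level is
		 * walked with RCU only.
		 */
		leaf = skiplist_alloc_leaf(gfp);	/* placeholder name */
		if (!leaf)
			return -ENOMEM;

		ret = skiplist_insert(list, &extent->slot, leaf);
		if (ret)
			skiplist_free_leaf(leaf);	/* placeholder name */
		return ret;
	}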