Hi,

On Thu 02-05-13 22:02:11, Chris Mason wrote:
> I've been working on skiplists for indexing off/on for a little while
> now. For btrfs, I really need a fast way to manage extents that doesn't
> bog down in lock contention for reads or writes. Dave Chinner has
> mentioned this as well, so I've cc'd him.

But I guess you still need writer-writer exclusion, don't you? Or are
you able to do some partial locking of the structure to allow parallel
modifications?

> Liu Bo started this skiplist code, but at the time it didn't make
> a huge difference. I've reworked things a bit and used the IOMMU
> to test things. I ran some benchmarks on a single-socket server with
> two SSD cards to get a baseline of how much it is helping.
>
> I tried to keep the basic structure of the IOMMU code, and there are
> probably many better ways to fix the IOMMU contention. Really I'm just
> trying to compare with rbtrees, not fix the IOMMU.

There are also some research papers on RCU-friendly RB-trees (and other
trees). Maybe they would be interesting to try? The basic trick is to
use copy-on-write when you need to do a rotation. This is slightly
impractical because it requires memory allocation (or alternatively
doubles the memory footprint of an rb node), but you need to do
something similar in skiplists anyway. Just an idea...

								Honza

> Basic numbers (all aio/dio):
>
> Streaming writes
> IOMMU off	2,575 MB/s
> skiplist	1,715 MB/s
> rbtree	1,659 MB/s
>
> Not a huge improvement, but the CPU time was lower.
>
> 16 threads, iodepth 10, 20MB random reads
> IOMMU off	2,861 MB/s (mostly idle)
> skiplist	2,484 MB/s (100% system time)
> rbtree	   99 MB/s (100% system time)
>
> 16 threads, iodepth 10, 20MB random writes
> IOMMU off	2,548 MB/s
> skiplist	1,649 MB/s
> rbtree	   33 MB/s
>
> I ran this a bunch of times to make sure I got the rbtree numbers
> right. Lowering the thread count did improve the rbtree performance,
> but the best I could do was around 300 MB/s.
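[The copy-on-write rotation idea above can be sketched in plain C. This
is a userspace illustration under assumptions, not kernel code: the
node layout and function names are made up, and the single publishing
store stands in for rcu_assign_pointer(); real code would also defer
freeing the old nodes until after a grace period.]

```c
#include <stdlib.h>

struct cow_node {
	long key;
	struct cow_node *left, *right;
};

static struct cow_node *new_node(long key, struct cow_node *l,
				 struct cow_node *r)
{
	struct cow_node *n = malloc(sizeof(*n));

	n->key = key;
	n->left = l;
	n->right = r;
	return n;
}

/*
 * Right-rotate the subtree at *rootp without touching any node a
 * concurrent reader might be traversing: the two nodes that change are
 * copied off to the side, then the new subtree is published with a
 * single pointer store (rcu_assign_pointer() in the kernel).  Readers
 * see either the complete old subtree or the complete new one.
 */
static void cow_rotate_right(struct cow_node **rootp)
{
	struct cow_node *y = *rootp;
	struct cow_node *x = y->left;
	/* y' keeps y's key but adopts x's right subtree on its left */
	struct cow_node *new_y = new_node(y->key, x->right, y->right);
	/* x' becomes the new subtree root, pointing at y' */
	struct cow_node *new_x = new_node(x->key, x->left, new_y);

	*rootp = new_x;		/* the single publish point */
	/* old x and y are freed only after a grace period */
}
```

The cost Jan mentions is visible here: every rotation allocates two
nodes, but the old version stays intact for in-flight readers.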
> For skiplists, all of the CPU time is being spent in
> skiplist_insert_hole, which has a lot of room for improvement.
>
> From here the code needs a lot of testing, and I need to fill out the
> API to make it a little easier to use. But I wanted to send this
> around for comments and to see how others might want to use things.
> More details on all the internals are below.
>
> It starts with a basic skiplist implementation where:
>
> * skiplist nodes are dynamically sized. There is a slab cache for
>   each node height, so every node is actively using all of the
>   pointers allocated for it.
>
> * Each node has a back pointer as well as a forward pointer.
>
> * Each skiplist node stores an array of pointers to items. This is a
>   huge speed improvement because the array of keys is much more cache
>   friendly.
>
> Since the array of item pointers is something of an extension to the
> basic skiplist, I've separated them out into two structs. First
> sl_node (just the indexing pointers) and then sl_leaf, which has an
> sl_node and the array of item pointers.
>
> There's no cache benefit to the leaves if I'm just storing an array
> of pointers though, so it also has an array of keys:
>
> 	unsigned long keys[SKIP_KEYS_PER_NODE];
> 	struct sl_slot *ptrs[SKIP_KEYS_PER_NODE];
>
> The keys array is used for the first search pass, and based on that
> result I narrow things down and only have to follow one or two
> pointers from the corresponding ptrs array.
>
> The sl_slot is very simple:
>
> 	struct sl_slot {
> 		unsigned long key;
> 		unsigned long size;
> 	};
>
> The idea is to embed the sl_slot in your struct, giving us something
> like this:
>
>         sl_leaf
>     ________________
>    | node ptr N     |
>    |  ....          |
>    | node ptr 0     |
>    |________________|
>    |   |  keys   |  |
>    |___|_________|__|
>    |   |  ptrs   |  |
>    |___|_________|__|
>       /           \
>      /             \
>  -------------   -------------
>  | sl_slot 0 |   | sl_slot N |
>  |           |   |           |
>  |   data    |   |   data    |
>  -------------   -------------
>      your
>     struct
>
> This basic setup gives us similar performance to rbtrees, but uses
> less memory. The performance varies (a lot, really) with the size of
> the keys array.
>
> Locking is a mixture of RCU and per-node locking. For searches, it
> can be pure RCU, or it can use the per-node lock to synchronize the
> final search through the last leaf. But everything from the skiplist
> head to that final node is RCU either way.
>
> Insert locking is slightly different. Before the insert is started, a
> new node is preallocated. The height of this node is used to pick the
> level where we start locking nodes during the insert. Much of the
> time we're able to get through huge portions of the list without any
> locks.
>
> For deletes, it does an RCU search for the leaf and holds the
> per-node lock while we remove a specific slot. If the leaf is empty,
> it is marked as dead and then unlinked from the skiplist level by
> level.
>
> All the locking and retries does slow things down a bit, but the
> benefits for concurrent access make a big difference.
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html

-- 
Jan Kara <jack@xxxxxxx>
SUSE Labs, CR
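[The leaf layout Chris describes above — a dense keys array scanned
first, then a single pointer followed from the parallel ptrs array,
with sl_slot embedded in the caller's own struct — could look roughly
like this userspace sketch. The value of SKIP_KEYS_PER_NODE, the `nr`
field, the `extent` struct, and `leaf_lookup` are illustrative guesses,
not the actual patch.]

```c
#include <stddef.h>

#define SKIP_KEYS_PER_NODE 32

struct sl_slot {
	unsigned long key;
	unsigned long size;
};

/* Leaf: keys are scanned first (cache friendly), ptrs followed once. */
struct sl_leaf {
	int nr;			/* slots in use, keys[i] == ptrs[i]->key */
	unsigned long keys[SKIP_KEYS_PER_NODE];
	struct sl_slot *ptrs[SKIP_KEYS_PER_NODE];
};

/* The slot is embedded in the caller's struct, e.g. an extent record. */
struct extent {
	struct sl_slot slot;
	/* caller data follows */
};

/*
 * Find the slot whose [key, key + size) range covers 'key'.  Only the
 * contiguous keys array is touched during the scan; exactly one entry
 * of ptrs is dereferenced at the candidate position.
 */
static struct sl_slot *leaf_lookup(struct sl_leaf *leaf, unsigned long key)
{
	int i;

	for (i = 0; i < leaf->nr; i++) {
		struct sl_slot *slot;

		if (key < leaf->keys[i])
			break;	/* keys are sorted; no later match */
		slot = leaf->ptrs[i];
		if (key < leaf->keys[i] + slot->size)
			return slot;
	}
	return NULL;
}
```

This mirrors the quoted claim: the first search pass stays inside one
or two cache lines of keys, and only then is a pointer chased.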