On Thu, Mar 14, 2019 at 09:43:43AM -0700, Davidlohr Bueso wrote: > On Wed, 13 Mar 2019, Matthew Wilcox wrote: > > > It's probably worth listing the advantages of the Maple Tree over the > > rbtree. > > I'm not familiar with maple trees, are they referred to by another name? > (is this some sort of B-tree?). Google just shows me real trees. It is a B-tree variant which supports ranges as a first-class citizen (single elements are ranges of length 1). It optimises for ranges which are adjacent to each other, and does not support overlapping ranges. It supports RCU lookup and embeds a spinlock which must be held for modification. There's a lot of detail I can go into, but let's leave it at that for an introduction. > > - Shallower tree. A 1000-entry rbtree is 10 levels deep. A 1000-entry > > Maple Tree is 5 levels deep (I did a more detailed analysis in an > > earlier email thread with Laurent and I can present it if needed). > > I'd be interested in reading on that. (see the last two paragraphs of the mail for that analysis) > > - O(1) prev/next > > - Lookups under the RCU lock > > > > There're some second-order effects too; by using externally allocated > > nodes, we avoid disturbing other VMAs when inserting/deleting, and we > > avoid bouncing cachelines around (eg the VMA which happens to end up > > at the head of the tree is accessed by every lookup in the tree because > > it's on the way to every other node). > > How would maple trees deal with the augmented vma tree (vma gaps) trick > we use to optimize get_unmapped_area? The fundamental unit of the Maple Tree is a 128-byte node. A leaf node is laid out like this: struct maple_range_64 { struct maple_node *parent; void __rcu *slot[8]; u64 pivot[7]; }; The pivots are stored in ascending order; if the search index is less than pivot[i], then the value (ie the vma pointer) you are searching for is stored in slot[i]. Non-leaf nodes (for trees which support range allocations) are laid out like this: struct maple_arange_64 { struct maple_node *parent; u64 gaps[5]; void __rcu *slot[5]; u64 pivot[4]; }; gaps[i] stores the largest run of NULL pointers in the subtree rooted at slot[i]. When searching for an empty range of at least N, you can skip any subtree which has gaps[i] < N. Here's a simple case: $ ldd `which cat` linux-vdso.so.1 (0x00007ffc867fc000) libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f2c8cc6e000) /lib64/ld-linux-x86-64.so.2 (0x00007f2c8ce69000) 'cat /proc/self/maps | wc' gives me 25 mappings. They look like this: $ cat /proc/self/maps |cut -f 1 -d ' ' 55c414785000-55c414787000 55c414787000-55c41478c000 55c41478c000-55c41478e000 55c41478f000-55c414790000 55c414790000-55c414791000 55c4159dd000-55c4159fe000 7fa5f6527000-7fa5f680c000 7fa5f680c000-7fa5f682e000 7fa5f682e000-7fa5f6976000 7fa5f6976000-7fa5f69c2000 7fa5f69c2000-7fa5f69c3000 7fa5f69c3000-7fa5f69c7000 7fa5f69c7000-7fa5f69c9000 7fa5f69c9000-7fa5f69cd000 7fa5f69cd000-7fa5f69cf000 7fa5f69d9000-7fa5f69fb000 7fa5f69fb000-7fa5f69fc000 7fa5f69fc000-7fa5f6a1a000 7fa5f6a1a000-7fa5f6a22000 7fa5f6a22000-7fa5f6a23000 7fa5f6a23000-7fa5f6a24000 7fa5f6a24000-7fa5f6a25000 7ffe54a3c000-7ffe54a5d000 7ffe54a7b000-7ffe54a7e000 7ffe54a7e000-7ffe54a80000 We'd represent this in the Maple Tree as: 0-55c414785000 -> NULL 55c414785000-55c414787000 -> vma 55c414787000-55c41478c000 -> vma ... 55c414790000-55c414791000 -> vma 55c414791000-7fa5f6527000 -> NULL 7fa5f6527000-7fa5f680c000 -> vma ... 7fa5f69cd000-7fa5f69cf000 -> vma 7fa5f69cf000-7fa5f69d9000 -> NULL 7fa5f69d9000-7fa5f69fb000 -> vma ... 7fa5f6a24000-7fa5f6a25000 -> vma 7fa5f6a25000-7ffe54a3c000 -> NULL 7ffe54a3c000-7ffe54a5d000 -> vma 7ffe54a5d000-7ffe54a7b000 -> NULL 7ffe54a7b000-7ffe54a7e000 -> vma 7ffe54a7e000-7ffe54a80000 -> vma 7ffe54a80000-ffffffffffff -> NULL so the maple tree stores 6 ranges that point to NULL in addition to the 25 that're stored by the rbtree. Because they're allocated sequentially, there won't be any wastage in the maple tree caused by items shifting around. That means we'll get 8 per leaf node, so just 4 leaf nodes needed to store 31 ranges, all stored in a single root node. That means to get from root to an arbitrary VMA is just 3 pointer dereferences, versus 3.96 pointer dereferences for an optimally balanced rbtree. For a process with 1000 VMAs, we'll have approximately 167 leaf nodes (assuming approximately 6 of the 8 pointers are used per node) arranged into a tree of height 5, with about 44 non-leaf nodes needed to manage those 166 leaf-nodes. That'll be 6 pointers to follow per walk of the tree (if it were optimally arranged, it'd be 125 leaf nodes plus 25 + 5 + 1 non-leaf nodes and 5 pointers to follow, but it's unrealistic to assume it'll be optimally arranged, and this neglects the NULL ranges which will also need to be stored). The rbtree has a 1/1000 chance of 1 pointer dereference, a 2/1000 chance of 2 pointers, 4/1000 chance of 3 pointers, 8/1000 4 pointers, 16/1000 5 pointers, 32/1000 6 pointers, 64/1000 7 pointers, 128/1000 8 pointers, 256/1000 9 pointers, 489/1000 10 pointers. Amortised, that's 8.987 pointers to look up a random VMA (assuming the rbtree is fully balanced; I haven't checked how unbalanced the rbtree can actually become).