On Wed, Dec 11, 2019 at 06:47:25PM -0800, Linus Torvalds wrote:
> On Wed, Dec 11, 2019 at 5:56 PM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
> > It _should_ be the same order of complexity.  Since there's already
> > a page in the page cache, xas_create() just walks its way down to the
> > right node calling xas_descend() and then xas_store() does the
> > equivalent of __radix_tree_replace().  I don't see a bug that would
> > make it more expensive than the old code ... a 10GB file is going to
> > have four levels of radix tree node, so it shouldn't even spend that
> > long looking for the right node.
>
> The profile seems to say that 85% of the cost of xas_create() is two
> instructions:
>
>   # lib/xarray.c:143:   return (index >> node->shift) & XA_CHUNK_MASK;
>         movzbl  (%rsi), %ecx    # MEM[(unsigned char *)node_13], MEM[(unsigned char *)node_13]
> ...
>   # ./include/linux/xarray.h:1145:      return rcu_dereference_check(node->slots[offset],
>   # ./include/linux/compiler.h:199:     __READ_ONCE_SIZE;
>         movq    (%rax), %rax    # MEM[(volatile __u64 *)_80], <retval>
>
> where that first load is "node->shift", and the second load seems to
> be just the node dereference.
>
> I think both of them are basically just xas_descend(), but it's a bit
> hard to read the several levels of inlining.

Yep, those are basically the two cache misses you get for each level of
the tree.  I'd expect the node->shift load to be slightly less costly,
because sometimes node->slots[offset] is going to be in the same
cacheline as node->shift, or in the next cacheline after it.  (A
simplified sketch of the descent step is at the end of this mail.)

> I suspect the real issue is that going through kswapd means we've lost
> almost all locality, and with the random walking the LRU list is
> basically entirely unordered so you don't get any advantage of the
> xarray having (otherwise) possibly good cache patterns.
>
> So we're just missing in the caches all the time, making it expensive.

I have some bad ideas for improving it.

1. We could semi-sort the pages on the LRU list.  If we know we're going
to remove a bunch of pages, we could take a batch of them off the list,
sort them and remove them in order.  This probably wouldn't be terribly
effective.  (Rough sketch below.)

2. We could change struct page to point to the xa_node that holds it.
Looking up the page's mapping would then be page->xa_node->array, plus
an offsetof(i_pages) subtraction to get back to the mapping.  (Sketch
below.)

3. Once the maple tree is ready, a 10GB file can actually be held more
efficiently in a maple tree than in the radix tree.  Because each level
of the tree is 128 bytes and (Intel) CPUs fetch a pair of cachelines,
there's only one cache miss per level of the tree.  So despite the tree
being deeper, there are fewer cache misses.  With an average of 6
pointers per level of the tree, terminating in a dense node, we'd see:

: 6 * 6 * 6 * 6 * 6 * 6 * 6 * 15
: 4199040

(a 10GB file contains 2621440 4kB pages), which is still eight cache
misses (the current four-level radix tree also costs eight: two per
level), so to see an improvement we'd need to implement another planned
improvement, which is allocating a large, dense leaf node of 4kB.  That
would reduce the height of the tree by 2:

: 6 * 6 * 6 * 6 * 6 * 500
: 3888000

4. For consecutive accesses, I'm working on allocating larger pages
(16kB, 64kB, 256kB, ...).  That will help by reducing the number of
deletions from the page cache by a factor of 4/16/64/...
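
For reference, the descent step being profiled is essentially the
following (simplified from lib/xarray.c; sibling-entry handling elided,
so treat it as a sketch rather than the exact code).  The two annotated
loads are the two cache misses per level:

#include <linux/xarray.h>

/* Simplified from lib/xarray.c; sibling entries elided. */
static unsigned int get_offset(unsigned long index, struct xa_node *node)
{
	return (index >> node->shift) & XA_CHUNK_MASK;	/* miss 1: node->shift */
}

static void *xas_descend(struct xa_state *xas, struct xa_node *node)
{
	unsigned int offset = get_offset(xas->xa_index, node);
	void *entry = xa_entry(xas->xa, node, offset);	/* miss 2: node->slots[offset] */

	xas->xa_node = node;
	xas->xa_offset = offset;
	return entry;
}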
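
To make (1) concrete, something like this (untested, obviously; the
code that gathers the victim batch is assumed to exist, and the callers
would hold the page locks as delete_from_page_cache() requires).
Sorting by (mapping, index) keeps pages of the same file adjacent, so
successive deletions revisit nodes that are still cache-warm:

#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/sort.h>

static int page_locality_cmp(const void *a, const void *b)
{
	const struct page *pa = *(const struct page * const *)a;
	const struct page *pb = *(const struct page * const *)b;

	/* Group by file first, then by offset within the file. */
	if (pa->mapping != pb->mapping)
		return pa->mapping < pb->mapping ? -1 : 1;
	if (pa->index != pb->index)
		return pa->index < pb->index ? -1 : 1;
	return 0;
}

static void remove_batch_in_order(struct page **pages, unsigned int nr)
{
	unsigned int i;

	sort(pages, nr, sizeof(struct page *), page_locality_cmp, NULL);
	for (i = 0; i < nr; i++)
		delete_from_page_cache(pages[i]);
}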
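
And for (2), the lookup would be something like this.  The page->xa_node
back-pointer is the hypothetical new field; xa_node->array and the
i_pages xarray embedded in struct address_space are real:

#include <linux/fs.h>
#include <linux/xarray.h>

static struct address_space *page_mapping_via_node(struct page *page)
{
	struct xarray *xa = page->xa_node->array;	/* xa_node is hypothetical */

	/* i_pages is embedded in struct address_space, so subtracting
	 * offsetof(struct address_space, i_pages) recovers the mapping.
	 */
	return container_of(xa, struct address_space, i_pages);
}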