On Wed, Dec 11, 2019 at 5:56 PM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote: > > It _should_ be the same order of complexity. Since there's already > a page in the page cache, xas_create() just walks its way down to the > right node calling xas_descend() and then xas_store() does the equivalent > of __radix_tree_replace(). I don't see a bug that would make it more > expensive than the old code ... a 10GB file is going to have four levels > of radix tree node, so it shouldn't even spend that long looking for > the right node. The profile seems to say that 85% of the cost of xas_create() is two instructions: # lib/xarray.c:143: return (index >> node->shift) & XA_CHUNK_MASK; movzbl (%rsi), %ecx # MEM[(unsigned char *)node_13], MEM[(unsigned char *)node_13] ... # ./include/linux/xarray.h:1145: return rcu_dereference_check(node->slots[offset], # ./include/linux/compiler.h:199: __READ_ONCE_SIZE; movq (%rax), %rax # MEM[(volatile __u64 *)_80], <retval> where that first load is "node->shift", and the second load seems to be just the node dereference. I think both of them are basically just xas_descend(), but it's a bit hard to read the several levels of inlining. I suspect the real issue is that going through kswapd means we've lost almost all locality, and with the random walking the LRU list is basically entirely unordered so you don't get any advantage of the xarray having (otherwise) possibly good cache patterns. So we're just missing in the caches all the time, making it expensive. Linus