Re: [PATCHSET v3 0/5] Support for RWF_UNCACHED

Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> · Wed, 11 Dec 2019 18:47:25 -0800

On Wed, Dec 11, 2019 at 5:56 PM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
>
> It _should_ be the same order of complexity.  Since there's already
> a page in the page cache, xas_create() just walks its way down to the
> right node calling xas_descend() and then xas_store() does the equivalent
> of __radix_tree_replace().  I don't see a bug that would make it more
> expensive than the old code ... a 10GB file is going to have four levels
> of radix tree node, so it shouldn't even spend that long looking for
> the right node.

The profile seems to say that 85% of the cost of xas_create() is two
instructions:

# lib/xarray.c:143:     return (index >> node->shift) & XA_CHUNK_MASK;
        movzbl  (%rsi), %ecx    # MEM[(unsigned char *)node_13],
MEM[(unsigned char *)node_13]
...
# ./include/linux/xarray.h:1145:        return
rcu_dereference_check(node->slots[offset],
# ./include/linux/compiler.h:199:       __READ_ONCE_SIZE;
        movq    (%rax), %rax    # MEM[(volatile __u64 *)_80], <retval>

where that first load is "node->shift", and the second load seems to
be just the node dereference.

I think both of them are basically just xas_descend(), but it's a bit
hard to read the several levels of inlining.

I suspect the real issue is that going through kswapd means we've lost
almost all locality, and with the random walking the LRU list is
basically entirely unordered so you don't get any advantage of the
xarray having (otherwise) possibly good cache patterns.

So we're just missing in the caches all the time, making it expensive.

              Linus