Matthew Wilcox wrote on February 27, 2019 at 3:27 am:
> On Tue, Feb 26, 2019 at 05:56:28PM +0100, Jan Kara wrote:
>> after some peripeties, I was able to bisect down to a regression in
>> truncate performance caused by commit 69b6c1319b6 "mm: Convert truncate to
>> XArray".
>
> [...]
>
>> I've gathered also perf profiles but from the first look they don't show
>> anything surprising besides xas_load() and xas_store() taking up more time
>> than original counterparts did. I'll try to dig more into this but any idea
>> is appreciated.
>
> Well, that's a short and sweet little commit. Stripped of comment
> changes, it's just:
>
> -	struct radix_tree_node *node;
> -	void **slot;
> +	XA_STATE(xas, &mapping->i_pages, index);
>
> -	if (!__radix_tree_lookup(&mapping->i_pages, index, &node, &slot))
> +	xas_set_update(&xas, workingset_update_node);
> +	if (xas_load(&xas) != entry)
>  		return;
> -	if (*slot != entry)
> -		return;
> -	__radix_tree_replace(&mapping->i_pages, node, slot, NULL,
> -			workingset_update_node);
> +	xas_store(&xas, NULL);
>
> I have a few reactions to this:
>
> 1. I'm concerned that the XArray may generally be slower than the radix
>    tree was. I didn't notice that in my testing, but maybe I didn't do
>    the right tests.
>
> 2. The setup overhead of the XA_STATE might be a problem. If so, we can
>    do some batching in order to improve things. I suspect your test is
>    calling __clear_shadow_entry through the
>    truncate_exceptional_pvec_entries() path, which is already a batch.
>    Maybe something like patch [1] at the end of this mail.

One nasty thing about the XA_STATE stack object, as opposed to just
passing the parameters (in the same order) down to children, is that the
same memory gets accessed nearby but in different ways (different base
register, offset, addressing mode, etc.), which can reduce the
effectiveness of memory disambiguation prediction, at least in the
cold-predictor case. I've seen (on some POWER CPUs at least) flushes due
to aliasing accesses in some of these xarray call chains, although I have
no idea whether that actually makes a noticeable difference in a
microbenchmark like this.

But it's not the greatest pattern to use for passing state to low-level
performance-critical functions :( Ideally the compiler could just do a
big LTO pass right at the end, unwind it all back into registers, and
fix everything, but that will never happen.

Thanks,
Nick
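
To make the parameter-passing contrast concrete, here is a minimal,
hypothetical sketch of the two calling patterns being discussed; struct
state, lookup_via_state(), lookup_via_args() and do_lookup() are
illustrative stand-ins, not actual kernel code:

	/* Stand-in for the real lookup work. */
	static void *do_lookup(void *array, unsigned long index)
	{
		(void)index;
		return array;
	}

	/*
	 * Pattern 1: state bundled into a stack object, analogous to what
	 * XA_STATE does. The caller stores the fields into its stack frame,
	 * then the callee loads them back through a pointer (different base
	 * register, offset, addressing mode). The CPU must predict whether
	 * those loads alias the earlier stores, and a cold or wrong
	 * prediction can force a pipeline flush.
	 */
	struct state {
		void *array;
		unsigned long index;
	};

	static void *lookup_via_state(struct state *s)
	{
		return do_lookup(s->array, s->index);
	}

	/*
	 * Pattern 2: the same values passed directly as arguments. Both
	 * travel in registers and no stack memory is touched, so there is
	 * nothing for the disambiguation predictor to get wrong.
	 */
	static void *lookup_via_args(void *array, unsigned long index)
	{
		return do_lookup(array, index);
	}

XA_STATE follows the first pattern: it declares a struct xa_state on the
caller's stack, and every xas_*() helper then takes a pointer to it.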