Re: Truncate regression due to commit 69b6c1319b6

On Tue 26-02-19 09:27:44, Matthew Wilcox wrote:
> On Tue, Feb 26, 2019 at 05:56:28PM +0100, Jan Kara wrote:
> > after some twists and turns, I was able to bisect a regression in
> > truncate performance down to commit 69b6c1319b6 "mm: Convert truncate
> > to XArray".
> 
> [...]
> 
> > I've gathered also perf profiles but from the first look they don't show
> > anything surprising besides xas_load() and xas_store() taking up more time
> > than original counterparts did. I'll try to dig more into this but any idea
> > is appreciated.
> 
> Well, that's a short and sweet little commit.  Stripped of comment
> changes, it's just:
> 
> -       struct radix_tree_node *node;
> -       void **slot;
> +       XA_STATE(xas, &mapping->i_pages, index);
>  
> -       if (!__radix_tree_lookup(&mapping->i_pages, index, &node, &slot))
> +       xas_set_update(&xas, workingset_update_node);
> +       if (xas_load(&xas) != entry)
>                 return;
> -       if (*slot != entry)
> -               return;
> -       __radix_tree_replace(&mapping->i_pages, node, slot, NULL,
> -                            workingset_update_node);
> +       xas_store(&xas, NULL);

Yes, the code change is small. Because you split the conversion into such
fine-grained commits, the regression was much easier to bisect and analyze,
so thanks for that :)

> I have a few reactions to this:
> 
> 1. I'm concerned that the XArray may generally be slower than the radix
> tree was.  I didn't notice that in my testing, but maybe I didn't do
> the right tests.

One difference I've noticed when staring at the code and the annotated
perf traces is that xas_store() calls xas_init_marks() when the stored
entry is NULL, which __radix_tree_replace() did not do, and the cache
misses we take from touching the tags do add up. After hacking the code so
that __clear_shadow_entry() does not touch tags (a sketch follows the
numbers below), I get around half of the regression back. I haven't yet
figured out how to do this while keeping the API reasonably clean. So now
we are at:

COMMIT      AVG            STDDEV
a97e7904c0  1431256.500000 1489.361759
69b6c1319b  1566944.000000 2252.692877
notaginit   1483740.700000 7680.583455
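
For the record, the hack is along these lines. xas_store_noinit() is a
name I made up for illustration; the real change will need a cleaner shape:

	/*
	 * Illustrative hack only: a hypothetical xas_store_noinit()
	 * that behaves like xas_store() but skips xas_init_marks().
	 * Shadow entries never have marks set, so clearing the marks
	 * again when storing NULL only costs us extra cache misses.
	 */
	static void __clear_shadow_entry(struct address_space *mapping,
					 pgoff_t index, void *entry)
	{
		XA_STATE(xas, &mapping->i_pages, index);

		xas_set_update(&xas, workingset_update_node);
		if (xas_load(&xas) != entry)
			return;
		xas_store_noinit(&xas, NULL);	/* hypothetical variant */
	}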

> 2. The setup overhead of the XA_STATE might be a problem.
> If so, we can do some batching in order to improve things.
> I suspect your test is calling __clear_shadow_entry through the
> truncate_exceptional_pvec_entries() path, which is already a batch.
> Maybe something like patch [1] at the end of this mail.

So this contributes as well, but not as much. With your patch applied on
top of the 'notaginit' kernel above I got:

batched-xas 1473900.300000 950.439377
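
For readers without patch [1] in front of them, the batching idea inside
truncate_exceptional_pvec_entries() is roughly the following shape (my
sketch with the DAX handling elided, not necessarily your actual patch):
set up one XA_STATE and take the lock once instead of once per entry:

	XA_STATE(xas, &mapping->i_pages, indices[j]);

	xas_set_update(&xas, workingset_update_node);
	xas_lock_irq(&xas);
	for (i = j; i < pagevec_count(pvec); i++) {
		struct page *page = pvec->pages[i];

		/* Only shadow / value entries are handled here. */
		if (!xa_is_value(page))
			continue;
		xas_set(&xas, indices[i]);
		if (xas_load(&xas) == page)
			xas_store(&xas, NULL);
	}
	xas_unlock_irq(&xas);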

> 3. Perhaps we can actually get rid of truncate_exceptional_pvec_entries().
> It seems a little daft for page_cache_delete_batch() to skip value
> entries, only for truncate_exceptional_pvec_entries() to erase them in
> a second pass.  Truncation is truncation, and perhaps we can handle all
> of it in one place?
> 
> 4. Now that calling through a function pointer is expensive, thanks to
> Spectre/Meltdown/..., I've been considering removing the general-purpose
> update function, which is only used by the page cache.  Instead move parts
> of workingset.c into the XArray code and use a bit in the xa_flags to
> indicate that the node should be tracked on an LRU if it contains only
> value entries.

I agree these two are good ideas for improving the speed. But the old
radix tree code had the same issues, so they are not the cause of this
regression. I'd like to track down where the XArray code is slower first.
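
(Just so we are talking about the same thing, I read idea 4 as roughly
the following; the flag and the helpers are made-up names, nothing like
this exists today:

	/*
	 * Hypothetical: the page cache sets a flag on the array at
	 * init time instead of registering a per-operation callback...
	 */
	xa_init_flags(&mapping->i_pages,
		      XA_FLAGS_LOCK_IRQ | XA_FLAGS_TRACK_LRU);

	/*
	 * ...and the node update path in the XArray core checks the
	 * flag directly instead of making an indirect call:
	 */
	if (xa_track_lru(xas->xa) && xa_node_only_values(node))
		xa_node_lru_add(node);

That would indeed avoid the cost of the indirect call.)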

I'm going to dig more into annotated profiles...

								Honza
-- 
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR



