Re: Truncate regression due to commit 69b6c1319b6

On Wed, Feb 27, 2019 at 12:27:21PM +0100, Jan Kara wrote:
> On Tue 26-02-19 09:27:44, Matthew Wilcox wrote:
> > On Tue, Feb 26, 2019 at 05:56:28PM +0100, Jan Kara wrote:
> > > after some twists and turns, I was able to bisect down to a regression in
> > > truncate performance caused by commit 69b6c1319b6 "mm: Convert truncate to
> > > XArray".
> > 
> > [...]
> > 
> > > I've gathered also perf profiles but from the first look they don't show
> > > anything surprising besides xas_load() and xas_store() taking up more time
> > > than original counterparts did. I'll try to dig more into this but any idea
> > > is appreciated.
> > 
> > Well, that's a short and sweet little commit.  Stripped of comment
> > changes, it's just:
> > 
> > -       struct radix_tree_node *node;
> > -       void **slot;
> > +       XA_STATE(xas, &mapping->i_pages, index);
> >  
> > -       if (!__radix_tree_lookup(&mapping->i_pages, index, &node, &slot))
> > +       xas_set_update(&xas, workingset_update_node);
> > +       if (xas_load(&xas) != entry)
> >                 return;
> > -       if (*slot != entry)
> > -               return;
> > -       __radix_tree_replace(&mapping->i_pages, node, slot, NULL,
> > -                            workingset_update_node);
> > +       xas_store(&xas, NULL);
> 
> Yes, the code change is small. And because you split the changes so finely,
> the regression is much easier to analyze, so thanks for that :)

That's why I split the patches up so finely.  I'm kind of glad it paid off ...
though not glad to have caused a performance regression!

> > I have a few reactions to this:
> > 
> > 1. I'm concerned that the XArray may generally be slower than the radix
> > tree was.  I didn't notice that in my testing, but maybe I didn't do
> > the right tests.
> 
> So one difference I've noticed when staring at the code and the annotated
> perf traces is that xas_store() will call xas_init_marks() when the stored
> entry is NULL. __radix_tree_replace() didn't do this, and the cache misses
> we take from checking tags do add up. After hacking the code in xas_store()
> so that __clear_shadow_entry() does not touch tags, I get around half of the
> regression back. I haven't yet worked out how to do this so that the API
> remains reasonably clean. So now we are at:
> 
> COMMIT      AVG            STDDEV
> a97e7904c0  1431256.500000 1489.361759
> 69b6c1319b  1566944.000000 2252.692877
> notaginit   1483740.700000 7680.583455

Well, that seems worth doing.  For the page cache case, we know that
shadow entries have no tags set (at least right now), so it seems
reasonable to move the xas_init_marks() from xas_store() to its various
callers.

> > 2. The setup overhead of the XA_STATE might be a problem.
> > If so, we can do some batching in order to improve things.
> > I suspect your test is calling __clear_shadow_entry through the
> > truncate_exceptional_pvec_entries() path, which is already a batch.
> > Maybe something like patch [1] at the end of this mail.
> 
> So this apparently contributes as well but not too much. With your patch
> applied on top of 'notaginit' kernel above I've got to:
> 
> batched-xas 1473900.300000 950.439377

Fascinating that it reduces the stddev so much.  We can probably take this
further (getting into the realm of #3 below) -- the call to xas_set() will
restart the walk from the top of the tree each time.  Clearly this use case
(many thousands of shadow entries) is going to construct a very deep tree,
and we're effectively doing a linear scan over the bottom of the tree, so
starting from the top each time is O(n log n) instead of O(n).  I think
you said the file was 64GB, which is 16 million 4k entries, or 24 bits of
tree index.  That's 4 levels deep, so it'll add up.

> > 3. Perhaps we can actually get rid of truncate_exceptional_pvec_entries().
> > It seems a little daft for page_cache_delete_batch() to skip value
> > entries, only for truncate_exceptional_pvec_entries() to erase them in
> > a second pass.  Truncation is truncation, and perhaps we can handle all
> > of it in one place?
> > 
> > 4. Now that calling through a function pointer is expensive, thanks to
> > Spectre/Meltdown/..., I've been considering removing the general-purpose
> > update function, which is only used by the page cache.  Instead move parts
> > of workingset.c into the XArray code and use a bit in the xa_flags to
> > indicate that the node should be tracked on an LRU if it contains only
> > value entries.
> 
> I agree these two are good ideas to improve the speed. But the old radix
> tree code had these issues as well, so they are not the cause of this
> regression. I'd like to first track down where the XArray code is slower.
> 
> I'm going to dig more into annotated profiles...

Thanks!  I'll work on a patch to remove the xas_init_marks() from xas_store().



