Re: Truncate regression due to commit 69b6c1319b6

On Wed 27-02-19 04:24:51, Matthew Wilcox wrote:
> On Wed, Feb 27, 2019 at 12:27:21PM +0100, Jan Kara wrote:
> > On Tue 26-02-19 09:27:44, Matthew Wilcox wrote:
> > > On Tue, Feb 26, 2019 at 05:56:28PM +0100, Jan Kara wrote:
> > > > after some twists and turns, I was able to bisect down to a regression
> > > > in truncate performance caused by commit 69b6c1319b6 "mm: Convert
> > > > truncate to XArray".
> > > 
> > > [...]
> > > 
> > > > I've gathered also perf profiles but from the first look they don't show
> > > > anything surprising besides xas_load() and xas_store() taking up more time
> > > > than original counterparts did. I'll try to dig more into this but any idea
> > > > is appreciated.
> > > 
> > > Well, that's a short and sweet little commit.  Stripped of comment
> > > changes, it's just:
> > > 
> > > -       struct radix_tree_node *node;
> > > -       void **slot;
> > > +       XA_STATE(xas, &mapping->i_pages, index);
> > >  
> > > -       if (!__radix_tree_lookup(&mapping->i_pages, index, &node, &slot))
> > > +       xas_set_update(&xas, workingset_update_node);
> > > +       if (xas_load(&xas) != entry)
> > >                 return;
> > > -       if (*slot != entry)
> > > -               return;
> > > -       __radix_tree_replace(&mapping->i_pages, node, slot, NULL,
> > > -                            workingset_update_node);
> > > +       xas_store(&xas, NULL);
> > 
> > Yes, the code change is small. Because you split the changes up so
> > finely, the regression is easier to analyze, so thanks for that :)
> 
> That was why I split the patches so small.  I'm kind of glad it paid off ...
> not glad to have caused a performance regression!
> 
> > > I have a few reactions to this:
> > > 
> > > 1. I'm concerned that the XArray may generally be slower than the radix
> > > tree was.  I didn't notice that in my testing, but maybe I didn't do
> > > the right tests.
> > 
> > So one difference I've noticed when staring at the code and the annotated
> > perf traces is that xas_store() will call xas_init_marks() when the stored
> > entry is NULL. __radix_tree_replace() didn't do this, and the cache misses
> > we take from checking the tags do add up. After hacking xas_store() so that
> > __clear_shadow_entry() does not touch the tags, I get around half of the
> > regression back. I haven't yet figured out how to do this so that the API
> > remains reasonably clean. So now we are at:
> > 
> > COMMIT      AVG            STDDEV
> > a97e7904c0  1431256.500000 1489.361759
> > 69b6c1319b  1566944.000000 2252.692877
> > notaginit   1483740.700000 7680.583455
> 
> Well, that seems worth doing.  For the page cache case, we know that
> shadow entries have no tags set (at least right now), so it seems
> reasonable to move the xas_init_marks() from xas_store() to its various
> callers.
> 
> > > 2. The setup overhead of the XA_STATE might be a problem.
> > > If so, we can do some batching in order to improve things.
> > > I suspect your test is calling __clear_shadow_entry through the
> > > truncate_exceptional_pvec_entries() path, which is already a batch.
> > > Maybe something like patch [1] at the end of this mail.
> > 
> > So this apparently contributes as well, but not too much. With your patch
> > applied on top of the 'notaginit' kernel above, I got to:
> > 
> > batched-xas 1473900.300000 950.439377
> 
> Fascinating that it reduces the stddev so much.  We can probably take this

I would not read too much into the reduction in stddev. The high stddev
for 'notaginit' is caused by one relatively big outlier (otherwise we are
at ~2200). Still, the patch reduced the stddev to about half, so there is
some improvement.

> further (getting into the realm of #3 below) -- the call to xas_set() will
> restart the walk from the top of the tree each time.  Clearly this usecase
> (many thousands of shadow entries) is going to construct a very deep tree,
> and we're effectively doing a linear scan over the bottom of the tree, so
> starting from the top each time is O(n.log n) instead of O(n).  I think
> you said the file was 64GB, which is 16 million 4k entries, or 24 bits of
> tree index.  That's 4 levels deep so it'll add up.

Actually, the benchmark creates 64 files of 1GB each, so the depth of the
tree will be just 3 (1GB is 256k 4k entries, i.e. 18 bits of index). But
yes, traversing from the top of the tree each time only to zero out a
single slot looks wasteful. It seems we should be able to have a much more
efficient truncate implementation that would just trim a whole node's worth
of the xarray at once - workingset entries would be destroyed directly and
pages returned locked (provided they can be taken with a trylock).
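
Just to sketch the direction I mean (an illustration only, not a patch -
the function name, the locking and the accounting are simplified / made up
for the example), something like this would keep the cursor in the tree
while clearing a whole range of shadow entries, instead of redoing the
descent from the root for every single index:

/* Illustration only: clear all shadow (value) entries in [start, end]
 * with a single walk. Real code would also have to handle pages and the
 * pagevec-based batching / locking truncate currently uses, plus the
 * shadow entry accounting which is omitted here. */
static void clear_shadow_entries_range(struct address_space *mapping,
                                       pgoff_t start, pgoff_t end)
{
        XA_STATE(xas, &mapping->i_pages, start);
        void *entry;

        xas_set_update(&xas, workingset_update_node);
        xas_lock_irq(&xas);
        xas_for_each(&xas, entry, end) {
                if (!xa_is_value(entry))
                        continue;               /* leave real pages alone */
                xas_store(&xas, NULL);          /* cursor stays on this node */
        }
        xas_unlock_irq(&xas);
}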

Looking more into the perf traces, I didn't notice any other obvious
low-hanging fruit. There is one suboptimality in the fact that
__clear_shadow_entry() does an xas_load() and then the first thing
xas_store() does is another xas_load(). Removing this double tree traversal
does bring something back, but since the two traversals are so close
together everything is cache hot, so the overall gain is small (but still
measurable):

COMMIT     AVG            STDDEV
singleiter 1467763.900000 1078.261049

So there is still 34 ms to go to get back to the original time.
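
The idea is roughly the following (only a sketch - __xas_store_found() is
an invented name for a store variant that trusts the slot the cursor is
already positioned on; no such XArray helper exists today):

static void __clear_shadow_entry(struct address_space *mapping,
                                 pgoff_t index, void *entry)
{
        XA_STATE(xas, &mapping->i_pages, index);

        xas_set_update(&xas, workingset_update_node);
        if (xas_load(&xas) != entry)            /* the only descent */
                return;
        /* Hypothetical: store into the slot we just found instead of
         * letting xas_store() start with its own xas_load(). */
        __xas_store_found(&xas, NULL);
}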

What the profiles do show is slightly more time spent here and there,
adding up to a larger overall xas_store() time (compared to
__radix_tree_replace()), mostly due to what I'd attribute to cache misses
(xas_store() is responsible for ~3.4% of cache misses after the patch,
while xas_store() + __radix_tree_replace() together caused only 1.5%
before).

Some of the expensive loads seem to be from the 'xas' structure (which
kind of matches what Nick said); other expensive loads seem to come from
the xa_node. I don't have a great explanation for why e.g. a load of
xa_node->count is expensive when we looked at xa_node->shift just before -
either the cache line fell out of the cache or the byte accesses somehow
confuse the CPU. Also, xas_store() has some new accesses compared to
__radix_tree_replace() - e.g. the old code did not touch node->shift at
all.
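
(If I read the struct right, ->shift, ->offset, ->count and ->nr_values are
adjacent bytes at the very start of xa_node, so they should share a cache
line - which is exactly why the extra miss on ->count surprises me. A
trivial userspace toy mirroring just that prefix, purely to illustrate the
point; this is not kernel code and the field order is my assumption:)

/* Toy program: mirrors the assumed first bytes of struct xa_node and
 * prints their offsets to show they land in the same cache line. */
#include <stdio.h>
#include <stddef.h>

struct xa_node_prefix {
        unsigned char shift;            /* bits per slot at this level */
        unsigned char offset;           /* slot offset in the parent */
        unsigned char count;            /* total entry count */
        unsigned char nr_values;        /* value entry count */
};

int main(void)
{
        printf("shift at offset %zu, count at offset %zu\n",
               offsetof(struct xa_node_prefix, shift),
               offsetof(struct xa_node_prefix, count));
        return 0;
}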

So overall I don't see an easy way to speed up the xarray code further, so
perhaps batching truncate to make up for some of the losses, and living
with them where we cannot batch, will be as good as it gets...

								Honza

> > > 3. Perhaps we can actually get rid of truncate_exceptional_pvec_entries().
> > > It seems a little daft for page_cache_delete_batch() to skip value
> > > entries, only for truncate_exceptional_pvec_entries() to erase them in
> > > a second pass.  Truncation is truncation, and perhaps we can handle all
> > > of it in one place?
> > > 
> > > 4. Now that calling through a function pointer is expensive, thanks to
> > > Spectre/Meltdown/..., I've been considering removing the general-purpose
> > > update function, which is only used by the page cache.  Instead move parts
> > > of workingset.c into the XArray code and use a bit in the xa_flags to
> > > indicate that the node should be tracked on an LRU if it contains only
> > > value entries.
> > 
> > I agree these two are good ideas to improve the speed. But the old radix
> > tree code has these issues as well, so they are not the reason for this
> > regression. So I'd like to track down where the XArray code is slower
> > first.
> > 
> > I'm going to dig more into annotated profiles...
> 
> Thanks!  I'll work on a patch to remove the xas_init_marks() from xas_store().
-- 
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR



