Matthew Wilcox wrote on February 27, 2019 at 3:27 am:
> On Tue, Feb 26, 2019 at 05:56:28PM +0100, Jan Kara wrote:
>> after some peripeties, I was able to bisect down to a regression in
>> truncate performance caused by commit 69b6c1319b6 "mm: Convert truncate to
>> XArray".
>
> [...]
>
>> I've gathered also perf profiles but from the first look they don't show
>> anything surprising besides xas_load() and xas_store() taking up more time
>> than original counterparts did. I'll try to dig more into this but any idea
>> is appreciated.
>
> Well, that's a short and sweet little commit. Stripped of comment
> changes, it's just:
>
> -	struct radix_tree_node *node;
> -	void **slot;
> +	XA_STATE(xas, &mapping->i_pages, index);
>
> -	if (!__radix_tree_lookup(&mapping->i_pages, index, &node, &slot))
> +	xas_set_update(&xas, workingset_update_node);
> +	if (xas_load(&xas) != entry)
>  		return;
> -	if (*slot != entry)
> -		return;
> -	__radix_tree_replace(&mapping->i_pages, node, slot, NULL,
> -			workingset_update_node);
> +	xas_store(&xas, NULL);
>
> I have a few reactions to this:
>
> 1. I'm concerned that the XArray may generally be slower than the radix
>    tree was. I didn't notice that in my testing, but maybe I didn't do
>    the right tests.
>
> 2. The setup overhead of the XA_STATE might be a problem. If so, we can
>    do some batching in order to improve things. I suspect your test is
>    calling __clear_shadow_entry through the
>    truncate_exceptional_pvec_entries() path, which is already a batch.
>    Maybe something like patch [1] at the end of this mail.

One nasty thing about the XA_STATE stack object, as opposed to just
passing the parameters (in the same order) down to children, is that the
same memory gets accessed nearby but in different ways (different base
register, offset, addressing mode, etc.), which can reduce the
effectiveness of memory disambiguation prediction, at least in the
cold-predictor case. I've seen (on some POWER CPUs at least) flushes due
to aliasing accesses in some of these xarray call chains, although I have
no idea whether that actually makes a noticeable difference in a
microbenchmark like this.

But it's not the greatest pattern to use for passing state to low-level
performance-critical functions :( Ideally the compiler could just do a
big LTO pass right at the end, unwind it all back into registers, and
fix everything, but that will never happen.

Thanks,
Nick
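
To make the parameter-passing contrast concrete, here is a minimal,
hypothetical sketch of the two calling patterns being discussed; struct
state, lookup_via_state(), lookup_via_args() and do_lookup() are
illustrative stand-ins, not actual kernel code:

	/* Stand-in for the real lookup work. */
	static void *do_lookup(void *array, unsigned long index)
	{
		(void)index;
		return array;
	}

	/*
	 * Pattern 1: state bundled into a stack object, analogous to what
	 * XA_STATE does. The caller stores the fields into its stack frame,
	 * then the callee loads them back through a pointer (different base
	 * register, offset, addressing mode). The CPU must predict whether
	 * those loads alias the earlier stores, and a cold or wrong
	 * prediction can force a pipeline flush.
	 */
	struct state {
		void *array;
		unsigned long index;
	};

	static void *lookup_via_state(struct state *s)
	{
		return do_lookup(s->array, s->index);
	}

	/*
	 * Pattern 2: the same values passed directly as arguments. Both
	 * travel in registers and no stack memory is touched, so there is
	 * nothing for the disambiguation predictor to get wrong.
	 */
	static void *lookup_via_args(void *array, unsigned long index)
	{
		return do_lookup(array, index);
	}

XA_STATE follows the first pattern: it declares a struct xa_state on the
caller's stack, and every xas_*() helper then takes a pointer to it.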