On Mon, Dec 5, 2022 at 1:02 AM kernel test robot <yujie.liu@xxxxxxxxx> wrote:
>
> FYI, we noticed a -53.3% regression of will-it-scale.per_thread_ops due to commit:
>
> 5df397dec7c4 ("mm: delay page_remove_rmap() until after the TLB has been flushed")

Sadly, I think this may be at least partially expected.

The code fundamentally moves one "loop over pages" and splits it up
(with the TLB flush in between). That can't be great for locality, but
it's kind of fundamental to the fix - although some of it might be due
to the batch limit logic.

I wouldn't have expected it to actually show up in any real loads, but:

> in testcase: will-it-scale
> test: page_fault3

I assume that this test is doing a lot of mmap/munmap on dirty shared
memory regions (both because of the regression, and because of the
name of that test ;)

So this is hopefully an extreme case.

Now, it's likely that this particular case also triggers that

        /* No more batching if we have delayed rmaps pending */

which means that the loops in between the TLB flushes will be smaller,
since we don't batch up as many pages as we used to before we force a
TLB (and rmap) flush and free them.

If it's due to that batching issue, it may be fixable - I'll think
about this some more, but

> Details are as below:

The fix ends up meaning that we run that rmap removal code _after_ the
TLB flush, and it looks like this (probably combined with the batching
limit) then causes some nasty iTLB load issues:

>    2291312 ± 2%   +1452.8%   35580378 ± 4%   perf-stat.i.iTLB-loads

although it also does look like it's at least partly due to some irq
timing issue (and/or bad NUMA/CPU migration luck):

>     388169          +267.4%    1426305 ± 6%   vmstat.system.in
>     161.37           +84.9%     298.43 ± 6%   perf-stat.ps.cpu-migrations
>     172442 ± 4%      +26.9%     218745 ± 8%   perf-stat.ps.node-load-misses

so it might be that some of the regression comes down to "bad luck" -
it happened to run really nicely on that particular machine, and then
the timing changes caused some random "phase change" to the load.

The profile doesn't actually seem to show all that much more IPI
overhead, so maybe these incidental issues are what then causes that
big regression.

It would be lovely to hear if you see this on other machines and/or
loads. Because if it's a one-off, it's probably best ignored. If it
shows up elsewhere, I think that batching logic might need looking at.

              Linus
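
PS: to make the "split the loop" part above a bit more concrete, the
old munmap path did, very roughly,

        for each pte in the range:
                clear the pte
                if it was a dirty shared page:
                        set_page_dirty()
                page_remove_rmap(page)
                remember the page in the mmu_gather batch
        ... flush the TLB ...
        free the batched pages

while with the fix it's more like

        for each pte in the range:
                clear the pte
                if it was a dirty shared page:
                        set_page_dirty()
                        mark the page for delayed rmap
                else:
                        page_remove_rmap(page)
                remember the page in the mmu_gather batch
        ... flush the TLB ...
        for each batched page marked for delayed rmap:
                page_remove_rmap(page)
        free the batched pages

That's pseudocode from memory, not the literal zap_pte_range() code,
but it's the shape of the change, and it's why there's now an extra
pass over the batched pages after the flush.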
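
PPS: the comment I quoted above is the new check at the top of
tlb_next_batch() in mm/mmu_gather.c, which (again going from memory,
so don't trust the exact form) is essentially

        /* No more batching if we have delayed rmaps pending */
        if (tlb->delayed_rmap)
                return false;

i.e. once a gather has any delayed rmap work queued, we refuse to
chain another batch page, so we hit the flush-and-free point after at
most one batch worth of pages instead of a long chain of them. That's
the part I suspect could be relaxed if this regression shows up
elsewhere too.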