Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> writes:

> On Mon, Dec 5, 2022 at 1:02 AM kernel test robot <yujie.liu@xxxxxxxxx> wrote:
>>
>> FYI, we noticed a -53.3% regression of will-it-scale.per_thread_ops due to commit:
>> 5df397dec7c4 ("mm: delay page_remove_rmap() until after the TLB has been flushed")
>
> Sadly, I think this may be at least partially expected.
>
> The code fundamentally moves one "loop over pages" and splits it up
> (with the TLB flush in between).
>
> Which can't be great for locality, but it's kind of fundamental for
> the fix - but some of it might be due to the batch limit logic.
>
> I wouldn't have expected it to actually show up in any real loads, but:
>
>> in testcase: will-it-scale
>> test: page_fault3
>
> I assume that this test is doing a lot of mmap/munmap on dirty shared
> memory regions (both because of the regression, and because of the
> name of that test ;)

I have checked the source code of will-it-scale/page_fault3.  Yes, it
does exactly that; I have appended a rough sketch of its inner loop at
the end of this mail.

> So this is hopefully an extreme case.
>
> Now, it's likely that this particular case probably also triggers that
>
>                 /* No more batching if we have delayed rmaps pending */
>
> which means that the loops in between the TLB flushes will be smaller,
> since we don't batch up as many pages as we used to before we force a
> TLB (and rmap) flush and free them.
>
> If it's due to that batching issue it may be fixable - I'll think
> about this some more, but
>
>> Details are as below:
>
> The bug it fixes ends up meaning that we run that rmap removal code
> _after_ the TLB flush, and it looks like this (probably combined with
> the batching limit) then causes some nasty iTLB load issues:
>
>>   2291312 ±  2%   +1452.8%   35580378 ±  4%  perf-stat.i.iTLB-loads
>
> although it also does look like it's at least partly due to some irq
> timing issue (and/or bad NUMA/CPU migration luck):
>
>>    388169           +267.4%    1426305 ±  6%  vmstat.system.in
>>    161.37            +84.9%     298.43 ±  6%  perf-stat.ps.cpu-migrations
>>    172442 ±  4%      +26.9%     218745 ±  8%  perf-stat.ps.node-load-misses
>
> so it might be that some of the regression comes down to "bad luck" -
> it happened to run really nicely on that particular machine, and then
> the timing changes caused some random "phase change" to the load.
>
> The profile doesn't actually seem to show all that much more IPI
> overhead, so maybe these incidental issues are what then causes that
> big regression.

      0.00            +8.5        8.49 ±  5%  perf-profile.calltrace.cycles-pp.flush_tlb_func.__flush_smp_call_function_queue.__sysvec_call_function.sysvec_call_function.asm_sysvec_call_function

From the perf profile, the cycles spent in TLB flushing increase a lot.
So I guess it may be related?

> It would be lovely to hear if you see this on other machines and/or loads.

I will ask the 0-Day folks to check this.

Best Regards,
Huang, Ying

> Because if it's a one-off, it's probably best ignored. If it shows up
> elsewhere, I think that batching logic might need looking at.
>
>                   Linus
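
For reference, here is a rough sketch of the will-it-scale page_fault3
per-thread loop as I understand it: map a shared file mapping,
write-fault every page to dirty it, then unmap the whole thing, over
and over.  This is paraphrased from memory, so the mapping size, file
handling and names (page_fault3_loop, MEMSIZE) are just illustrative,
not the exact will-it-scale source.

    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/mman.h>

    /* Size is an assumption; the real test may use a different value. */
    #define MEMSIZE (128UL * 1024 * 1024)

    /* Hypothetical stand-in for the page_fault3 testcase() loop. */
    static void page_fault3_loop(unsigned long long *iterations)
    {
            char template[] = "/tmp/willitscale.XXXXXX";
            unsigned long pgsize = getpagesize();
            unsigned long i;
            int fd = mkstemp(template);

            if (fd < 0 || ftruncate(fd, MEMSIZE) < 0)
                    exit(1);
            unlink(template);

            /* Runs forever; the harness samples *iterations periodically. */
            for (;;) {
                    /* Shared file mapping, so every write dirties a page. */
                    char *p = mmap(NULL, MEMSIZE, PROT_READ | PROT_WRITE,
                                   MAP_SHARED, fd, 0);

                    if (p == MAP_FAILED)
                            exit(1);

                    /* One write fault per page. */
                    for (i = 0; i < MEMSIZE; i += pgsize) {
                            p[i] = 1;
                            (*iterations)++;
                    }

                    /*
                     * munmap() of a large dirty MAP_SHARED region is where
                     * the TLB flush and the (now delayed) rmap removal
                     * happen, i.e. the path the commit above changes.
                     */
                    munmap(p, MEMSIZE);
            }
    }

    int main(void)
    {
            unsigned long long iterations = 0;

            page_fault3_loop(&iterations);
            return 0;
    }

So every iteration ends with an munmap() of many dirty shared pages,
which is exactly the mmap/munmap pattern described above.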