Re: [lkp-robot] [mm] 7674270022: will-it-scale.per_process_ops -19.3% regression

Nadav Amit <nadav.amit@xxxxxxxxx> · Mon, 7 Aug 2017 22:51:00 -0700

Nadav Amit <nadav.amit@xxxxxxxxx> wrote:

> Minchan Kim <minchan@xxxxxxxxxx> wrote:
> 
>> Hi,
>> 
>> On Tue, Aug 08, 2017 at 09:19:23AM +0800, kernel test robot wrote:
>>> Greetings,
>>> 
>>> FYI, we noticed a -19.3% regression of will-it-scale.per_process_ops due to commit:
>>> 
>>> 
>>> commit: 76742700225cad9df49f05399381ac3f1ec3dc60 ("mm: fix MADV_[FREE|DONTNEED] TLB flush miss problem")
>>> url: https://github.com/0day-ci/linux/commits/Nadav-Amit/mm-migrate-prevent-racy-access-to-tlb_flush_pending/20170802-205715
>>> 
>>> 
>>> in testcase: will-it-scale
>>> on test machine: 88 threads Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz with 64G memory
>>> with following parameters:
>>> 
>>> 	nr_task: 16
>>> 	mode: process
>>> 	test: brk1
>>> 	cpufreq_governor: performance
>>> 
>>> test-description: Will It Scale takes a testcase and runs it from 1 through to n parallel copies to see if the testcase will scale. It builds both a process and threads based test in order to see any differences between the two.
>>> test-url: https://github.com/antonblanchard/will-it-scale
>> 
>> Thanks for the report.
>> Could you explain what kinds of workload you are testing?
>> 
>> Does it call madvise(MADV_DONTNEED) frequently, in parallel on multiple
>> threads?
> 
> According to the description, it is "testcase:brk increase/decrease of one
> page". According to the mode, it spawns multiple processes, not threads.
> 
> Since a single page is unmapped each time and the iTLB-loads increase
> dramatically, I suspect that, for some reason, do_munmap() ends up
> triggering a full TLB flush.
> 
> If I find some free time, I’ll try to profile the workload - but feel free
> to beat me to it.
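
For reference, a rough sketch of the per-process loop implied by the testcase description ("brk increase/decrease of one page"). This is an approximation written from that description, not the will-it-scale source; the function name brk_one_page_loop is hypothetical. Growing the break extends the heap by one page, and shrinking it back unmaps that page again, which on this kernel goes through do_munmap(), so every iteration ends with a TLB flush for a single page.

```c
#define _DEFAULT_SOURCE        /* exposes brk()/sbrk() in glibc */
#include <assert.h>
#include <unistd.h>

/* Hypothetical approximation of a brk1-style loop (not the actual testcase).
 * Returns how many brk() calls succeeded. */
static unsigned long brk_one_page_loop(unsigned long iterations)
{
	long page_size = sysconf(_SC_PAGESIZE);
	char *base = sbrk(0);          /* current program break */
	unsigned long calls = 0;

	assert(page_size > 0);
	for (unsigned long i = 0; i < iterations; i++) {
		if (brk(base + page_size) == 0)  /* grow the heap by one page */
			calls++;
		if (brk(base) == 0)              /* shrink back: unmaps that page */
			calls++;
	}
	return calls;
}
```

Running one such loop per process, with nr_task copies in parallel, is what the mode: process setting amounts to.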

The root cause appears to be that tlb_finish_mmu() does not call
dec_tlb_flush_pending(), as it should. Any chance you can take care of it?
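
A minimal single-threaded model of why the missing decrement hurts, assuming (this is an assumption about the patch, not something stated above) that tlb_gather_mmu() increments mm->tlb_flush_pending and that tlb_finish_mmu() forces a full flush whenever more than one batch appears to be pending, i.e. when it believes another thread is batching PTE changes concurrently. All names below (mm_model, gather, finish, simulate) are hypothetical. If the counter is never decremented, it keeps growing, so from the second munmap onward every flush looks "nested" and becomes a full flush, even in a single process.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified model of mm->tlb_flush_pending; not kernel code. */
struct mm_model {
	int tlb_flush_pending;  /* batches started but not yet finished */
	int full_flushes;       /* forced flushes of the whole address space */
	int range_flushes;      /* flushes limited to the unmapped range */
};

/* Models the increment done when a TLB-flush batch is started. */
static void gather(struct mm_model *mm)
{
	mm->tlb_flush_pending++;
}

/* Models the finish step; 'balanced' selects whether the counter is dropped. */
static void finish(struct mm_model *mm, bool balanced)
{
	/* More than one pending batch is taken to mean a concurrent batcher. */
	bool force = mm->tlb_flush_pending > 1;

	if (force)
		mm->full_flushes++;
	else
		mm->range_flushes++;

	if (balanced)
		mm->tlb_flush_pending--;  /* the decrement that was missing */
	assert(mm->tlb_flush_pending >= 0);
}

/* Runs n sequential unmap-like operations within a single process. */
static struct mm_model simulate(int n, bool balanced)
{
	struct mm_model mm = { 0, 0, 0 };

	for (int i = 0; i < n; i++) {
		gather(&mm);
		finish(&mm, balanced);
	}
	return mm;
}
```

With three operations, simulate(3, false) yields one range flush and two forced full flushes, while simulate(3, true) yields three range flushes and no full flushes.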

Having said that, it appears that cpumask_any_but() is really inefficient,
since it has no optimization for the case in which
small_const_nbits(nbits) == true. When I find some free time, I’ll try to
deal with it.
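
To illustrate the kind of optimization meant: cpumask_any_but(mask, cpu) returns some CPU set in the mask other than cpu (or a value >= the number of CPUs if there is none). When the mask fits in a single word, the answer can be computed with a couple of bit operations instead of a bit-by-bit walk. The sketch below is a userspace model under that assumption; any_but_generic, any_but_one_word and BITS_PER_LONG_MODEL are hypothetical names, not kernel functions, and __builtin_ctzl is a GCC/Clang builtin.

```c
#include <assert.h>
#include <limits.h>

#define BITS_PER_LONG_MODEL (CHAR_BIT * sizeof(unsigned long))

/* Generic path: walk the bitmap bit by bit, like a for_each_cpu() loop. */
static unsigned int any_but_generic(const unsigned long *mask, unsigned int cpu,
				    unsigned int nbits)
{
	for (unsigned int i = 0; i < nbits; i++) {
		unsigned long word = mask[i / BITS_PER_LONG_MODEL];

		if (i != cpu && ((word >> (i % BITS_PER_LONG_MODEL)) & 1UL))
			return i;
	}
	return nbits;  /* no other CPU set */
}

/* Hypothetical fast path when nbits fits in one word (small_const_nbits). */
static unsigned int any_but_one_word(const unsigned long *mask, unsigned int cpu,
				     unsigned int nbits)
{
	unsigned long word = mask[0];

	assert(nbits > 0 && nbits <= BITS_PER_LONG_MODEL);
	if (nbits < BITS_PER_LONG_MODEL)
		word &= (1UL << nbits) - 1;  /* ignore bits beyond nbits */
	if (cpu < BITS_PER_LONG_MODEL)
		word &= ~(1UL << cpu);       /* exclude the given CPU */
	/* Lowest remaining set bit, found without a loop. */
	return word ? (unsigned int)__builtin_ctzl(word) : nbits;
}
```

Both functions return the lowest qualifying CPU, so they agree on every input; the one-word version simply avoids the loop when nbits is a small compile-time constant.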

Thanks,
Nadav