The patch titled
     Subject: mm, vmstat: make quiet_vmstat lighter
has been added to the -mm tree.  Its filename is
     mm-vmstat-make-quiet_vmstat-lighter.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/mm-vmstat-make-quiet_vmstat-lighter.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/mm-vmstat-make-quiet_vmstat-lighter.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/SubmitChecklist when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Michal Hocko <mhocko@xxxxxxxx>
Subject: mm, vmstat: make quiet_vmstat lighter

Mike has reported a considerable overhead of refresh_cpu_vm_stats from the
idle entry during pipe test:

    12.89%  [kernel]       [k] refresh_cpu_vm_stats.isra.12
     4.75%  [kernel]       [k] __schedule
     4.70%  [kernel]       [k] mutex_unlock
     3.14%  [kernel]       [k] __switch_to

This is caused by 0eb77e988032 ("vmstat: make vmstat_updater deferrable
again and shut down on idle") which has placed quiet_vmstat into
cpu_idle_loop.  The main reason here seems to be that the idle entry has
to walk all zones and perform atomic operations for each vmstat entry
even though there might be no per-cpu diffs.  This is pointless overhead
for _each_ idle entry.

Make sure that quiet_vmstat is as light as possible.

First of all it doesn't make any sense to do any local sync if the
current cpu is already set in cpu_stat_off because vmstat_update puts
itself there only if there is nothing to do.

Then we can check need_update which should be a cheap way to check for
potential per-cpu diffs and only then do refresh_cpu_vm_stats.

The original patch also did cancel_delayed_work which we are not doing here.
There are two reasons for that.  Firstly, cancel_delayed_work from idle
context will blow up on RT kernels (reported by Mike):

[    2.279582] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.5.0-rt3 #7
[    2.280444] Hardware name: MEDION MS-7848/MS-7848, BIOS M7848W08.20C 09/23/2013
[    2.281316]  ffff88040b00d640 ffff88040b01fe10 ffffffff812d20e2 0000000000000000
[    2.282202]  ffff88040b01fe30 ffffffff81081095 ffff88041ec4cee0 ffff88041ec501e0
[    2.283073]  ffff88040b01fe48 ffffffff815ff910 ffff88041ec4cee0 ffff88040b01fe88
[    2.283941] Call Trace:
[    2.284797]  [<ffffffff812d20e2>] dump_stack+0x49/0x67
[    2.285658]  [<ffffffff81081095>] ___might_sleep+0xf5/0x180
[    2.286521]  [<ffffffff815ff910>] rt_spin_lock+0x20/0x50
[    2.287382]  [<ffffffff81075919>] try_to_grab_pending+0x69/0x240
[    2.288239]  [<ffffffff81075b16>] cancel_delayed_work+0x26/0xe0
[    2.289094]  [<ffffffff8115ec05>] quiet_vmstat+0x75/0xa0
[    2.289949]  [<ffffffff8109ab38>] cpu_idle_loop+0x38/0x3e0
[    2.290800]  [<ffffffff8109aef3>] cpu_startup_entry+0x13/0x20
[    2.291647]  [<ffffffff81036164>] start_secondary+0x114/0x140

And secondly, even on !RT kernels it might add some non-trivial overhead
which is not necessary.  Even if the vmstat worker wakes up and preempts
idle, it will most likely be a single-shot noop because the stats were
already synced, and so it would end up on cpu_stat_off anyway.  We just
need to teach both vmstat_shepherd and vmstat_update to stop scheduling
the worker if there is nothing to do.

Signed-off-by: Michal Hocko <mhocko@xxxxxxxx>
Reported-by: Mike Galbraith <umgwanakikbuti@xxxxxxxxx>
Acked-by: Christoph Lameter <cl@xxxxxxxxx>
Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
---

 mm/vmstat.c |   64 ++++++++++++++++++++++++++++++++++----------------
 1 file changed, 44 insertions(+), 20 deletions(-)

diff -puN mm/vmstat.c~mm-vmstat-make-quiet_vmstat-lighter mm/vmstat.c
--- a/mm/vmstat.c~mm-vmstat-make-quiet_vmstat-lighter
+++ a/mm/vmstat.c
@@ -1396,10 +1396,15 @@ static void vmstat_update(struct work_st
 		 * Counters were updated so we expect more updates
 		 * to occur in the future. Keep on running the
 		 * update worker thread.
+		 * If we were marked on cpu_stat_off clear the flag
+		 * so that vmstat_shepherd doesn't schedule us again.
 		 */
-		queue_delayed_work_on(smp_processor_id(), vmstat_wq,
-			this_cpu_ptr(&vmstat_work),
-			round_jiffies_relative(sysctl_stat_interval));
+		if (!cpumask_test_and_clear_cpu(smp_processor_id(),
+						cpu_stat_off)) {
+			queue_delayed_work_on(smp_processor_id(), vmstat_wq,
+				this_cpu_ptr(&vmstat_work),
+				round_jiffies_relative(sysctl_stat_interval));
+		}
 	} else {
 		/*
 		 * We did not update any counters so the app may be in
@@ -1417,18 +1422,6 @@ static void vmstat_update(struct work_st
  * until the diffs stay at zero. The function is used by NOHZ and can only be
  * invoked when tick processing is not active.
  */
-void quiet_vmstat(void)
-{
-	if (system_state != SYSTEM_RUNNING)
-		return;
-
-	do {
-		if (!cpumask_test_and_set_cpu(smp_processor_id(), cpu_stat_off))
-			cancel_delayed_work(this_cpu_ptr(&vmstat_work));
-
-	} while (refresh_cpu_vm_stats(false));
-}
-
 /*
  * Check if the diffs for a certain cpu indicate that
  * an update is needed.
@@ -1452,6 +1445,30 @@ static bool need_update(int cpu)
 	return false;
 }
 
+void quiet_vmstat(void)
+{
+	if (system_state != SYSTEM_RUNNING)
+		return;
+
+	/*
+	 * If we are already in hands of the shepherd then there
+	 * is nothing for us to do here.
+	 */
+	if (cpumask_test_and_set_cpu(smp_processor_id(), cpu_stat_off))
+		return;
+
+	if (!need_update(smp_processor_id()))
+		return;
+
+	/*
+	 * Just refresh counters and do not care about the pending delayed
+	 * vmstat_update. It doesn't fire that often to matter and canceling
+	 * it would be too expensive from this path.
+	 * vmstat_shepherd will take care about that for us.
+	 */
+	refresh_cpu_vm_stats(false);
+}
+
 
 /*
  * Shepherd worker thread that checks the
@@ -1470,11 +1487,18 @@ static void vmstat_shepherd(struct work_
 	get_online_cpus();
 	/* Check processors whose vmstat worker threads have been disabled */
 	for_each_cpu(cpu, cpu_stat_off)
-		if (need_update(cpu) &&
-			cpumask_test_and_clear_cpu(cpu, cpu_stat_off))
-
-			queue_delayed_work_on(cpu, vmstat_wq,
-				&per_cpu(vmstat_work, cpu), 0);
+		if (need_update(cpu)) {
+			if (cpumask_test_and_clear_cpu(cpu, cpu_stat_off))
+				queue_delayed_work_on(cpu, vmstat_wq,
+					&per_cpu(vmstat_work, cpu), 0);
+		} else {
+			/*
+			 * Cancel the work if quiet_vmstat has put this
+			 * cpu on cpu_stat_off because the work item might
+			 * be still scheduled
+			 */
+			cancel_delayed_work(this_cpu_ptr(&vmstat_work));
+		}
 
 	put_online_cpus();
 
_

Patches currently in -mm which might be from mhocko@xxxxxxxx are

mm-vmstat-make-quiet_vmstat-lighter.patch
vmstat-make-vmstat_update-deferrable.patch
mm-oom-rework-oom-detection.patch
mm-throttle-on-io-only-when-there-are-too-many-dirty-and-writeback-pages.patch
mm-use-watermak-checks-for-__gfp_repeat-high-order-allocations.patch
mm-oom-introduce-oom-reaper.patch
mm-oom-introduce-oom-reaper-v4.patch
oom-reaper-handle-anonymous-mlocked-pages.patch
oom-clear-tif_memdie-after-oom_reaper-managed-to-unmap-the-address-space.patch

--
To unsubscribe from this list: send the line "unsubscribe mm-commits" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html