Hi,

On 11/19/18 1:19 PM, Shakeel Butt wrote:
> On Sun, Nov 18, 2018 at 3:27 PM Jiang Biao <jiangbiao@xxxxxxxxxxxxxxxxx> wrote:
>>
>> commit a983b5ebee57 ("mm: memcontrol: fix excessive complexity in
>> memory.stat reporting") introduced an 8%+ performance regression for
>> page_fault3 of the will-it-scale benchmark:
>>
>> Before commit a983b5ebee57,
>> #./runtest.py page_fault3
>> tasks,processes,processes_idle,threads,threads_idle,linear
>> 0,0,100,0,100,0
>> 1,729990,95.68,725437,95.66,725437        (single process)
>> ...
>> 24,11476599,0.18,2185947,32.67,17410488   (24 processes for 24 cores)
>>
>> After the commit,
>> #./runtest.py page_fault3
>> tasks,processes,processes_idle,threads,threads_idle,linear
>> 0,0,100,0,100,0
>> 1,697310,95.61,703615,95.66,703615        (-4.48%)
>> ...
>> 24,10485783,0.20,2047735,35.99,16886760   (-8.63%)
>>
>> To get the will-it-scale benchmark and run page_fault3:
>> # git clone https://github.com/antonblanchard/will-it-scale.git
>> # cd will-it-scale/
>> # ./runtest.py page_fault3
>>
>> Two factors affect the performance:
>> 1. CHARGE_BATCH is too small, which causes heavy contention when
>>    charging the global stats/events.
>> 2. Interrupts are disabled in count_memcg_events/mod_memcg_stat.
>>
>> This patch increases CHARGE_BATCH to 256 to ease the contention, and
>> narrows the scope of interrupt disabling (only when x > CHARGE_BATCH)
>> when charging the global stats/events, taking the percpu counter
>> implementation as reference.
>>
>> The patch fixes the performance regression:
>> #./runtest.py page_fault3
>> tasks,processes,processes_idle,threads,threads_idle,linear
>> 0,0,100,0,100,0
>> 1,729975,95.64,730302,95.68,730302
>> 24,11441125,0.07,2100586,31.08,17527248
>>
>> Signed-off-by: Jiang Biao <jiangbiao@xxxxxxxxxxxxxxxxx>
>> ---
>>  include/linux/memcontrol.h | 28 +++++++++++++---------------
>>  1 file changed, 13 insertions(+), 15 deletions(-)
>>
>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>> index db3e6bb..7546774 100644
>> --- a/include/linux/memcontrol.h
>> +++ b/include/linux/memcontrol.h
>> @@ -595,7 +595,7 @@ static inline void set_task_memcg_oom_skip(struct task_struct *p)
>>   * size of first charge trial. "32" comes from vmscan.c's magic value.
>>   * TODO: maybe necessary to use big numbers in big irons.
>>   */
>> -#define CHARGE_BATCH 32U
>> +#define CHARGE_BATCH 256U
>
> I am already not really happy with '32U' as it has introduced potentially
> 100s of MiB discrepancy in the stats on large machines. On a 100 cpu and
> 64 KiB page machine, the error in the stats can be up to
> (32 * 65536 * 100) bytes.

Indeed, it makes that case worse.
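
Just to put rough numbers on it (a back-of-the-envelope sketch in plain
userspace C, nothing from the patch itself; it only reuses your
100-cpu / 64 KiB example): the worst-case drift is roughly
CHARGE_BATCH * page_size * nr_cpus, since every CPU can sit on one full
batch in its per-cpu counter, so raising the batch from 32 to 256 scales
that bound by 8x:

/*
 * Rough worst-case drift: each CPU can keep up to one CHARGE_BATCH worth
 * of pages in its per-cpu counter before flushing to the global atomic.
 */
#include <stdio.h>

int main(void)
{
        const unsigned long nr_cpus = 100;
        const unsigned long page_size = 64 * 1024;
        const unsigned long batches[] = { 32, 256 }; /* old vs. proposed CHARGE_BATCH */

        for (int i = 0; i < 2; i++) {
                unsigned long bytes = batches[i] * page_size * nr_cpus;
                printf("batch %3lu -> up to %10lu bytes (~%lu MiB) of drift\n",
                       batches[i], bytes, bytes >> 20);
        }
        return 0;
}

So on that machine the possible error grows from roughly 200 MiB to
roughly 1.6 GiB, which is why the stat accuracy side probably still wants
a separate fix.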

> I was thinking of piggybacking stat_refresh on syncing the per-cpu memcg
> stats but still couldn't figure out the right way to do that (i.e. not
> traversing the memcg tree on each cpu in parallel). Also can you please
> rebase your patch over the latest mm or linus tree?

Sorry for that. I'll rebase after the discussion in this thread, and maybe
find a better way to fix the regression.

>>
>>  static inline void __count_memcg_events(struct mem_cgroup *memcg,
>>                                          enum mem_cgroup_events_index idx,
>> @@ -608,22 +608,21 @@ static inline void __count_memcg_events(struct mem_cgroup *memcg,
>>         x = count + __this_cpu_read(memcg->stat->events[idx]);
>>         if (unlikely(x > CHARGE_BATCH)) {
>> +               unsigned long flags;
>> +               local_irq_save(flags);
>
> Does this fine grained interrupt disable really help?

Yes, especially for the single-thread scenario (it brings a 4%+ performance
improvement for page_fault3).
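
A rough standalone sketch to illustrate the point (plain userspace C of my
own; pcp, global and BATCH are just stand-ins, not the kernel symbols):
after the patch the common case is a plain per-cpu add, and only about one
call in every CHARGE_BATCH reaches the slow path where interrupts actually
get disabled, whereas count_memcg_events used to toggle interrupts on every
single call:

#include <stdio.h>

#define BATCH 256UL

static unsigned long pcp;       /* stands in for the __this_cpu counter */
static unsigned long global;    /* stands in for the atomic_long_t */
static unsigned long slow_hits; /* how often irqs would be disabled */

static void count_event(unsigned long count)
{
        unsigned long x = pcp + count;

        if (x > BATCH) {
                /* slow path: the only place local_irq_save() would run now */
                slow_hits++;
                global += x;            /* atomic_long_add() in the patch */
                pcp -= x - count;       /* __this_cpu_sub(), keeps concurrent adds */
        } else {
                /* fast path: plain per-cpu add, no interrupt toggling at all */
                pcp += count;
        }
}

int main(void)
{
        for (int i = 0; i < 1000000; i++)
                count_event(1);
        printf("1000000 events, %lu slow-path flushes, %lu accounted\n",
               slow_hits, global + pcp);
        return 0;
}

For the million events above only a few thousand flushes hit the slow path,
so the irq save/restore cost mostly disappears from the fast path.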

> Also don't you need to reevaluate x after disabling interrupt?

That's why we use 'x - count': it keeps the per-cpu counter consistent with
updates made from interrupt context, taking __percpu_counter_add as
reference:

void __percpu_counter_add(struct percpu_counter *fbc, s64 amount, s32 batch)
{
        s64 count;

        preempt_disable();
        count = __this_cpu_read(*fbc->counters) + amount;
        if (count >= batch || count <= -batch) {
                unsigned long flags;
                raw_spin_lock_irqsave(&fbc->lock, flags);
                fbc->count += count;
                /* Here to sync with concurrent modification in interrupt context */
                __this_cpu_sub(*fbc->counters, count - amount);
                raw_spin_unlock_irqrestore(&fbc->lock, flags);
        } else {
                this_cpu_add(*fbc->counters, amount);
        }
        preempt_enable();
}

Also, in this case, we should change memcg->stat->events from
'unsigned long' to 'long', which is missed in this patch.
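
As a standalone illustration of the window 'x - count' protects against
(again plain userspace C of my own, not kernel code): x is sampled before
interrupts are disabled, so an interrupt can add to the per-cpu counter in
that window. Writing 0 back is only safe when the whole sequence runs with
interrupts off, as before the patch; subtracting 'x - count' removes just
what was flushed and lets the interrupt's contribution survive:

#include <stdio.h>

static long pcp;        /* models the per-cpu counter */
static long global;     /* models the atomic_long_t */

/* old-style reset: only safe if irqs were disabled around the read too */
static void flush_write_zero(long count, long irq_delta)
{
        long x = pcp + count;   /* sampled before "interrupts" are disabled */

        pcp += irq_delta;       /* an interrupt counts events right here */
        global += x;
        pcp = 0;                /* the interrupt's events are lost */
}

/* patched style: flush and subtract only what was actually flushed */
static void flush_sub_remainder(long count, long irq_delta)
{
        long x = pcp + count;

        pcp += irq_delta;       /* same interrupt, same window */
        global += x;
        pcp -= x - count;       /* irq_delta survives in the per-cpu counter */
}

int main(void)
{
        pcp = 300; global = 0;
        flush_write_zero(1, 5);         /* 300 + 1 + 5 = 306 events total */
        printf("write zero:    %ld of 306 accounted\n", global + pcp);

        pcp = 300; global = 0;
        flush_sub_remainder(1, 5);
        printf("sub remainder: %ld of 306 accounted\n", global + pcp);
        return 0;
}

The first variant loses the five events counted from "interrupt context";
the second accounts for all 306.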

>>                 atomic_long_add(x, &memcg->events[idx]);
>> -               x = 0;
>> +               __this_cpu_sub(memcg->stat->events[idx], x - count);
>
> Why 'x - count'?
>
>> +               local_irq_restore(flags);
>> +       } else {
>> +               this_cpu_add(memcg->stat->events[idx], count);
>>         }
>> -
>> -       __this_cpu_write(memcg->stat->events[idx], x);
>>  }
>>
>>  static inline void count_memcg_events(struct mem_cgroup *memcg,
>>                                        enum mem_cgroup_events_index idx,
>>                                        unsigned long count)
>>  {
>> -       unsigned long flags;
>> -
>> -       local_irq_save(flags);
>>         __count_memcg_events(memcg, idx, count);
>> -       local_irq_restore(flags);
>>  }
>>
>>  static inline void
>> @@ -698,25 +697,24 @@ static inline void __mod_memcg_stat(struct mem_cgroup *memcg,
>>         if (mem_cgroup_disabled())
>>                 return;
>> -
>>         if (memcg) {
>>                 x = val + __this_cpu_read(memcg->stat->count[idx]);
>>                 if (unlikely(abs(x) > CHARGE_BATCH)) {
>> +                       unsigned long flags;
>> +                       local_irq_save(flags);
>>                         atomic_long_add(x, &memcg->stats[idx]);
>> -                       x = 0;
>> +                       __this_cpu_sub(memcg->stat->count[idx], x - val);
>> +                       local_irq_restore(flags);
>> +               } else {
>> +                       this_cpu_add(memcg->stat->count[idx], val);
>>                 }
>> -               __this_cpu_write(memcg->stat->count[idx], x);
>>         }
>>  }
>>
>>  static inline void mod_memcg_stat(struct mem_cgroup *memcg,
>>                                    enum mem_cgroup_stat_index idx, int val)
>>  {
>> -       unsigned long flags;
>> -
>> -       local_irq_save(flags);
>>         __mod_memcg_stat(memcg, idx, val);
>> -       local_irq_restore(flags);
>>  }
>
> thanks,
> Shakeel