Re: [RFC PATCH v2 2/2] mm: convert mm's rss stats to use atomic mode

Kairui Song <ryncsn@xxxxxxxxx> · Thu, 16 May 2024 19:50:52 +0800

On Fri, Apr 19, 2024 at 11:32 AM zhangpeng (AS) <zhangpeng362@xxxxxxxxxx> wrote:
> On 2024/4/19 10:30, Rongwei Wang wrote:
> > On 2024/4/18 22:20, Peng Zhang wrote:
> >> From: ZhangPeng <zhangpeng362@xxxxxxxxxx>
> >>
> >> Since commit f1a7941243c1 ("mm: convert mm's rss stats into
> >> percpu_counter"), the rss_stats have converted into percpu_counter,
> >> which convert the error margin from (nr_threads * 64) to approximately
> >> (nr_cpus ^ 2). However, the new percpu allocation in mm_init() causes a
> >> performance regression on fork/exec/shell. Even after commit
> >> 14ef95be6f55
> >> ("kernel/fork: group allocation/free of per-cpu counters for mm
> >> struct"),
> >> the performance of fork/exec/shell is still poor compared to previous
> >> kernel versions.
> >>
> >> To mitigate performance regression, we delay the allocation of percpu
> >> memory for rss_stats. Therefore, we convert mm's rss stats to use
> >> percpu_counter atomic mode. For single-thread processes, rss_stat is in
> >> atomic mode, which reduces the memory consumption and performance
> >> regression caused by using percpu. For multiple-thread processes,
> >> rss_stat is switched to the percpu mode to reduce the error margin.
> >> We convert rss_stats from atomic mode to percpu mode only when the
> >> second thread is created.
> > Hi, Zhang Peng
> >
> > This regression we also found it in lmbench these days. I have not
> > test your patch, but it seems will solve a lot for it.
> > And I see this patch not fix the regression in multi-threads, that's
> > because of the rss_stat switched to percpu mode?
> > (If I'm wrong, please correct me.) And It seems percpu_counter also
> > has a bad effect in exit_mmap().
> >
> > If so, I'm wondering if we can further improving it on the exit_mmap()
> > path in multi-threads scenario, e.g. to determine which CPUs the
> > process has run on (mm_cpumask()? I'm not sure).
> >
> Hi, Rongwei,
>
> Yes, this patch only fixes the regression in single-thread processes. How
> much bad effect does percpu_counter have in exit_mmap()? IMHO, the addition
> of mm counter is already in batch mode, maybe I miss something?
>

Hi, Peng Zhang, Rongwei, and all:

I've a patch series that is earlier than commit f1a7941243c1 ("mm:
convert mm's rss stats into
percpu_counter"):

https://lwn.net/ml/linux-kernel/20220728204511.56348-1-ryncsn@xxxxxxxxx/

Instead of a per-mm-per-cpu cache, it used only one global per-cpu
cache, and flush it on schedule. Or, if the arch supports, flush and
fetch it use mm bitmap as an optimization (like tlb shootdown).

Unfortunately it didn't get much attention and I moved to work on other things.
I also noticed the fork regression issue, so I did a local rebase of
my previous patch, and revert f1a7941243c1.

The result is looking good, on my 32 core VM machine, I see similar
improvement as the one you posted (alloc/free on fork/exit is gone), I
also see minor improvement with database tests, memory usage is lower
by a little bit too (no more per-mm cache), and I think the error
margin in my patch should be close to zero.

I hope I can get some attention here for my idea...