Re: [PATCH] mm: convert mm's rss stats into percpu_counter

Shakeel Butt <shakeelb@xxxxxxxxxx> · Thu, 8 Jun 2023 17:37:00 +0000

On Thu, Jun 08, 2023 at 01:14:08PM +0200, Jan Kara wrote:
[...]
> 
> Somewhat late to the game but our performance testing grid has noticed this
> commit causes a performance regression on shell-heavy workloads. For
> example running 'make test' in git sources on our test machine with 192
> CPUs takes about 4% longer, system time is increased by about 9%:
> 
>                        before (9cd6ffa6025)  after (f1a7941243c1)
> Amean     User         471.12 *   0.30%*     481.77 *  -1.96%*
> Amean     System       244.47 *   0.90%*     269.13 *  -9.09%*
> Amean     Elapsed      709.22 *   0.45%*     742.27 *  -4.19%*
> Amean     CPU          100.00 (   0.20%)     101.00 *  -0.80%*
> 
> Essentially this workload spawns in sequence a lot of short-lived tasks and
> the task startup + teardown cost is what this patch increases. To
> demonstrate this more clearly, I've written a trivial (and somewhat stupid)
> benchmark, shell_bench.sh:
> 
> for (( i = 0; i < 20000; i++ )); do
> 	/bin/true
> done
> 
> And when run like:
> 
> numactl -C 1 ./shell_bench.sh
> 
> (I've forced physical CPU binding to avoid task migrating over the machine
> and cpu frequency scaling interfering which makes the numbers much more
> noisy) I get the following elapsed times:
> 
>          9cd6ffa6025    f1a7941243c1
> Avg      6.807429       7.631571
> Stddev   0.021797       0.016483
> 
> So some 12% regression in elapsed time. Just to be sure I've verified that
> per-cpu allocator patch [1] does not improve these numbers in any
> significant way.
> 
> Where do we go from here? I think in principle the problem could be fixed
> by being clever and when the task has only a single thread, we don't bother
> with allocating pcpu counter (and summing it at the end) and just account
> directly in mm_struct. When the second thread is spawned, we bite the
> bullet, allocate pcpu counter and start with more scalable accounting.
> These shortlived tasks in shell workloads or similar don't spawn any
> threads so this should fix the regression. But this is obviously easier
> said than done...
> 

Thanks, Jan, for the report. I wanted to improve the percpu allocation to
eliminate this regression, as it was reported by the Intel test bot as
well. However, your suggestion seems targeted and reasonable too. I am
travelling at the moment, so I am not sure when I will get to this. Do you
want to take a stab at it, or do you want me to do it? Also, how urgent and
sensitive is this regression for you?
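
To make sure we mean the same thing, here is a rough userspace sketch of
the scheme as I understand it: account in a plain counter while the mm has
a single user, and allocate the per-CPU array only when a second thread
shows up. All names here (struct rss_counter, the rss_*() helpers,
NR_CPUS_SKETCH) are hypothetical and made up for illustration; this is not
the mm_struct or percpu_counter API, and it ignores the locking needed to
switch modes safely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical CPU count for the sketch; the kernel would use its own. */
#define NR_CPUS_SKETCH 4

/* Hypothetical counter: plain value first, per-CPU deltas after upgrade. */
struct rss_counter {
	long count;	/* exact value while single-threaded, base value later */
	long *pcpu;	/* NULL until a second thread is spawned */
};

static void rss_init(struct rss_counter *c)
{
	c->count = 0;
	c->pcpu = NULL;		/* no allocation on the fast startup path */
}

/* Called when a second thread is spawned: switch to per-CPU accounting. */
static int rss_upgrade(struct rss_counter *c)
{
	if (c->pcpu)
		return 0;	/* already upgraded */
	c->pcpu = calloc(NR_CPUS_SKETCH, sizeof(*c->pcpu));
	return c->pcpu ? 0 : -1;
}

static void rss_add(struct rss_counter *c, int cpu, long delta)
{
	assert(cpu >= 0 && cpu < NR_CPUS_SKETCH);
	if (c->pcpu)
		c->pcpu[cpu] += delta;	/* scalable path, one slot per CPU */
	else
		c->count += delta;	/* cheap path for single-threaded tasks */
}

/* Total is the base value plus all per-CPU deltas (if any exist). */
static long rss_sum(const struct rss_counter *c)
{
	long sum = c->count;

	if (c->pcpu)
		for (int i = 0; i < NR_CPUS_SKETCH; i++)
			sum += c->pcpu[i];
	return sum;
}

static void rss_destroy(struct rss_counter *c)
{
	free(c->pcpu);		/* free(NULL) is a no-op for never-upgraded tasks */
	c->pcpu = NULL;
}
```

With this shape, a task that never spawns a thread never pays for the
per-CPU allocation or for the summing at teardown, which is the cost the
shell workloads are hitting. The hard part, which the sketch skips, is
making the upgrade safe against concurrent updaters.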

thanks,
Shakeel