Re: [linux-next:master] [memcg] 70a64b7919: will-it-scale.per_process_ops -11.9% regression

Shakeel Butt <shakeel.butt@xxxxxxxxx> · Mon, 27 May 2024 23:30:38 -0700

On Fri, May 24, 2024 at 11:06:54AM GMT, Shakeel Butt wrote:
> On Fri, May 24, 2024 at 03:45:54PM +0800, Oliver Sang wrote:
[...]
> I will re-run my experiments on linus tree and report back.

I am not able to reproduce the regression with the fix I have proposed,
at least on my 1-node 52-CPU (Cooper Lake) and 2-node 80-CPU (Skylake)
machines. More details below:

Setup instructions:
-------------------
mount -t tmpfs tmpfs /tmp
mkdir -p /sys/fs/cgroup/A
mkdir -p /sys/fs/cgroup/A/B
mkdir -p /sys/fs/cgroup/A/B/C
echo +memory > /sys/fs/cgroup/A/cgroup.subtree_control
echo +memory > /sys/fs/cgroup/A/B/cgroup.subtree_control
echo $$ > /sys/fs/cgroup/A/B/C/cgroup.procs

The base case (commit a4c43b8a0980):
------------------------------------
$ python3 ./runtest.py page_fault2 295 process 0 0 52
tasks,processes,processes_idle,threads,threads_idle,linear
0,0,100,0,100,0
52,2796769,0.03,0,0.00,0

$ python3 ./runtest.py page_fault2 295 process 0 0 80
tasks,processes,processes_idle,threads,threads_idle,linear
0,0,100,0,100,0
80,6755010,0.04,0,0.00,0

The regressing series (last commit a94032b35e5f):
-------------------------------------------------
$ python3 ./runtest.py page_fault2 295 process 0 0 52
tasks,processes,processes_idle,threads,threads_idle,linear
0,0,100,0,100,0
52,2684859,0.03,0,0.00,0

$ python3 ./runtest.py page_fault2 295 process 0 0 80
tasks,processes,processes_idle,threads,threads_idle,linear
0,0,100,0,100,0
80,6010438,0.13,0,0.00,0

The fix on top of regressing series:
------------------------------------
$ python3 ./runtest.py page_fault2 295 process 0 0 52
tasks,processes,processes_idle,threads,threads_idle,linear
0,0,100,0,100,0
52,3812133,0.02,0,0.00,0

$ python3 ./runtest.py page_fault2 295 process 0 0 80
tasks,processes,processes_idle,threads,threads_idle,linear
0,0,100,0,100,0
80,7979893,0.15,0,0.00,0

As you can see, the fix improves performance over the base, at least
for me. I can only speculate that either the difference in hardware is
giving us different results (you have newer CPUs) or there is still a
disparity in experiment setup/environment between us.
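
To put the numbers above side by side, here is a small sketch that
computes the relative change against the base for each run. The
throughput values are copied from the "processes" column above; the
names (results, pct_change, summarize) are only for illustration and
are not part of will-it-scale.

```python
# Throughput ("processes" column) copied from the runs above,
# keyed by task count.
results = {
    52: {"base": 2796769, "regressing": 2684859, "fix": 3812133},
    80: {"base": 6755010, "regressing": 6010438, "fix": 7979893},
}


def pct_change(new, base):
    """Relative change of `new` versus `base`, in percent."""
    return (new - base) / base * 100.0


def summarize(results):
    """Map each task count to the % change of each kernel vs. the base."""
    return {
        tasks: {
            name: pct_change(run[name], run["base"])
            for name in ("regressing", "fix")
        }
        for tasks, run in results.items()
    }


if __name__ == "__main__":
    for tasks, row in summarize(results).items():
        print(f"{tasks} tasks: regressing {row['regressing']:+.1f}%, "
              f"fix {row['fix']:+.1f}% vs base")
```

On these numbers that comes out to roughly -4.0% (regressing) and
+36.3% (fix) on the 52 CPU machine, and -11.0% and +18.1% on the
80 CPU machine.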

Are you disabling hyperthreading? Are the prefetching heuristics
different on your systems?

Regarding the test environment, can you check my setup instructions
above and see if I am doing something wrong or different?
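
One way to compare environments is to check, from the same shell that
launches runtest.py, which cgroup the benchmark actually runs in. The
sketch below assumes a cgroup v2 (unified) hierarchy mounted at
/sys/fs/cgroup, where /proc/<pid>/cgroup has a line of the form
"0::<path>". The helper names are hypothetical.

```python
import os


def cgroup_v2_path(proc_cgroup_text):
    """Return the cgroup v2 path from /proc/<pid>/cgroup text, or None."""
    for line in proc_cgroup_text.splitlines():
        parts = line.split(":", 2)
        if len(parts) != 3:
            continue  # skip malformed or empty lines
        hierarchy_id, controllers, path = parts
        # The cgroup v2 entry has hierarchy ID 0 and an empty controller list.
        if hierarchy_id == "0" and controllers == "":
            return path
    return None


def memory_enabled(subtree_control_text):
    """True if 'memory' is listed in a cgroup.subtree_control file."""
    return "memory" in subtree_control_text.split()


if __name__ == "__main__":
    # Paths below assume the setup instructions in this mail.
    proc_file = "/proc/self/cgroup"
    control_file = "/sys/fs/cgroup/A/B/cgroup.subtree_control"
    if os.path.exists(proc_file):
        with open(proc_file) as f:
            print("current cgroup:", cgroup_v2_path(f.read()))
    if os.path.exists(control_file):
        with open(control_file) as f:
            print("memory enabled under A/B:", memory_enabled(f.read()))
```

With my setup I would expect this to report /A/B/C as the current
cgroup, since child processes inherit the cgroup of the shell.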

At the moment, I am inclined towards asking Andrew to include my fix in
the following 6.10-rc* but keep this report open, so we can continue to
improve. Let me know if you have concerns.

thanks,
Shakeel