On Fri, May 24, 2024 at 11:06:54AM GMT, Shakeel Butt wrote:
> On Fri, May 24, 2024 at 03:45:54PM +0800, Oliver Sang wrote:
[...]
> I will re-run my experiments on linus tree and report back.

I am not able to reproduce the regression with the fix I have proposed,
at least on my 1 node 52 CPUs (Cooper Lake) and 2 node 80 CPUs (Skylake)
machines. Let me give more details below:

Setup instructions:
-------------------
mount -t tmpfs tmpfs /tmp
mkdir -p /sys/fs/cgroup/A
mkdir -p /sys/fs/cgroup/A/B
mkdir -p /sys/fs/cgroup/A/B/C
echo +memory > /sys/fs/cgroup/A/cgroup.subtree_control
echo +memory > /sys/fs/cgroup/A/B/cgroup.subtree_control
echo $$ > /sys/fs/cgroup/A/B/C/cgroup.procs

The base case (commit a4c43b8a0980):
------------------------------------
$ python3 ./runtest.py page_fault2 295 process 0 0 52
tasks,processes,processes_idle,threads,threads_idle,linear
0,0,100,0,100,0
52,2796769,0.03,0,0.00,0

$ python3 ./runtest.py page_fault2 295 process 0 0 80
tasks,processes,processes_idle,threads,threads_idle,linear
0,0,100,0,100,0
80,6755010,0.04,0,0.00,0

The regressing series (last commit a94032b35e5f)
------------------------------------------------
$ python3 ./runtest.py page_fault2 295 process 0 0 52
tasks,processes,processes_idle,threads,threads_idle,linear
0,0,100,0,100,0
52,2684859,0.03,0,0.00,0

$ python3 ./runtest.py page_fault2 295 process 0 0 80
tasks,processes,processes_idle,threads,threads_idle,linear
0,0,100,0,100,0
80,6010438,0.13,0,0.00,0

The fix on top of regressing series:
------------------------------------
$ python3 ./runtest.py page_fault2 295 process 0 0 52
tasks,processes,processes_idle,threads,threads_idle,linear
0,0,100,0,100,0
52,3812133,0.02,0,0.00,0

$ python3 ./runtest.py page_fault2 295 process 0 0 80
tasks,processes,processes_idle,threads,threads_idle,linear
0,0,100,0,100,0
80,7979893,0.15,0,0.00,0

As you can see, the fix is improving the performance over the base, at
least for me.

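To put numbers on that, the relative change against the base can be
computed from the "processes" column of the runs above. This is only a
minimal sketch: the helper names (pct_change, summarize) are illustrative
and not part of runtest.py, and it assumes the "processes" column is the
throughput figure being compared.

```python
# Throughput values ("processes" column) copied from the runs above,
# keyed by task count. The dictionary layout is an illustrative choice.
results = {
    52: {"base": 2796769, "regressing": 2684859, "fix": 3812133},
    80: {"base": 6755010, "regressing": 6010438, "fix": 7979893},
}


def pct_change(new, base):
    """Relative change of `new` versus `base`, in percent."""
    return (new - base) / base * 100.0


def summarize(results):
    """Return (tasks, regressing %, fix %) tuples, sorted by task count."""
    rows = []
    for tasks, r in sorted(results.items()):
        rows.append((
            tasks,
            pct_change(r["regressing"], r["base"]),
            pct_change(r["fix"], r["base"]),
        ))
    return rows


if __name__ == "__main__":
    for tasks, reg, fix in summarize(results):
        print(f"{tasks} tasks: regressing {reg:+.1f}%, fix {fix:+.1f}% vs. base")
```

With the numbers above this works out to roughly -4.0% (regressing) and
+36.3% (fix) at 52 tasks, and -11.0% and +18.1% at 80 tasks.
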
I can only speculate that either the difference in hardware is giving us
different results (you have newer CPUs) or there is still a disparity in
experiment setup/environment between us. Are you disabling
hyperthreading? Are the prefetching heuristics different on your
systems? Regarding the test environment, can you check my setup
instructions above and see if I am doing something wrong or different?

At the moment, I am inclined towards asking Andrew to include my fix in
an upcoming 6.10-rc* but keep this report open, so that we can continue
to improve. Let me know if you have concerns.

thanks,
Shakeel