On 2024-10-02 17:58, Jens Axboe wrote:
On 10/2/24 9:53 AM, Mathieu Desnoyers wrote:
On 2024-10-02 17:36, Mathieu Desnoyers wrote:
On 2024-10-02 17:33, Matthew Wilcox wrote:
On Wed, Oct 02, 2024 at 11:26:27AM -0400, Mathieu Desnoyers wrote:
On 2024-10-02 16:09, Paul E. McKenney wrote:
On Tue, Oct 01, 2024 at 09:02:01PM -0400, Mathieu Desnoyers wrote:
Hazard pointers appear to be a good fit for replacing refcount-based lazy
active mm tracking.
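(For readers less familiar with the scheme, here is a minimal userspace
sketch of the hazard-pointer idea in C11. It is not the lazy-mm
implementation from this series; the slot count and the
hp_acquire/hp_release/hp_object_in_use names are illustrative assumptions.
A reader publishes the pointer it is about to dereference in a slot it
owns, and a reclaimer may free an object only once no slot holds it.)

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_HP_SLOTS 128	/* illustrative: one slot per CPU/thread */

static _Atomic(void *) hazard_slot[NR_HP_SLOTS];

/* Reader: publish the pointer, then re-check that it is still current. */
static void *hp_acquire(int slot, _Atomic(void *) *src)
{
	void *p;

	for (;;) {
		p = atomic_load(src);
		atomic_store(&hazard_slot[slot], p);
		/* If *src changed meanwhile, the published value may be stale: retry. */
		if (atomic_load(src) == p)
			return p;	/* protected until hp_release() */
	}
}

/* Reader: done with the object; clearing the slot is a purely local store. */
static void hp_release(int slot)
{
	atomic_store(&hazard_slot[slot], NULL);
}

/* Reclaimer: the object may be freed only if no reader has it published. */
static bool hp_object_in_use(const void *obj)
{
	for (int i = 0; i < NR_HP_SLOTS; i++)
		if (atomic_load(&hazard_slot[i]) == obj)
			return true;
	return false;
}

The scalability angle is that the read side only stores to its own slot,
whereas a shared refcount bounces one cache line between every CPU doing
a context switch.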
Highlight:
will-it-scale context_switch1_threads

nr threads (-t)    speedup
             24        +3%
             48       +12%
             96       +21%
            192       +28%
Impressive!!!
I have to ask... Any data for smaller numbers of CPUs?
Sure, but they are far less exciting ;-)
How many CPUs in the system under test?
2 sockets, 96-core per socket:
CPU(s): 384
On-line CPU(s) list: 0-383
Vendor ID: AuthenticAMD
Model name: AMD EPYC 9654 96-Core Processor
CPU family: 25
Model: 17
Thread(s) per core: 2
Core(s) per socket: 96
Socket(s): 2
Stepping: 1
Frequency boost: enabled
CPU(s) scaling MHz: 68%
CPU max MHz: 3709.0000
CPU min MHz: 400.0000
BogoMIPS: 4800.00
Note that Jens Axboe got even more impressive speedups testing this
on his 512-hw-thread EPYC [1] (390% speedup for 192 threads). I noticed
that I had schedstats and sched debug enabled in my config, so I'll have
to re-run my tests.
A quick re-run of the 128-thread case with schedstats and sched debug
disabled still shows around 26% speedup, similar to my prior numbers.
I'm not sure why Jens has much better speedups on a similar system.
I'm attaching my config in case someone spots anything obvious. Note
that my BIOS is configured to show 24 NUMA nodes to the kernel (one
NUMA node per core complex).
Here's my .config - note it's from the stock kernel run, which is why it
still has:
CONFIG_MMU_LAZY_TLB_REFCOUNT=y
set. I have the same NUMA configuration as you, but end up with 32 nodes
on this box.
Just to make sure: did you use any other command-line options when starting
the test program (other than -t N)?
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com