Re: [PATCH 0/3] Add NUMA-awareness to qspinlock

Alex Kogan <alex.kogan@xxxxxxxxxx> · Fri, 1 Feb 2019 16:20:53 -0500



> On Jan 31, 2019, at 4:56 AM, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> 
> On Wed, Jan 30, 2019 at 10:01:32PM -0500, Alex Kogan wrote:
>> Lock throughput can be increased by handing a lock to a waiter on the
>> same NUMA socket as the lock holder, provided care is taken to avoid
>> starvation of waiters on other NUMA sockets. This patch introduces CNA
>> (compact NUMA-aware lock) as the slow path for qspinlock.
> 
> Since you use NUMA, use the term node, not socket. The two are not
> strictly related.
Got it, thanks.

> 
>> CNA is a NUMA-aware version of the MCS spin-lock. Spinning threads are
>> organized in two queues, a main queue for threads running on the same
>> socket as the current lock holder, and a secondary queue for threads
>> running on other sockets. Threads record the ID of the socket on which
>> they are running in their queue nodes. At the unlock time, the lock
>> holder scans the main queue looking for a thread running on the same
>> socket. If found (call it thread T), all threads in the main queue
>> between the current lock holder and T are moved to the end of the
>> secondary queue, and the lock is passed to T. If such T is not found, the
>> lock is passed to the first node in the secondary queue. Finally, if the
>> secondary queue is empty, the lock is passed to the next thread in the
>> main queue.
>> 
>> Full details are available at https://arxiv.org/abs/1810.05600.
> 
> Full details really should also be in the Changelog. You can skip much
> of the academic bla-bla, but the Changelog should be self contained.
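For reference, the hand-off policy quoted above can be sketched in plain C.
This is a minimal single-threaded illustration, not the code from the patch:
the names (struct cna_node, struct cna_queues, cna_handoff) are hypothetical,
there are no atomics or spinning, and the starvation-avoidance part is left
out. One assumption goes beyond the text above: when the lock is passed to the
head of the secondary queue, the remaining main-queue waiters are linked
behind the secondary queue.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical queue node, one per waiting thread. */
struct cna_node {
	int numa_node;		/* node ID recorded at enqueue time */
	struct cna_node *next;
};

/* Hypothetical lock state: the two queues from the description. */
struct cna_queues {
	struct cna_node *main_head;	/* current lock holder */
	struct cna_node *sec_head;	/* waiters running on other nodes */
	struct cna_node *sec_tail;
};

/* Pick the next lock holder at unlock time; returns NULL if nobody waits. */
struct cna_node *cna_handoff(struct cna_queues *q)
{
	struct cna_node *holder = q->main_head;
	struct cna_node *first, *prev, *t;

	assert(holder != NULL);
	first = holder->next;
	holder->next = NULL;

	/* Scan the main queue for the first waiter on the holder's node. */
	prev = NULL;
	for (t = first; t != NULL; prev = t, t = t->next)
		if (t->numa_node == holder->numa_node)
			break;

	if (t != NULL) {
		if (prev != NULL) {
			/* Move the skipped waiters [first..prev] to the
			 * end of the secondary queue. */
			prev->next = NULL;
			if (q->sec_tail != NULL)
				q->sec_tail->next = first;
			else
				q->sec_head = first;
			q->sec_tail = prev;
		}
		q->main_head = t;
		return t;
	}

	if (q->sec_head != NULL) {
		/* No local waiter: hand over to the secondary queue head.
		 * Assumption: the rest of the main queue goes behind it. */
		q->sec_tail->next = first;
		q->main_head = q->sec_head;
		q->sec_head = q->sec_tail = NULL;
		return q->main_head;
	}

	/* Secondary queue empty: plain FIFO hand-off. */
	q->main_head = first;
	return first;
}
```

With a holder on node 0 and waiters on nodes 1, 1, 0, 1 (in that order), the
first call returns the node-0 waiter and parks the two node-1 waiters ahead of
it in the secondary queue; the next call finds no local waiter and falls back
to the secondary queue.
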
> 
>> We have done some performance evaluation with the locktorture module
>> as well as with several benchmarks from the will-it-scale repo.
>> The following locktorture results are from an Oracle X5-4 server
>> (four Intel Xeon E7-8895 v3 @ 2.60GHz sockets with 18 hyperthreaded
>> cores each). Each number represents an average (over 5 runs) of the
>> total number of ops (x10^7) reported at the end of each run. The stock
>> kernel is v4.20.0-rc4+ compiled in the default configuration.
>> 
>> #thr  stock  patched speedup (patched/stock)
>>  1   2.710   2.715  1.002
>>  2   3.108   3.001  0.966
>>  4   4.194   3.919  0.934
> 
> So low contention is actually worse. Funnily low contention is the
> majority of our locks and is _really_ important.
This can most certainly be engineered out, e.g., by caching the ID of the node on which a task is running.
We will look into that.

> 
>>  8   5.309   6.894  1.299
>> 16   6.722   9.094  1.353
>> 32   7.314   9.885  1.352
>> 36   7.562   9.855  1.303
>> 72   6.696  10.358  1.547
>> 108   6.364  10.181  1.600
>> 142   6.179  10.178  1.647
>> 
>> When the kernel is compiled with lockstat enabled, CNA 
> 
> I'll ignore that, lockstat/lockdep enabled runs are not what one would
> call performance relevant.
Please note that only one set of results has lockstat enabled.
The rest of the results (will-it-scale included) do not.

Regards,
— Alex