> On Jan 31, 2019, at 4:56 AM, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>
> On Wed, Jan 30, 2019 at 10:01:32PM -0500, Alex Kogan wrote:
>> Lock throughput can be increased by handing a lock to a waiter on the
>> same NUMA socket as the lock holder, provided care is taken to avoid
>> starvation of waiters on other NUMA sockets. This patch introduces CNA
>> (compact NUMA-aware lock) as the slow path for qspinlock.
>
> Since you use NUMA, use the term node, not socket. The two are not
> strictly related.

Got it, thanks.

>
>> CNA is a NUMA-aware version of the MCS spin-lock. Spinning threads are
>> organized in two queues, a main queue for threads running on the same
>> socket as the current lock holder, and a secondary queue for threads
>> running on other sockets. Threads record the ID of the socket on which
>> they are running in their queue nodes. At the unlock time, the lock
>> holder scans the main queue looking for a thread running on the same
>> socket. If found (call it thread T), all threads in the main queue
>> between the current lock holder and T are moved to the end of the
>> secondary queue, and the lock is passed to T. If such T is not found, the
>> lock is passed to the first node in the secondary queue. Finally, if the
>> secondary queue is empty, the lock is passed to the next thread in the
>> main queue.
>>
>> Full details are available at https://arxiv.org/abs/1810.05600.
>
> Full details really should also be in the Changelog. You can skip much
> of the academic bla-bla, but the Changelog should be self contained.
>
>> We have done some performance evaluation with the locktorture module
>> as well as with several benchmarks from the will-it-scale repo.
>> The following locktorture results are from an Oracle X5-4 server
>> (four Intel Xeon E7-8895 v3 @ 2.60GHz sockets with 18 hyperthreaded
>> cores each). Each number represents an average (over 5 runs) of the
>> total number of ops (x10^7) reported at the end of each run. The stock
>> kernel is v4.20.0-rc4+ compiled in the default configuration.
>>
>> #thr   stock  patched  speedup (patched/stock)
>>    1   2.710    2.715  1.002
>>    2   3.108    3.001  0.966
>>    4   4.194    3.919  0.934
>
> So low contention is actually worse. Funnily low contention is the
> majority of our locks and is _really_ important.

This can most certainly be engineered out, e.g., by caching the ID of the
node on which a task is running. We will look into that.

>
>>    8   5.309    6.894  1.299
>>   16   6.722    9.094  1.353
>>   32   7.314    9.885  1.352
>>   36   7.562    9.855  1.303
>>   72   6.696   10.358  1.547
>>  108   6.364   10.181  1.600
>>  142   6.179   10.178  1.647
>>
>> When the kernel is compiled with lockstat enabled, CNA
>
> I'll ignore that, lockstat/lockdep enabled runs are not what one would
> call performance relevant.

Please note that only one set of results has lockstat enabled.
The rest of the results (will-it-scale included) do not have it.

Regards,
-- Alex
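
For reference, the hand-off policy described in the quoted changelog can be
sketched roughly as below. This is a single-threaded model of the queue
manipulation only, not code from the patch: the names (struct waiter,
struct cna_queues, cna_pick_successor, cna_demo) are hypothetical, there are
no atomics or spinning, and starvation avoidance is left out. Splicing the
rest of the secondary queue in front of the main queue in step 2 is an
assumption; the changelog text only says that the first node of the
secondary queue gets the lock.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical queue node; one per spinning thread. */
struct waiter {
	int numa_node;		/* node ID recorded when the thread queued up */
	struct waiter *next;
};

/* Hypothetical holder of both queues (not the kernel's data layout). */
struct cna_queues {
	struct waiter *main_head;	/* first waiter behind the lock holder */
	struct waiter *sec_head;	/* secondary queue: skipped waiters */
	struct waiter *sec_tail;
};

/* Pick the next lock holder at unlock time; NULL if nobody is waiting. */
static struct waiter *cna_pick_successor(struct cna_queues *q, int holder_node)
{
	struct waiter *prev = NULL, *cur;

	/* Step 1: scan the main queue for a waiter on the holder's node. */
	for (cur = q->main_head; cur; prev = cur, cur = cur->next)
		if (cur->numa_node == holder_node)
			break;

	if (cur) {
		if (prev) {
			/* Move the skipped waiters to the end of the secondary queue. */
			prev->next = NULL;
			if (q->sec_tail)
				q->sec_tail->next = q->main_head;
			else
				q->sec_head = q->main_head;
			q->sec_tail = prev;
		}
		q->main_head = cur->next;
		cur->next = NULL;
		return cur;
	}

	/* Step 2: no same-node waiter; the secondary queue head gets the lock. */
	if (q->sec_head) {
		cur = q->sec_head;
		/* Assumption: remaining secondary waiters go in front of the main queue. */
		q->sec_tail->next = q->main_head;
		q->main_head = cur->next;
		q->sec_head = q->sec_tail = NULL;
		cur->next = NULL;
		return cur;
	}

	/* Step 3: secondary queue is empty; pass to the next main-queue waiter. */
	cur = q->main_head;
	if (cur) {
		q->main_head = cur->next;
		cur->next = NULL;
	}
	return cur;
}

/* Self-check: holder on node 0, main queue holds waiters on nodes 1, 1, 0. */
static int cna_demo(void)
{
	struct waiter c = { 0, NULL }, b = { 1, &c }, a = { 1, &b };
	struct cna_queues q = { &a, NULL, NULL };
	struct waiter *t;

	t = cna_pick_successor(&q, 0);
	assert(t == &c);				/* same-node waiter wins */
	assert(q.main_head == NULL);			/* c was last in the main queue */
	assert(q.sec_head == &a && q.sec_tail == &b);	/* skipped waiters keep order */

	/* c (node 0) unlocks: main queue is empty, secondary head gets the lock. */
	t = cna_pick_successor(&q, 0);
	assert(t == &a);
	assert(q.main_head == &b && q.sec_head == NULL);
	return 0;
}
```

Calling cna_demo() from a small test program exercises steps 1 and 2. The
scan in step 1 is what the low-contention rows of the table pay for, so a
real implementation would want to keep it as cheap as possible.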