Hi, Lihao. > On Jan 22, 2020, at 6:45 AM, Lihao Liang <lihaoliang@xxxxxxxxxx> wrote: > > Hi Alex, > > On Wed, Jan 22, 2020 at 10:28 AM Alex Kogan <alex.kogan@xxxxxxxxxx> wrote: >> >> Summary >> ------- >> >> Lock throughput can be increased by handing a lock to a waiter on the >> same NUMA node as the lock holder, provided care is taken to avoid >> starvation of waiters on other NUMA nodes. This patch introduces CNA >> (compact NUMA-aware lock) as the slow path for qspinlock. It is >> enabled through a configuration option (NUMA_AWARE_SPINLOCKS). >> > > Thanks for your patches. The experimental results look promising! > > I understand that the new CNA qspinlock uses randomization to achieve > long-term fairness, and provides the numa_spinlock_threshold parameter > for users to tune. This has been the case in the first versions of the series, but is not true anymore. That is, the long-term fairness is achieved deterministically (and you are correct that it is done through the numa_spinlock_threshold parameter). > As Linux runs extremely diverse workloads, it is not > clear how randomization affects its fairness, and how users with > different requirements are supposed to tune this parameter. > > To this end, Will and I consider it beneficial to be able to answer the > following question: > > With different values of numa_spinlock_threshold and > SHUFFLE_REDUCTION_PROB_ARG, how long do threads running on different > sockets have to wait to acquire the lock? The SHUFFLE_REDUCTION_PROB_ARG parameter is intended for performance optimization only, and *does not* affect the long-term fairness (or, at the very least, does not make it any worse). As Longman correctly pointed out in his response to this email, the shuffle reduction optimization is relevant only when the secondary queue is empty. In that case, CNA hands-off the lock exactly as MCS does, i.e., in the FIFO order. Note that when the secondary queue is not empty, we do not call probably(). > This is particularly relevant > in high contention situations when new threads keep arriving on the same > socket as the lock holder. In this case, the lock will stay on the same NUMA node/socket for 2^numa_spinlock_threshold times, which is the worst case scenario if we consider the long-term fairness. And if we have multiple nodes, it will take up to 2^numa_spinlock_threshold X (nr_nodes - 1) + nr_cpus_per_node lock transitions until any given thread will acquire the lock (assuming 2^numa_spinlock_threshold > nr_cpus_per_node). Hopefully, it addresses your concern. Let me know if you have any further questions. Best regards, — Alex