Hi Alex and Waiman,

Thanks a lot for your swift response and clarification.

On Wed, Jan 22, 2020 at 7:30 PM Alex Kogan <alex.kogan@xxxxxxxxxx> wrote:
>
> Hi, Lihao.
>
> > On Jan 22, 2020, at 6:45 AM, Lihao Liang <lihaoliang@xxxxxxxxxx> wrote:
> >
> > Hi Alex,
> >
> > On Wed, Jan 22, 2020 at 10:28 AM Alex Kogan <alex.kogan@xxxxxxxxxx> wrote:
> >>
> >> Summary
> >> -------
> >>
> >> Lock throughput can be increased by handing a lock to a waiter on the
> >> same NUMA node as the lock holder, provided care is taken to avoid
> >> starvation of waiters on other NUMA nodes. This patch introduces CNA
> >> (compact NUMA-aware lock) as the slow path for qspinlock. It is
> >> enabled through a configuration option (NUMA_AWARE_SPINLOCKS).
> >>
> >
> > Thanks for your patches. The experimental results look promising!
> >
> > I understand that the new CNA qspinlock uses randomization to achieve
> > long-term fairness, and provides the numa_spinlock_threshold parameter
> > for users to tune.
> This was the case in the first versions of the series, but it is no longer true.
> That is, the long-term fairness is achieved deterministically (and you are correct
> that it is done through the numa_spinlock_threshold parameter).
>
> > As Linux runs extremely diverse workloads, it is not
> > clear how randomization affects its fairness, and how users with
> > different requirements are supposed to tune this parameter.
> >
> > To this end, Will and I consider it beneficial to be able to answer the
> > following question:
> >
> > With different values of numa_spinlock_threshold and
> > SHUFFLE_REDUCTION_PROB_ARG, how long do threads running on different
> > sockets have to wait to acquire the lock?
> The SHUFFLE_REDUCTION_PROB_ARG parameter is intended for performance
> optimization only, and *does not* affect the long-term fairness (or, at the
> very least, does not make it any worse). As Longman correctly pointed out in
> his response to this email, the shuffle reduction optimization is relevant only
> when the secondary queue is empty. In that case, CNA hands off the lock
> exactly as MCS does, i.e., in FIFO order. Note that when the secondary
> queue is not empty, we do not call probably().
>
> > This is particularly relevant
> > in high contention situations when new threads keep arriving on the same
> > socket as the lock holder.
> In this case, the lock will stay on the same NUMA node/socket for
> 2^numa_spinlock_threshold times, which is the worst-case scenario if we
> consider the long-term fairness. And if we have multiple nodes, it will take
> up to 2^numa_spinlock_threshold X (nr_nodes - 1) + nr_cpus_per_node
> lock transitions until any given thread acquires the lock
> (assuming 2^numa_spinlock_threshold > nr_cpus_per_node).
>

You're right that the latest version of the patch handles long-term
fairness deterministically.

As I understand it, the n-th thread in the main queue is guaranteed to
acquire the lock after N lock handovers, where N is bounded by

  n - 1 + 2^numa_spinlock_threshold * (nr_nodes - 1)

I'm not sure what role the variable nr_cpus_per_node plays in your
analysis. Am I missing anything?

Many thanks,
Lihao.

> Hopefully, it addresses your concern. Let me know if you have any further
> questions.
>
> Best regards,
> — Alex
>
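
P.S. For concreteness, here is a minimal user-space sketch (not part of the
patch series) that plugs numbers into the bound I wrote above. The values of
numa_spinlock_threshold, nr_nodes and n are hypothetical examples chosen for
illustration, not defaults from the series:

/*
 * Worst-case number of lock handovers before the n-th waiter in the
 * main queue acquires the lock, per the bound discussed above:
 *   N <= (n - 1) + 2^numa_spinlock_threshold * (nr_nodes - 1)
 */
#include <stdio.h>

int main(void)
{
	unsigned long numa_spinlock_threshold = 16;	/* hypothetical value */
	unsigned long nr_nodes = 2;			/* hypothetical value */
	unsigned long n = 10;	/* waiter's position in the main queue */

	unsigned long bound = (n - 1) +
		(1UL << numa_spinlock_threshold) * (nr_nodes - 1);

	printf("thread %lu acquires the lock after at most %lu handovers\n",
	       n, bound);
	return 0;
}

With these example numbers the bound evaluates to 9 + 65536 = 65545 handovers.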