Hi folks,

(I think Lihao is travelling at the moment, so he may be delayed in his
replies.)

On Wed, Jan 22, 2020 at 12:24:58PM -0500, Waiman Long wrote:
> On 1/22/20 6:45 AM, Lihao Liang wrote:
> > On Wed, Jan 22, 2020 at 10:28 AM Alex Kogan <alex.kogan@xxxxxxxxxx> wrote:
> >> Summary
> >> -------
> >>
> >> Lock throughput can be increased by handing a lock to a waiter on the
> >> same NUMA node as the lock holder, provided care is taken to avoid
> >> starvation of waiters on other NUMA nodes. This patch introduces CNA
> >> (compact NUMA-aware lock) as the slow path for qspinlock. It is
> >> enabled through a configuration option (NUMA_AWARE_SPINLOCKS).
> >>
> > Thanks for your patches. The experimental results look promising!
> >
> > I understand that the new CNA qspinlock uses randomization to achieve
> > long-term fairness, and provides the numa_spinlock_threshold parameter
> > for users to tune. As Linux runs extremely diverse workloads, it is not
> > clear how randomization affects its fairness, and how users with
> > different requirements are supposed to tune this parameter.
> >
> > To this end, Will and I consider it beneficial to be able to answer the
> > following question:
> >
> > With different values of numa_spinlock_threshold and
> > SHUFFLE_REDUCTION_PROB_ARG, how long do threads running on different
> > sockets have to wait to acquire the lock? This is particularly relevant
> > in high-contention situations when new threads keep arriving on the same
> > socket as the lock holder.
> >
> > In this email, I try to provide some formal analysis to address this
> > question. Let's assume the probability for the lock to stay on the
> > same socket is *at least* p, which corresponds to the probability for
> > the function probably(unsigned int num_bits) in the patch to return
> > *false*, where SHUFFLE_REDUCTION_PROB_ARG is passed as the value of
> > num_bits to the function.
>
> That is not strictly true from my understanding of the code. The
> probably() function does not come into play if a secondary queue is
> present. Also, calling cna_scan_main_queue() doesn't guarantee that a
> waiter on the same node can be found. So the simple mathematical
> analysis isn't that applicable in this case. One would have to run an
> actual simulation to find out what the actual behavior will be.

It's certainly true that the analysis is based on the worst-case scenario,
but I think it's still worth considering. For example, the secondary queue
does not exist initially, so it seems a bit odd that we only instantiate
it with < 1% probability.

That said, my real concern with any of this is that it makes formal
modelling and analysis of the qspinlock considerably more challenging. I
would /really/ like to see an update to the TLA+ model we have of the
current implementation [1], and preferably also to the userspace version
I hacked together [2], so that we can continue to test and validate
changes to the code outside of the usual kernel stress-testing.

Will

[1] https://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/kernel-tla.git/
[2] https://mirrors.edge.kernel.org/pub/linux/kernel/people/will/spinbench/
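
[Editorial aside: the probability that the analysis above leans on can be
sanity-checked in userspace. The sketch below is not the kernel code; it
only assumes that probably(num_bits) returns false when the low num_bits
of a pseudo-random value are all zero, i.e. roughly once every 2^num_bits
calls. num_bits = 7 is used purely as an example value; it happens to be
consistent with the "< 1% probability" mentioned above, since
2^-7 ~= 0.78%.]

    /* Userspace sketch only: estimate how often a probably()-style
     * helper returns false, under the assumption stated above. */
    #include <stdbool.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    static bool probably(unsigned int num_bits)
    {
            /* rand() stands in for the kernel's per-CPU pseudo-random
             * seed; false only when the low num_bits are all zero. */
            return rand() & ((1u << num_bits) - 1);
    }

    int main(void)
    {
            const unsigned int num_bits = 7;        /* example value */
            const unsigned long trials = 1000000;
            unsigned long false_cnt = 0;

            srand(time(NULL));
            for (unsigned long i = 0; i < trials; i++)
                    if (!probably(num_bits))
                            false_cnt++;

            printf("probably() returned false %.4f%% of the time "
                   "(expected ~%.4f%%)\n",
                   100.0 * false_cnt / trials,
                   100.0 / (1u << num_bits));
            return 0;
    }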