Hi Alex,

On Wed, Jan 22, 2020 at 10:28 AM Alex Kogan <alex.kogan@xxxxxxxxxx> wrote:
>
> Summary
> -------
>
> Lock throughput can be increased by handing a lock to a waiter on the
> same NUMA node as the lock holder, provided care is taken to avoid
> starvation of waiters on other NUMA nodes. This patch introduces CNA
> (compact NUMA-aware lock) as the slow path for qspinlock. It is
> enabled through a configuration option (NUMA_AWARE_SPINLOCKS).
>

Thanks for your patches. The experimental results look promising!

I understand that the new CNA qspinlock uses randomization to achieve
long-term fairness, and provides the numa_spinlock_threshold parameter
for users to tune. As Linux runs extremely diverse workloads, it is not
clear how the randomization affects fairness, and how users with
different requirements are supposed to tune this parameter.

To this end, Will and I consider it beneficial to be able to answer the
following question: with different values of numa_spinlock_threshold
and SHUFFLE_REDUCTION_PROB_ARG, how long do threads running on
different sockets have to wait to acquire the lock? This is
particularly relevant in high-contention situations, when new threads
keep arriving on the same socket as the lock holder. In this email, I
try to provide some formal analysis to address this question.

Let's assume the probability for the lock to stay on the same socket is
*at least* p, which corresponds to the probability for the function
probably(unsigned int num_bits) in the patch to return *false*, where
SHUFFLE_REDUCTION_PROB_ARG is passed as the value of num_bits to the
function.

I noticed that the default value of p in the patch is 1/2^7 (about
0.008), which is somewhat counter-intuitive to me: if we switch sockets
99 times out of 100, then fairness should be obvious. What I expected
is a (much) higher value of p, which would likely result in better
performance while still having some degree of fairness guarantee. Have
you run any experiments with a lower SHUFFLE_REDUCTION_PROB_ARG instead
of the default value 7? It would be very helpful to know the
performance numbers.

Now let's do some analysis:

1. What is the probability P for the first thread on a different socket
to acquire the lock after *at most* N consecutive local lock handovers?

Note: N corresponds to the variable intra_node_handoff_threshold in the
patch, which is set to 1 << numa_spinlock_threshold. The default value
is 1 << 16 = 64K.

Assuming mutual independence [1], we have P = 1 - p^N, where p^N is the
probability of N consecutive lock handovers to threads running on the
socket where the lock was most recently acquired.

If p is 0.99, the probabilities of switching to a different socket
after at most N local lock handovers are as follows:

63.4% (N = 100)
86.6% (N = 200)
99.3% (N = 500)
99.996% (N = 1000)
99.99999999999933% (N = 64K)
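For reference, the first few values above can be reproduced with a
couple of lines of Python (just a sketch of the arithmetic, nothing
CNA-specific):

p = 0.99
for N in (100, 200, 500, 1000):
    # Question 1: P = 1 - p^N
    print(f"N = {N}: P = {1 - p**N:.4%}")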
2. We can ask the same question as above for the k-th thread on a
different socket from the lock holder. That is, what is the probability
P for the k-th thread on a different socket to acquire the lock after
*at most* N consecutive local lock handovers, assuming all k of these
threads in the queue are running on different sockets (the worst-case
scenario)?

The analysis is as follows (the case k = 1 reduces to Question 1
above). The total probability P is the sum of Pi for i = 0, 1, ..., N,
where Pi is the probability of having i *total* local lock handovers
before the k-th thread on a different socket can acquire the lock.

Pi can be calculated using the formula Pi = B_i_k * p^i * (1 - p)^k,
where

-- B_i_k is the number of ways to put i balls into k buckets,
   representing all the possible ways the i local handovers can be
   distributed over the k sockets; it is a multiset coefficient, equal
   to (i + k - 1)! / (i! * (k - 1)!) [2]
-- p^i is the probability of i local lock handovers
-- (1 - p)^k is the probability of k socket switches

I've written a simple Python script to calculate the value of P (a
sketch of the calculation is appended at the end of this email). Let's
look at some concrete examples and numbers.

When p = 0.99 and k = 3 (e.g. a 4-socket system), P is equal to:

8.5% (N = 100)
33.2% (N = 200)
87.9% (N = 500)
99.7% (N = 1000)
99.99999999999937% (N = 64K)

When p = 0.99 and k = 7 (e.g. an 8-socket system), the values of P are:

0.01% (N = 100)
0.52% (N = 200)
24.7% (N = 500)
87.5% (N = 1000)
99.3% (N = 1500)
99.99999999999871% (N = 64K)

I think this mathematical analysis would help users better understand
the fairness properties of the CNA qspinlock. One can use it to plot a
graph with different values of p and N to tune the qspinlock for
different platforms and workloads.

Based on the analysis above, it may be useful to make
SHUFFLE_REDUCTION_PROB_ARG a tunable parameter as well. Setting
SHUFFLE_REDUCTION_PROB_ARG to a lower value results in a higher value
of p, which would likely increase performance; intra_node_handoff_threshold
can then be set to provide a bounded degree of fairness. For instance,
a user may want P to be around 90% for N = 100 on an 8-socket system.
They can then set p = 0.9 and intra_node_handoff_threshold to ~150,
based on our analysis that, for k = 7, P = 91.9% for N = 100 and
99.99% for N = 200.

I hope this helps, and please let me know if you have any comments or
if you spot any mistakes in our analysis.

Best,
Lihao

References:
[1] https://en.wikipedia.org/wiki/Independence_(probability_theory)#More_than_two_events
[2] Theorem 2, https://en.wikipedia.org/wiki/Stars_and_bars_(combinatorics)
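P.S. Here is a rough sketch of the Question 2 calculation mentioned
above. It is a simplified stand-in for the actual script (it assumes
Python 3.8+ for math.comb) and follows the formula directly:

from math import comb

def prob_switch(p, k, n):
    """Probability that the k-th waiter on a different socket acquires
    the lock after at most n local lock handovers in total, where p is
    the probability that the lock stays on the same socket."""
    # P = sum_{i=0}^{n} B_i_k * p^i * (1 - p)^k, with B_i_k = C(i + k - 1, i)
    return sum(comb(i + k - 1, i) * p**i * (1 - p)**k for i in range(n + 1))

# A few of the data points quoted above (values are approximate):
print(prob_switch(0.99, 3, 100))   # ~0.085 ->  8.5% (4-socket example)
print(prob_switch(0.99, 7, 1000))  # ~0.875 -> 87.5% (8-socket example)
print(prob_switch(0.90, 7, 100))   # ~0.919 -> 91.9% (tuning example)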