On Thu, Apr 01, 2021 at 11:31:56AM -0400, Alex Kogan wrote:
> This performance optimization chooses probabilistically to avoid moving
> threads from the main queue into the secondary one when the secondary queue
> is empty.
>
> It is helpful when the lock is only lightly contended. In particular, it
> makes CNA less eager to create a secondary queue, but does not introduce
> any extra delays for threads waiting in that queue once it is created.
>
> Signed-off-by: Alex Kogan <alex.kogan@xxxxxxxxxx>
> Reviewed-by: Steve Sistare <steven.sistare@xxxxxxxxxx>
> Reviewed-by: Waiman Long <longman@xxxxxxxxxx>
> ---
>  kernel/locking/qspinlock_cna.h | 39 ++++++++++++++++++++++++++++++++++
>  1 file changed, 39 insertions(+)
>
> diff --git a/kernel/locking/qspinlock_cna.h b/kernel/locking/qspinlock_cna.h
> index 29c3abbd3d94..983c6a47a221 100644
> --- a/kernel/locking/qspinlock_cna.h
> +++ b/kernel/locking/qspinlock_cna.h
> @@ -5,6 +5,7 @@
>
>  #include <linux/topology.h>
>  #include <linux/sched/rt.h>
> +#include <linux/random.h>
>
>  /*
>   * Implement a NUMA-aware version of MCS (aka CNA, or compact NUMA-aware lock).
> @@ -86,6 +87,34 @@ static inline bool intra_node_threshold_reached(struct cna_node *cn)
>  	return current_time - threshold > 0;
>  }
>
> +/*
> + * Controls the probability for enabling the ordering of the main queue
> + * when the secondary queue is empty. The chosen value reduces the amount
> + * of unnecessary shuffling of threads between the two waiting queues
> + * when the contention is low, while responding fast enough and enabling
> + * the shuffling when the contention is high.
> + */
> +#define SHUFFLE_REDUCTION_PROB_ARG  (7)

Out of curiosity: Have you tried other values and measured which value
of SHUFFLE_REDUCTION_PROB_ARG is most efficient?

Maybe I miscalculated it, but if I understand it correctly, this value
implies that the probability is 0.9921875 that the function below
returns true.
My question is probably answered by the following statement from the
referenced paper:

"In our experiments with the shuffle reduction optimization enabled, we
set THRESHOLD2 to 0xff." (page with figure 5)

> +
> +/* Per-CPU pseudo-random number seed */
> +static DEFINE_PER_CPU(u32, seed);
> +
> +/*
> + * Return false with probability 1 / 2^@num_bits.
> + * Intuitively, the larger @num_bits the less likely false is to be returned.
> + * @num_bits must be a number between 0 and 31.
> + */
> +static bool probably(unsigned int num_bits)
> +{
> +	u32 s;
> +
> +	s = this_cpu_read(seed);
> +	s = next_pseudo_random32(s);
> +	this_cpu_write(seed, s);
> +
> +	return s & ((1 << num_bits) - 1);
> +}
> +
>  static void __init cna_init_nodes_per_cpu(unsigned int cpu)
>  {
>  	struct mcs_spinlock *base = per_cpu_ptr(&qnodes[0].mcs, cpu);
> @@ -293,6 +322,16 @@ static __always_inline u32 cna_wait_head_or_lock(struct qspinlock *lock,
>  {
>  	struct cna_node *cn = (struct cna_node *)node;
>
> +	if (node->locked <= 1 && probably(SHUFFLE_REDUCTION_PROB_ARG)) {

Again, if I understand it correctly, with SHUFFLE_REDUCTION_PROB_ARG == 7
probably() returns false in roughly 1 out of 128 cases.

Why/when is this beneficial? I assume it has to do with the following
statement in the referenced paper:

"The superior performance over MCS at 4 threads is the result of the
shuffling that does take place once in a while, organizing threads'
arrivals to the lock in a way that reduces the inter-socket lock
migration without the need to continuously modify the main queue. ..."
(page with figure 9; the paper has no page numbers)

But OTOH, why this pseudo-randomness? How about deterministically
treating every 128th execution differently (it also matches "once in a
while"), thus entirely removing the pseudo-randomness? Have you tried
this? If so, why was it worse than pseudo-randomness? (Or maybe I
missed something and pseudo-randomness is required for other reasons
there.)
> +		/*
> +		 * When the secondary queue is empty, skip the call to
> +		 * cna_order_queue() below with high probability. This optimization
> +		 * reduces the overhead of unnecessary shuffling of threads
> +		 * between waiting queues when the lock is only lightly contended.
> +		 */
> +		return 0;
> +	}
> +
>  	if (!cn->start_time || !intra_node_threshold_reached(cn)) {
>  		/*
>  		 * We are at the head of the wait queue, no need to use
> --
> 2.24.3 (Apple Git-128)
>

--
Regards,
Andreas