Zeng Heng noted that heavy use of the osq (optimistic spin queue) code
used rather more cpu than might be expected. See:
https://lore.kernel.org/lkml/202312210155.Wc2HUK8C-lkp@xxxxxxxxx/T/#mcc46eedd1ef22a0d668828b1d088508c9b1875b8

Part of the problem is a pretty much guaranteed cache line reload when
reading node->prev->cpu for the vcpu_is_preempted() check in the wakeup
path (even on bare metal), which slows it down.
(On bare metal the hypervisor call is patched out, but the argument is
still read.)

Careful analysis shows that it isn't necessary to dirty the per-cpu data
on the fast-path osq_lock() path. This may be slightly beneficial.

The code also uses this_cpu_ptr() to get the address of the per-cpu data.
On x86-64 (at least) this is implemented as:
	&per_cpu_data[smp_processor_id()]->member
ie there is a real function call, an array index and an add.
However, if raw_cpu_read() can be used instead (which is typically just
an offset from a register - %gs on x86-64) the code will be faster.
Putting the address of the per-cpu node into itself means that only one
cache line need be loaded (a rough sketch of this is appended at the end
of this mail).
I can't see a list of per-cpu data initialisation functions, so the
fields are initialised on the first osq_lock() call.

The last patch avoids the cache line reload when calling
vcpu_is_preempted() by simply saving node->prev->cpu as node->prev_cpu
and updating it whenever node->prev changes (again, a sketch is appended
below). This is simpler than the patch proposed by Waiman.

David Laight (5):
  Move the definition of optimistic_spin_node into osq_lock.c
  Avoid dirtying the local cpu's 'node' in the osq_lock() fast path.
  Clarify osq_wait_next()
  Optimise per-cpu data accesses.
  Optimise vcpu_is_preempted() check.

 include/linux/osq_lock.h  |  5 ----
 kernel/locking/osq_lock.c | 61 +++++++++++++++++++++------------------
 2 files changed, 33 insertions(+), 33 deletions(-)

-- 
2.17.1
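
For illustration only, here is a minimal sketch (not taken from the
patches themselves) of the "node points at itself" idea: the per-cpu
node's own address is read with raw_cpu_read() and the constant fields
are filled in on the first osq_lock() call on each cpu. The 'self' and
'prev_cpu' fields and the osq_this_node() helper are assumptions made
for the sketch, not necessarily what the series does.

/* Sketch only: 'self', 'prev_cpu' and the lazy init are illustrative. */
#include <linux/percpu.h>
#include <linux/smp.h>

struct optimistic_spin_node {
	struct optimistic_spin_node *self;	/* assumed: node's own address */
	struct optimistic_spin_node *next, *prev;
	int locked;	/* 1 if lock acquired */
	int cpu;	/* cpu number + 1, so 0 can mean "no tail" */
	int prev_cpu;	/* assumed: cached cpu of node->prev (second sketch) */
};

static DEFINE_PER_CPU_SHARED_ALIGNED(struct optimistic_spin_node, osq_node);

/* Callers of osq_lock() already run with preemption disabled. */
static struct optimistic_spin_node *osq_this_node(void)
{
	/* raw_cpu_read() is a single %gs-relative load on x86-64. */
	struct optimistic_spin_node *node = raw_cpu_read(osq_node.self);

	if (unlikely(!node)) {
		/* First osq_lock() on this cpu: fill in the constant fields. */
		node = this_cpu_ptr(&osq_node);
		node->self = node;
		node->cpu = smp_processor_id() + 1;
	}
	return node;
}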
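
And a similar sketch of the last patch's idea: cache the (decoded) cpu
number of node->prev in the waiter's own node and refresh it whenever
node->prev changes, so the vcpu_is_preempted() check only touches the
waiter's own cache line. The helper names are again illustrative and
build on the struct above.

/* Sketch only: helper names and the prev_cpu field are illustrative. */
static void osq_set_prev(struct optimistic_spin_node *node,
			 struct optimistic_spin_node *prev)
{
	WRITE_ONCE(node->prev, prev);
	/* Keep the cached, decoded cpu number in step with node->prev. */
	WRITE_ONCE(node->prev_cpu, prev->cpu - 1);
}

static bool osq_prev_is_preempted(struct optimistic_spin_node *node)
{
	/* Only the waiter's own node is read here, not node->prev. */
	return vcpu_is_preempted(READ_ONCE(node->prev_cpu));
}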