On Fri, 29 Dec 2023 at 12:57, David Laight <David.Laight@xxxxxxxxxx> wrote:
>
> this_cpu_ptr() is rather more expensive than raw_cpu_read() since
> the latter can use an 'offset from register' (%gs for x86-64).
>
> Add a 'self' field to 'struct optimistic_spin_node' that can be
> read with raw_cpu_read(), initialise on first call.

No, this is horrible.

The problem isn't the "this_cpu_ptr()", it's the rest of the code.

> bool osq_lock(struct optimistic_spin_queue *lock)
> {
> -	struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
> +	struct optimistic_spin_node *node = raw_cpu_read(osq_node.self);

No. Both of these are crap.

> 	struct optimistic_spin_node *prev, *next;
> 	int old;
>
> -	if (unlikely(node->cpu == OSQ_UNLOCKED_VAL))
> -		node->cpu = encode_cpu(smp_processor_id());
> +	if (unlikely(!node)) {
> +		int cpu = encode_cpu(smp_processor_id());
> +		node = decode_cpu(cpu);
> +		node->self = node;
> +		node->cpu = cpu;
> +	}

The proper fix here is to not do that silly

        node = this_cpu_ptr(&osq_node);
        ..
        node->next = NULL;

dance at all, but to simply do

        this_cpu_write(osq_node.next, NULL);

in the first place.

That makes the whole thing just a single store off the segment
descriptor.

Yes, you'll eventually end up doing that

        node = this_cpu_ptr(&osq_node);

thing because it then wants to use that raw pointer to do

        WRITE_ONCE(prev->next, node);

but that's a separate issue, and it still does not make it worth
creating a pointless self-pointer.

Btw, if you *really* want to solve that separate issue, then make the
optimistic_spin_node struct not contain the pointers at all, but the
CPU numbers, and then turn those numbers into the pointers the exact
same way it does for the "lock->tail" thing, ie doing that whole

        prev = decode_cpu(old);

dance. That *may* then result in avoiding turning them into pointers
at all in some cases.

Also, I think you might want to look into making OSQ_UNLOCKED_VAL
be -1 instead, and add something like

        #define IS_OSQ_UNLOCKED(x) ((int)(x) < 0)

which would then avoid the +1 / -1 games in encoding/decoding the
CPU numbers. Right now they make gcc generate silly code like this:

        subl    $1, %eax        #, cpu_nr
        ...
        cltq
        addq    __per_cpu_offset(,%rax,8), %rcx

which seems honestly stupid. The 'cltq' is there for sign-extension,
which is needed because all these things are "int", and the "subl"
will zero-extend to 64 bits, not sign-extend.

At that point, I think gcc might be able to just generate

        addq    __per_cpu_offset-8(,%rax,8), %rcx

but honestly, I think it would be nicer to just have decode_cpu() do

        unsigned int cpu_nr = encoded_cpu_val;

        return per_cpu_ptr(&osq_node, cpu_nr);

and not have the -1/+1 at all.

Hmm? UNTESTED patch to just do the "this_cpu_write()" parts attached.

Again, note how we do end up doing that this_cpu_ptr conversion later
anyway, but at least it's off the critical path.

                Linus
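[For illustration only, and not part of the attached patch: a minimal,
untested sketch of what the "OSQ_UNLOCKED_VAL = -1 plus IS_OSQ_UNLOCKED()"
idea above could look like against the existing encode_cpu()/decode_cpu()
helpers in kernel/locking/osq_lock.c. IS_OSQ_UNLOCKED() is only the name
suggested in the mail; code that currently tests for OSQ_UNLOCKED_VAL,
such as osq_is_locked(), would presumably switch to the macro.]

        /* -1 is never a valid CPU number, so the encoded value is just cpu_nr. */
        #define OSQ_UNLOCKED_VAL        (-1)
        #define IS_OSQ_UNLOCKED(val)    ((int)(val) < 0)

        static inline int encode_cpu(int cpu_nr)
        {
                return cpu_nr;          /* no +1 any more */
        }

        static inline struct optimistic_spin_node *decode_cpu(int encoded_cpu_val)
        {
                /* unsigned index: no 'cltq' sign-extension before the per-cpu add */
                unsigned int cpu_nr = encoded_cpu_val;

                return per_cpu_ptr(&osq_node, cpu_nr);
        }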
 kernel/locking/osq_lock.c | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index 75a6f6133866..c3a166b7900c 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -92,14 +92,14 @@ osq_wait_next(struct optimistic_spin_queue *lock,
 
 bool osq_lock(struct optimistic_spin_queue *lock)
 {
-	struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
+	struct optimistic_spin_node *node;
 	struct optimistic_spin_node *prev, *next;
 	int curr = encode_cpu(smp_processor_id());
 	int old;
 
-	node->locked = 0;
-	node->next = NULL;
-	node->cpu = curr;
+	this_cpu_write(osq_node.next, NULL);
+	this_cpu_write(osq_node.locked, 0);
+	this_cpu_write(osq_node.cpu, curr);
 
 	/*
 	 * We need both ACQUIRE (pairs with corresponding RELEASE in
@@ -112,7 +112,9 @@ bool osq_lock(struct optimistic_spin_queue *lock)
 		return true;
 
 	prev = decode_cpu(old);
-	node->prev = prev;
+	this_cpu_write(osq_node.prev, prev);
+
+	node = this_cpu_ptr(&osq_node);
 
 	/*
 	 * osq_lock() unqueue
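[Likewise purely illustrative and untested, not part of the patch above: a
rough sketch of the "store CPU numbers instead of pointers in the node" idea
from the mail. The field and helper names below are invented for the sketch;
the real struct optimistic_spin_node still holds next/prev pointers.]

        struct optimistic_spin_node {
                int next_cpu;   /* encoded CPU # of the next waiter, or "none" */
                int prev_cpu;   /* encoded CPU # of the previous waiter */
                int locked;     /* 1 if lock acquired */
                int cpu;        /* this node's own encoded CPU # */
        };

        /* Decode lazily, only when the neighbouring node is actually touched. */
        static inline struct optimistic_spin_node *node_next(struct optimistic_spin_node *node)
        {
                int encoded = READ_ONCE(node->next_cpu);

                return encoded == OSQ_UNLOCKED_VAL ? NULL : decode_cpu(encoded);
        }

The queueing store WRITE_ONCE(prev->next, node) would then become something
like WRITE_ONCE(prev->next_cpu, curr), which never needs the local node
pointer at all on that path.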