From: Linus Torvalds > Sent: 30 December 2023 20:41 > > On Fri, 29 Dec 2023 at 12:57, David Laight <David.Laight@xxxxxxxxxx> wrote: > > > > this_cpu_ptr() is rather more expensive than raw_cpu_read() since > > the latter can use an 'offset from register' (%gs for x86-84). > > > > Add a 'self' field to 'struct optimistic_spin_node' that can be > > read with raw_cpu_read(), initialise on first call. > > No, this is horrible. > > The problem isn't the "this_cpu_ptr()", it's the rest of the code. > > > bool osq_lock(struct optimistic_spin_queue *lock) > > { > > - struct optimistic_spin_node *node = this_cpu_ptr(&osq_node); > > + struct optimistic_spin_node *node = raw_cpu_read(osq_node.self); > > No. Both of these are crap. > > > struct optimistic_spin_node *prev, *next; > > int old; > > > > - if (unlikely(node->cpu == OSQ_UNLOCKED_VAL)) > > - node->cpu = encode_cpu(smp_processor_id()); > > + if (unlikely(!node)) { > > + int cpu = encode_cpu(smp_processor_id()); > > + node = decode_cpu(cpu); > > + node->self = node; > > + node->cpu = cpu; > > + } > > The proper fix here is to not do that silly > > node = this_cpu_ptr(&osq_node); > .. > node->next = NULL; > > dance at all, but to simply do > > this_cpu_write(osq_node.next, NULL); > > in the first place. That makes the whole thing just a single store off > the segment descriptor. Is the equivalent true (ie offset from fixed register) for all SMP archs? Or do some have to do something rather more complicated? > Yes, you'll eventually end up doing that > > node = this_cpu_ptr(&osq_node); For some reason I've had CONFIG_DEBUG_PREEMPT set. I don't remember setting it, and can't imagine why I might have. Best guess is it was inherited from the ubuntu .config I started with. In any case it changes smp_processor_id() into a real function in order to check that preemption is disabled. I'd guess something like BUG_ON(!raw_cpu_read(preempt_disable_count)) would be faster and more obvious! > thing because it then wants to use that raw pointer to do > > WRITE_ONCE(prev->next, node); > > but that's a separate issue and still does not make it worth it to > create a pointless self-pointer. I could claim that loading it is one instruction shorter and that if the cache line containing 'node' is needed and 'pcpu_hot' is (unexpectedly) not cached it saves a cache line load. But I probably won't! > > Btw, if you *really* want to solve that separate issue, then make the > optimistic_spin_node struct not contain the pointers at all, but the > CPU numbers, and then turn those numbers into the pointers the exact > same way it does for the "lock->tail" thing, ie doing that whole > > prev = decode_cpu(old); > > dance. That *may* then result in avoiding turning them into pointers > at all in some cases. Don't think so. Pretty much all the uses need to dereference the next/prev pointers. > Also, I think that you might want to look into making OSQ_UNLOCKED_VAL > be -1 instead, and add something like > > #define IS_OSQ_UNLOCKED(x) ((int)(x)<0) I did think about that (but not for these patches). But it is a lot more dangerous because it absolutely requires the structure be correctly initialised, not just be all zero. That might show up some very strange bugs. > and that would then avoid the +1 / -1 games in encoding/decoding the > CPU numbers. It causes silly code generated like this: > > subl $1, %eax #, cpu_nr > ... > cltq > addq __per_cpu_offset(,%rax,8), %rcx > > which seems honestly stupid. The cltq is there for sign-extension, > which is because all these things are "int", and the "subl" will > zero-extend to 64-bit, not sign-extend. Changing all the variables to 'unsigned int' will remove the sign extension and, after the decrement, gcc will know the high bits are zero so shouldn't need to zero extend. > At that point, I think gcc might be able to just generate > > addq __per_cpu_offset-8(,%rax,8), %rcx That would need the decrement to be 64bit. A quick test failed to make that work. Probably (as you mentioned in the next email) because gcc doesn't know that the high bits from atomic_xchg() are zero. > but honestly, I think it would be nicer to just have decode_cpu() do > > unsigned int cpu_nr = encoded_cpu_val; > > return per_cpu_ptr(&osq_node, cpu_nr); > > and not have the -1/+1 at all. Unless you can persuade gcc that the high bits from atomic_xchg() are zero that will require a zero extend. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)