On Sat, 30 Dec 2023 at 12:41, Linus Torvalds
<torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
>
> UNTESTED patch to just do the "this_cpu_write()" parts attached.
> Again, note how we do end up doing that this_cpu_ptr conversion later
> anyway, but at least it's off the critical path.

Also note that while 'this_cpu_ptr()' doesn't exactly generate lovely
code, it really is still better than caching a value in memory. At
least the memory location that 'this_cpu_ptr()' accesses is slightly
more likely to be hot (and is right next to the cpu number, iirc).

That said, I think we should fix this_cpu_ptr() to not ever generate
that disgusting cltq just because the cpu pointer has the wrong
signedness.

I don't quite know how to do it, but this:

  -#define per_cpu_offset(x) (__per_cpu_offset[x])
  +#define per_cpu_offset(x) (__per_cpu_offset[(unsigned)(x)])

at least helps a *bit*. It gets rid of the cltq, at least, but if
somebody actually passes in an 'unsigned long' cpuid, it would cause
an unnecessary truncation.

And gcc still generates

        subl    $1, %eax        #, cpu_nr
        addq    __per_cpu_offset(,%rax,8), %rcx

instead of just doing

        addq    __per_cpu_offset-8(,%rax,8), %rcx

because it still needs to clear the upper 32 bits and doesn't know
that the 'xchg()' already did that.

Oh well. I guess even without the -1/+1 games by the OSQ code, we
would still end up with a "movl" just to do that upper bits clearing
that the compiler doesn't know is unnecessary. I don't think we have
any reasonable way to tell the compiler that the register output of
our xchg() inline asm has the upper 32 bits clear.

              Linus
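
[For reference, a minimal stand-alone sketch of the sign-extension
issue described above. The names 'offsets', 'signed_index' and
'unsigned_index' are made-up stand-ins, not kernel code, and the
generated-code notes assume gcc on x86-64.]

        /* Illustration only: indexing a 64-bit array with a signed vs.
         * an unsigned 32-bit value. */
        extern unsigned long offsets[];

        unsigned long signed_index(int cpu_nr)
        {
                /* Signed index: gcc must sign-extend cpu_nr to 64 bits
                 * (cltq/movslq) before the scaled load. */
                return offsets[cpu_nr];
        }

        unsigned long unsigned_index(int cpu_nr)
        {
                /* Casting to unsigned drops the sign extension; at worst
                 * a 'movl' remains to clear the upper 32 bits when the
                 * compiler cannot prove they are already zero. */
                return offsets[(unsigned)cpu_nr];
        }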