RE: [PATCH next 4/5] locking/osq_lock: Optimise per-cpu data accesses.

David Laight <David.Laight@xxxxxxxxxx> · Sun, 31 Dec 2023 11:56:05 +0000

From: Linus Torvalds
> Sent: 30 December 2023 20:59
> 
> On Sat, 30 Dec 2023 at 12:41, Linus Torvalds
> <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> >
> > UNTESTED patch to just do the "this_cpu_write()" parts attached.
> > Again, note how we do end up doing that this_cpu_ptr conversion later
> > anyway, but at least it's off the critical path.
> 
> Also note that while 'this_cpu_ptr()' doesn't exactly generate lovely
> code, it really is still better than caching a value in memory.
> 
> At least the memory location that 'this_cpu_ptr()' accesses is
> slightly more likely to be hot (and is right next to the cpu number,
> iirc).

I was only going to access the 'self' field in code that required
the 'node' cache line be present.

> 
> That said, I think we should fix this_cpu_ptr() to not ever generate
> that disgusting cltq just because the cpu pointer has the wrong
> signedness. I don't quite know how to do it, but this:
> 
>   -#define per_cpu_offset(x) (__per_cpu_offset[x])
>   +#define per_cpu_offset(x) (__per_cpu_offset[(unsigned)(x)])
> 
> at least helps a *bit*. It gets rid of the cltq, at least, but if
> somebody actually passes in an 'unsigned long' cpuid, it would cause
> an unnecessary truncation.

Doing the conversion using arithmetic might help, so:
		__per_cpu_offset[(x) + 0u]
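
To illustrate the difference, here is a minimal userspace sketch. The
table and macro names (demo_offset, demo_offset_cast, demo_offset_arith,
idx_size_*) are made up for the example and are not kernel names. With
'+ 0u' the usual arithmetic conversions choose the index type, so an
'int' becomes 'unsigned int' while an 'unsigned long' stays as it is,
whereas the cast always narrows to 'unsigned int'.

```c
#include <stddef.h>

/* Stand-in table for illustration; not the kernel's __per_cpu_offset[]. */
static unsigned long demo_offset[4] = { 0x1000, 0x2000, 0x3000, 0x4000 };

/* Cast variant from the quoted suggestion: always narrows to unsigned int. */
#define demo_offset_cast(x)	(demo_offset[(unsigned)(x)])

/* Arithmetic variant: the usual arithmetic conversions pick the index type. */
#define demo_offset_arith(x)	(demo_offset[(x) + 0u])

/* Size of the index expression each variant produces, for inspection. */
#define idx_size_cast(x)	sizeof((unsigned)(x))
#define idx_size_arith(x)	sizeof((x) + 0u)
```

For an 'int' argument both index expressions are 'unsigned int'; for an
'unsigned long' argument only the arithmetic variant keeps the full width.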

> And gcc still generates
> 
>         subl    $1, %eax        #, cpu_nr
>         addq    __per_cpu_offset(,%rax,8), %rcx
> 
> instead of just doing
> 
>         addq    __per_cpu_offset-8(,%rax,8), %rcx
> 
> because it still needs to clear the upper 32 bits and doesn't know
> that the 'xchg()' already did that.

Not only that, the subtraction has to be done after converting to 64 bits.
Otherwise the wrong location would be read were cpu_nr to be zero.
I've tried that - but it still failed.
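
A small sketch of why the order matters, using hypothetical helper names
and plain fixed-width integers rather than the real OSQ code. A 32-bit
subtract followed by zero extension and a zero extension followed by a
64-bit subtract (which is what folding the -1 into the displacement
amounts to) only disagree when cpu_nr is zero:

```c
#include <stdint.h>

/* 32-bit subtract first, then zero-extend to 64 bits (the 'subl' form). */
static inline uint64_t idx_sub32_then_extend(uint32_t cpu_nr)
{
	return (uint64_t)(uint32_t)(cpu_nr - 1u);
}

/* Zero-extend first, then subtract in 64 bits (the folded '-8' form). */
static inline uint64_t idx_extend_then_sub64(uint32_t cpu_nr)
{
	return (uint64_t)cpu_nr - 1u;
}
```

For cpu_nr == 0 the first gives 0xffffffff and the second wraps to all
ones (i.e. index -1), so the two address different memory. For any
non-zero cpu_nr they give the same index.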

> Oh well. I guess even without the -1/+1 games by the OSQ code, we
> would still end up with a "movl" just to do that upper bits clearing
> that the compiler doesn't know is unnecessary.
> 
> I don't think we have any reasonable way to tell the compiler that the
> register output of our xchg() inline asm has the upper 32 bits clear.

It could be done for a 32-bit unsigned xchg() - just make the return
type unsigned 64-bit.
But that won't work for the signed exchange - and 'atomic_t' is signed.
OTOH I'd guess this code could use 'unsigned int' instead of atomic_t?
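
Something along these lines is what I have in mind. This is an untested
sketch, not the kernel's xchg(): the name xchg_u32_ret64 is made up, it
assumes gcc or clang, and the non-x86-64 branch is only there so the
example builds elsewhere. On x86-64 a write to a 32-bit register clears
bits 63:32, so declaring the asm output as 64 bits lets the compiler
treat the whole register as the (zero-extended) result.

```c
#include <stdint.h>

/* Hypothetical helper: 32-bit unsigned exchange with a 64-bit result. */
static inline uint64_t xchg_u32_ret64(uint32_t *ptr, uint32_t val)
{
#if defined(__x86_64__)
	uint64_t ret = val;	/* input value, zero-extended */

	/*
	 * xchg with a memory operand is implicitly locked. %k0 names the
	 * 32-bit half of the register; writing it zeroes the upper half.
	 */
	__asm__ __volatile__("xchgl %k0, %1"
			     : "+r" (ret), "+m" (*ptr)
			     :
			     : "memory");
	return ret;
#else
	/* Portable fallback (compiler builtin); the conversion zero-extends. */
	return __atomic_exchange_n(ptr, val, __ATOMIC_SEQ_CST);
#endif
}
```

The returned value can then be used directly as a 64-bit array index
without the compiler needing an extra 'movl' to clear the high bits.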

	David
