On Mon, Feb 11, 2019 at 11:35:24AM -0500, Waiman Long wrote: > On 02/11/2019 06:58 AM, Peter Zijlstra wrote: > > Which is clearly worse. Now we can write that as: > > > > int __down_read_trylock2(unsigned long *l) > > { > > long tmp = READ_ONCE(*l); > > > > while (tmp >= 0) { > > if (try_cmpxchg(l, &tmp, tmp + 1)) > > return 1; > > } > > > > return 0; > > } > > > > which generates: > > > > 0000000000000030 <__down_read_trylock2>: > > 30: 48 8b 07 mov (%rdi),%rax > > 33: 48 85 c0 test %rax,%rax > > 36: 78 18 js 50 <__down_read_trylock2+0x20> > > 38: 48 8d 50 01 lea 0x1(%rax),%rdx > > 3c: f0 48 0f b1 17 lock cmpxchg %rdx,(%rdi) > > 41: 75 f0 jne 33 <__down_read_trylock2+0x3> > > 43: b8 01 00 00 00 mov $0x1,%eax > > 48: c3 retq > > 49: 0f 1f 80 00 00 00 00 nopl 0x0(%rax) > > 50: 31 c0 xor %eax,%eax > > 52: c3 retq > > > > Which is a lot better; but not quite there yet. > > > > > > I've tried quite a bit, but I can't seem to get GCC to generate the: > > > > add $1,%rdx > > jle > > > > required; stuff like: > > > > new = old + 1; > > if (new <= 0) > > > > generates: > > > > lea 0x1(%rax),%rdx > > test %rdx, %rdx > > jle > > Thanks for the suggested code snippet. So you want to replace "lea > 0x1(%rax), %rdx" by "add $1,%rdx"? > > I think the compiler is doing that so as to use the address generation > unit for addition instead of using the ALU. That will leave the ALU > available for doing other arithmetic operation in parallel. I don't > think it is a good idea to override the compiler and force it to use > ALU. So I am not going to try doing that. It is only 1 or 2 more of > codes anyway. Yeah, I was trying to see what I could make it do.. #2 really should be good enough, but you know how it is once you're poking at it :-)