On 2018/12/13 00:01:33 +0800, Junchang Wang wrote:
> On 12/11/18 11:42 PM, Akira Yokosawa wrote:
>> From 7e7c3a20d08831cd64b77a4e8d8f693b4725ef89 Mon Sep 17 00:00:00 2001
>> From: Akira Yokosawa <akiyks@xxxxxxxxx>
>> Date: Tue, 11 Dec 2018 21:37:11 +0900
>> Subject: [PATCH 3/4] CodeSamples: Fix definition of cmpxchg() in api-gcc.h
>>
>> Do the same change as CodeSamples/formal/litmus/api.h.
>>
>> Signed-off-by: Akira Yokosawa <akiyks@xxxxxxxxx>
>> ---
>>  CodeSamples/api-pthreads/api-gcc.h | 5 +++--
>>  1 file changed, 3 insertions(+), 2 deletions(-)
>>
>> diff --git a/CodeSamples/api-pthreads/api-gcc.h b/CodeSamples/api-pthreads/api-gcc.h
>> index 3afe340..b66f4b9 100644
>> --- a/CodeSamples/api-pthreads/api-gcc.h
>> +++ b/CodeSamples/api-pthreads/api-gcc.h
>> @@ -168,8 +168,9 @@ struct __xchg_dummy {
>>  ({ \
>>  	typeof(*ptr) _____actual = (o); \
>>  \
>> -	__atomic_compare_exchange_n(ptr, (void *)&_____actual, (n), 1, \
>> -			__ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST) ? (o) : (o)+1; \
>> +	__atomic_compare_exchange_n((ptr), (void *)&_____actual, (n), 0, \
>> +			__ATOMIC_SEQ_CST, __ATOMIC_RELAXED); \
>> +	_____actual; \
>>  })
>>
>
> Hi Akira,
>
> Another reason that the performance of cmpxchg is catching up with
> cmpxchg_weak is that __ATOMIC_SEQ_CST is replaced by __ATOMIC_RELAXED
> in this patch.  The use of __ATOMIC_RELAXED means that if the CAS
> primitive fails, relaxed semantics are used rather than sequential
> consistency.  Following are some experimental results:
>
> # If __ATOMIC_RELAXED is used for both cmpxchg and cmpxchg_weak
>
> ./count_lim_atomic 64 uperf
> ns/update: 290
>
> ./count_lim_atomic_weak 64 uperf
> ns/update: 301
>
> # and then if __ATOMIC_SEQ_CST is used for both cmpxchg and cmpxchg_weak
>
> ./count_lim_atomic 64 uperf
> ns/update: 316
>
> ./count_lim_atomic_weak 64 uperf
> ns/update: 302
>
> ./count_lim_atomic 120 uperf
> ns/update: 630
>
> ./count_lim_atomic_weak 120 uperf
> ns/update: 568
>
> The results show that if we want to ensure sequential consistency when
> the CAS primitive fails, cmpxchg_weak performs better than cmpxchg.  It
> seems that the (CAS variant, failure_memorder) pair affects performance.
> I know that PPC uses LL/SC to emulate CAS, but what is the relationship
> between an LL/SC-based CAS and the failure memory order?  This is
> interesting because, as far as I know, PPC and ARM use LL/SC to emulate
> atomic primitives such as CAS and FAA, so FAA might show the same
> behavior.
>
> Actually, I'm not very clear about the meaning of the different failure
> memory orders.  For example, when should we use __ATOMIC_RELAXED rather
> than __ATOMIC_SEQ_CST when a CAS fails?  What happens if
> __ATOMIC_RELAXED is used on x86?  The page I'm looking at is
> https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html .
> Do you know of some resources about this?  I can look into this
> tomorrow.  Thanks.

Hi Junchang,

In Linux-kernel speak, Documentation/core-api/atomic.rst says:

--------------------------------------------------------------------------
atomic_xchg must provide explicit memory barriers around the operation. ::

	int atomic_cmpxchg(atomic_t *v, int old, int new);

This performs an atomic compare exchange operation on the atomic value
v, with the given old and new values.  Like all atomic_xxx operations,
atomic_cmpxchg will only satisfy its atomicity semantics as long as all
other accesses of *v are performed through atomic_xxx operations.

atomic_cmpxchg must provide explicit memory barriers around the
operation, although if the comparison fails then no memory ordering
guarantees are required.

[snip]

The routines xchg() and cmpxchg() must provide the same exact
memory-barrier semantics as the atomic and bit operations returning
values.
--------------------------------------------------------------------------

The 2nd __ATOMIC_RELAXED passed to __atomic_compare_exchange_n(), that
is, the failure memory order, matches this lack of requirement.
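To see why, consider the usual CAS retry loop.  The following is a
minimal sketch (ctr and atomic_add_ctr are names made up for
illustration; this is not code from the CodeSamples tree):

unsigned long ctr;

void atomic_add_ctr(unsigned long delta)
{
	unsigned long old = __atomic_load_n(&ctr, __ATOMIC_RELAXED);

	/* A failed attempt only refreshes "old" with the value that
	 * was actually found, and the loop retries, so nothing needs
	 * to be ordered against the failure.  Ordering is required
	 * only around the attempt that finally succeeds. */
	while (!__atomic_compare_exchange_n(&ctr, &old, old + delta,
					    1 /* weak */, __ATOMIC_SEQ_CST,
					    __ATOMIC_RELAXED))
		continue;
}

A stronger failure memory order would matter only if the code following
a failed CAS relied on its implied barrier, which a retry loop like the
above does not.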
On x86_64, __atomic_compare_exchange_n() is translated to the same code
in both cases (obtained with the help of litmus7's cross-compiling):

#START _litmus_P1
	xorl	%eax, %eax
	movl	$0, 4(%rsp)
	lock cmpxchgl	%r10d, (%rdx)
	je	.L36
	movl	%eax, 4(%rsp)
.L36:
	movl	4(%rsp), %eax

There is no difference between the code with __ATOMIC_RELAXED and the
code with __ATOMIC_SEQ_CST as the failure memory order.  As you can see,
no memory-barrier instruction is emitted.

On PPC, there is a difference.  With __ATOMIC_RELAXED as the failure
memory order, the code looks like:

#START _litmus_P1
	sync
.L34:
	lwarx	7,0,9
	cmpwi	0,7,0
	bne	0,.L35
	stwcx.	5,0,9
	bne	0,.L34
	isync
.L35:

, OTOH with __ATOMIC_SEQ_CST:

#START _litmus_P1
	sync
.L34:
	lwarx	7,0,9
	cmpwi	0,7,0
	bne	0,.L35
	stwcx.	5,0,9
	bne	0,.L34
.L35:
	isync

Note the difference in the position of label .L35: with __ATOMIC_SEQ_CST
the failure path still executes the trailing isync, whereas with
__ATOMIC_RELAXED it branches past the isync.  (Note that we are talking
about the strong version of cmpxchg().)

Does the above example make sense to you?

        Thanks, Akira
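P.S.  If you want to reproduce the comparison without litmus7, a minimal
sketch along the following lines (cmpxchg_test is just an illustrative
name) can be fed to "gcc -O2 -S" on each architecture:

int cmpxchg_test(int *p, int old, int new)
{
	int actual = old;

	/* Strong CAS as in the patched cmpxchg(): sequentially
	 * consistent on success, relaxed on failure. */
	__atomic_compare_exchange_n(p, &actual, new, 0,
				    __ATOMIC_SEQ_CST, __ATOMIC_RELAXED);
	return actual;
}

Changing the final __ATOMIC_RELAXED to __ATOMIC_SEQ_CST and diffing the
two .s files should show the isync placement difference on PPC and no
change on x86_64.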