On Sat, Dec 15, 2018 at 10:58 PM Akira Yokosawa <akiyks@xxxxxxxxx> wrote:
>
> On 2018/12/14 22:32:22 +0800, Junchang Wang wrote:
> > On Thu, Dec 13, 2018 at 11:33 PM Akira Yokosawa <akiyks@xxxxxxxxx> wrote:
> >>
> >> On 2018/12/13 00:01:33 +0800, Junchang Wang wrote:
> >>> On 12/11/18 11:42 PM, Akira Yokosawa wrote:
> >>>> From 7e7c3a20d08831cd64b77a4e8d8f693b4725ef89 Mon Sep 17 00:00:00 2001
> >>>> From: Akira Yokosawa <akiyks@xxxxxxxxx>
> >>>> Date: Tue, 11 Dec 2018 21:37:11 +0900
> >>>> Subject: [PATCH 3/4] CodeSamples: Fix definition of cmpxchg() in api-gcc.h
> >>>>
> >>>> Do the same change as CodeSamples/formal/litmus/api.h.
> >>>>
> >>>> Signed-off-by: Akira Yokosawa <akiyks@xxxxxxxxx>
> >>>> ---
> >>>>  CodeSamples/api-pthreads/api-gcc.h | 5 +++--
> >>>>  1 file changed, 3 insertions(+), 2 deletions(-)
> >>>>
> >>>> diff --git a/CodeSamples/api-pthreads/api-gcc.h b/CodeSamples/api-pthreads/api-gcc.h
> >>>> index 3afe340..b66f4b9 100644
> >>>> --- a/CodeSamples/api-pthreads/api-gcc.h
> >>>> +++ b/CodeSamples/api-pthreads/api-gcc.h
> >>>> @@ -168,8 +168,9 @@ struct __xchg_dummy {
> >>>>  ({ \
> >>>>  	typeof(*ptr) _____actual = (o); \
> >>>>  \
> >>>> -	__atomic_compare_exchange_n(ptr, (void *)&_____actual, (n), 1, \
> >>>> -		__ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST) ? (o) : (o)+1; \
> >>>> +	__atomic_compare_exchange_n((ptr), (void *)&_____actual, (n), 0, \
> >>>> +		__ATOMIC_SEQ_CST, __ATOMIC_RELAXED); \
> >>>> +	_____actual; \
> >>>>  })
> >>>>
> >>>
> >>> Hi Akira,
> >>>
> >>> Another reason that the performance of cmpxchg is catching up with
> >>> cmpxchg_weak is that __ATOMIC_SEQ_CST is replaced by __ATOMIC_RELAXED
> >>> in this patch.  Using __ATOMIC_RELAXED means that if the CAS primitive
> >>> fails, only relaxed ordering is provided rather than sequentially
> >>> consistent ordering.  Here are some experimental results:
> >>>
> >>> # If __ATOMIC_RELAXED is used for both cmpxchg and cmpxchg_weak
> >>>
> >>> ./count_lim_atomic 64 uperf
> >>> ns/update: 290
> >>>
> >>> ./count_lim_atomic_weak 64 uperf
> >>> ns/update: 301
> >>>
> >>> # and then if __ATOMIC_SEQ_CST is used for both cmpxchg and cmpxchg_weak
> >>>
> >>> ./count_lim_atomic 64 uperf
> >>> ns/update: 316
> >>>
> >>> ./count_lim_atomic_weak 64 uperf
> >>> ns/update: 302
> >>>
> >>> ./count_lim_atomic 120 uperf
> >>> ns/update: 630
> >>>
> >>> ./count_lim_atomic_weak 120 uperf
> >>> ns/update: 568
> >>>
> >>> The results show that if we want to ensure sequential consistency when
> >>> the CAS primitive fails, cmpxchg_weak performs better than cmpxchg.
> >>> It seems that the combination of variant (strong or weak) and failure
> >>> memory order affects performance.  I know that PPC uses LL/SC to
> >>> emulate CAS, but what is the relationship between an emulated CAS and
> >>> the memory order?  This is interesting because, as far as I know, PPC
> >>> and ARM use LL/SC to emulate atomic primitives such as CAS and FAA, so
> >>> FAA might show the same behavior.
> >>>
> >>> Actually, I'm not very clear about the meaning of the different failure
> >>> memory orders.  For example, when should we use __ATOMIC_RELAXED rather
> >>> than __ATOMIC_SEQ_CST if a CAS fails?  What happens if __ATOMIC_RELAXED
> >>> is used on x86?  The page I'm looking at is
> >>> https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html .
> >>> Do you know of some resources about this?  I can look into this
> >>> tomorrow.  Thanks.
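As a reference point while digging into the failure-order question, below is a
minimal, self-contained sketch of how the strong and weak forms map onto
__atomic_compare_exchange_n.  The cas_strong()/cas_weak() names are
illustrative only (they are not the api-gcc.h macros); the fourth argument
selects the weak form, and the last argument is the failure memory order:

#include <stdbool.h>

/* Strong CAS: on LL/SC machines the retry loop is inside the builtin,
 * so it "fails" only when *ptr != old.  Returns the value observed in
 * *ptr, mirroring the Linux-kernel cmpxchg() convention. */
static inline int cas_strong(int *ptr, int old, int new)
{
	int actual = old;

	__atomic_compare_exchange_n(ptr, &actual, new, false /* strong */,
				    __ATOMIC_SEQ_CST,  /* success order */
				    __ATOMIC_RELAXED); /* failure order */
	return actual;
}

/* Weak CAS: maps to a single LL/SC attempt and may fail spuriously,
 * so it returns a success flag and refreshes *old on failure. */
static inline bool cas_weak(int *ptr, int *old, int new)
{
	return __atomic_compare_exchange_n(ptr, old, new, true /* weak */,
					   __ATOMIC_SEQ_CST,
					   __ATOMIC_RELAXED);
}

On x86 the choice of failure order makes no difference in the generated code,
as the lock cmpxchgl listing below shows; the distinction only matters on
weakly ordered machines such as PPC.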
> >>
> >> Hi Junchang,
> >>
> >> In Linux-kernel speak, Documentation/core-api/atomic.rst says:
> >>
> >> --------------------------------------------------------------------------
> >> atomic_xchg must provide explicit memory barriers around the operation. ::
> >>
> >>   int atomic_cmpxchg(atomic_t *v, int old, int new);
> >>
> >> This performs an atomic compare exchange operation on the atomic value v,
> >> with the given old and new values. Like all atomic_xxx operations,
> >> atomic_cmpxchg will only satisfy its atomicity semantics as long as all
> >> other accesses of \*v are performed through atomic_xxx operations.
> >>
> >> atomic_cmpxchg must provide explicit memory barriers around the operation,
> >> although if the comparison fails then no memory ordering guarantees are
> >> required.
> >
> > Hi Akira,
> >
> > Thanks for the link, which is helpful.
> >
> >>
> >> [snip]
> >>
> >> The routines xchg() and cmpxchg() must provide the same exact
> >> memory-barrier semantics as the atomic and bit operations returning
> >> values.
> >> --------------------------------------------------------------------------
> >>
> >> The __ATOMIC_RELAXED passed as the second (failure) memory order to
> >> __atomic_compare_exchange_n() matches this lack of requirement.
> >>
> >> On x86_64, __atomic_compare_exchange_n() is translated to the same code
> >> in both cases (with the help of litmus7's cross compiling):
> >>
> >> #START _litmus_P1
> >> 	xorl %eax, %eax
> >> 	movl $0, 4(%rsp)
> >> 	lock cmpxchgl %r10d, (%rdx)
> >> 	je .L36
> >> 	movl %eax, 4(%rsp)
> >> .L36:
> >> 	movl 4(%rsp), %eax
> >>
> >> There is no difference between the code with __ATOMIC_RELAXED and the
> >> code with __ATOMIC_SEQ_CST as the failure memory order.  As you can see,
> >> no memory-barrier instruction is emitted.
> >
> > My understanding is that x86 uses the TSO memory model, so it is
> > unnecessary to add extra barriers.  Is that right?
>
> I think so.
>
> >
> >>
> >> On PPC, there is a difference.  With __ATOMIC_RELAXED as the failure
> >> memory order, the code looks like:
> >>
> >> #START _litmus_P1
> >> 	sync
> >> .L34:
> >> 	lwarx 7,0,9
> >> 	cmpwi 0,7,0
> >> 	bne 0,.L35
> >> 	stwcx. 5,0,9
> >> 	bne 0,.L34
> >> 	isync
> >> .L35:
> >>
> >> OTOH, with __ATOMIC_SEQ_CST as the failure memory order:
> >>
> >> #START _litmus_P1
> >> 	sync
> >> .L34:
> >> 	lwarx 7,0,9
> >> 	cmpwi 0,7,0
> >> 	bne 0,.L35
> >> 	stwcx. 5,0,9
> >> 	bne 0,.L34
> >> .L35:
> >> 	isync
> >>
> >> See the difference in the position of label .L35: with __ATOMIC_RELAXED
> >> the failure path branches past the isync, whereas with __ATOMIC_SEQ_CST
> >> the isync is executed even when the comparison fails.  (Note that we are
> >> talking about the strong version of cmpxchg().)
> >>
> >> Does the above example make sense to you?
> >>
> >
> > Yes, it makes sense.
> >
> > Out of curiosity, I checked the assembly code of the weak atomic_cmpxchg
> > (the fourth argument set to 1) with __ATOMIC_SEQ_CST.  The code is shown
> > below:
> >
> > #START _litmus_P3
> > 	sync
> > 	lwarx 9,0,8
> > 	cmpwi 0,9,1
> > 	bne 0,.L34
> > 	stwcx. 4,0,8
> > .L34:
> >
> > The code shows that the weak atomic_cmpxchg fails either because (1) the
> > content of *ptr is not equal to argument old, or (2) the store operation
> > fails because this thread loses the reservation for the memory location
> > referenced by ptr.  In contrast, the strong atomic_cmpxchg (its assembly
> > code is shown below) contains a retry loop and fails only if *ptr is not
> > equal to argument old.  So the weak atomic_cmpxchg can fail "spuriously",
> > and it is the caller's responsibility to retry it.
> >
> > #START _litmus_P1
> > 	sync
> > .L34:
> > 	lwarx 7,0,9
> > 	cmpwi 0,7,0
> > 	bne 0,.L35
> > 	stwcx. 5,0,9
> > 	bne 0,.L34
> > .L35:
> > 	isync
> >
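To make the "caller's responsibility to retry it" point concrete, a weak CAS
is normally wrapped in a loop that re-reads the expected value after every
failure, spurious or not.  A hypothetical bounded-increment helper
(illustrative only, not code from count_lim_atomic.c) might look like:

#include <stdbool.h>

/* Add "delta" to *ctr unless the result would exceed "max".  Returns
 * false if the bound would be exceeded; retries after any CAS failure,
 * including the spurious failures the weak form is allowed to report. */
static inline bool atomic_add_bounded(unsigned long *ctr,
				      unsigned long delta,
				      unsigned long max)
{
	unsigned long old = __atomic_load_n(ctr, __ATOMIC_RELAXED);

	do {
		if (old + delta > max)
			return false;
		/* On failure, "old" is refreshed with the current value
		 * of *ctr, so the bound check is redone with fresh data. */
	} while (!__atomic_compare_exchange_n(ctr, &old, old + delta,
					      true /* weak */,
					      __ATOMIC_SEQ_CST,
					      __ATOMIC_RELAXED));
	return true;
}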
> > So for the performance comparison, my hypothesis is that the weak
> > version may perform slightly better at high levels of contention,
> > because after a spurious failure atomic_cmpxchg returns immediately,
> > which gives other threads a chance to successfully perform their
> > atomic_cmpxchg instructions.  For other cases, the strong
> > atomic_cmpxchg works pretty well because it avoids rebuilding
> > arguments old and new.  Does that make sense?
>
> Well, as atomic_cmpxchg() is supposed to be inlined,
> it is not easy to tell how the code of count_lim_atomic.c will be
> optimized in the end.
> The performance can vary depending on the compiler version.

Got it. Thanks a lot :-).

--Junchang

> The point of this sample code is that it scales much better than
> count_atomic.c.  In the end, we need to minimize the contention
> of atomic accesses, don't we?
>
>         Thanks, Akira
>
> >
> > Thanks,
> > --Junchang
> >
> >>         Thanks, Akira
> >>
> >>>
> >>> --Junchang
> >>>
> >>>> static __inline__ int atomic_cmpxchg(atomic_t *v, int old, int new)
> >>>>