On 2018/10/26 10:12:19 +0900, Akira Yokosawa wrote:
> (from mobile, might be QP encoded)
>
> 2018/10/26 7:04、Akira Yokosawa <akiyks@xxxxxxxxx> wrote:
>
>>> On 2018/10/26 0:17, Akira Yokosawa wrote:
>>>> On 2018/10/25 23:09, Junchang Wang wrote:
>>>>> On Thu, Oct 25, 2018 at 5:45 PM Paul E. McKenney <paulmck@xxxxxxxxxxxxx> wrote:
>>>>>
>>>>>> On Thu, Oct 25, 2018 at 10:11:18AM +0800, Junchang Wang wrote:
>>>>>> Hi Akira,
>>>>>>
>>>>>> Thanks for the mail. My understanding is that PPC uses LL/SC to
>>>>>> emulate CAS with a tiny loop. Unfortunately, the LL/SC loop itself
>>>>>> could fail (due to, for example, context switches) even if *ptr equals
>>>>>> old. In such a case, an actual CAS instruction would return
>>>>>> success. I think this is what the term "spurious fail" describes. Here
>>>>>> is a reference:
>>>>>> http://liblfds.org/mediawiki/index.php?title=Article:CAS_and_LL/SC_Implementation_Details_by_Processor_family
>>>>>
>>>>> First, thank you both for your work on this! And yes, my cmpxchg() code
>>>>> is clearly quite broken.
>>>>>
>>>>>> It seems that __atomic_compare_exchange_n() provides option "weak" for
>>>>>> performance. I tested these two solutions and got the following
>>>>>> results:
>>>>>>
>>>>>>                  1    4    8   16   32   64
>>>>>> my patch (ns)   35   34   37   73  142  281
>>>>>> strong (ns)     39   39   41   79  158  313
>>>>>
>>>>> So strong is a bit slower, correct?
>>>>>
>>>>>> I tested the performance of count_lim_atomic by varying the number of
>>>>>> updaters (./count_lim_atomic N uperf) on an 8-core PPC server. The
>>>>>> first row in the table is the result when my patch is used, and the
>>>>>> second row is the result when the 4th argument of the function is set
>>>>>> to false(0). It seems performance improves slightly if option "weak"
>>>>>> is used. However, there is no performance boost as we expected.
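The "weak"/"strong" distinction being benchmarked above can be illustrated with a small standalone sketch (not code from the repository; the helper names are made up for illustration). With weak = 0 the builtin absorbs LL/SC spurious failures itself; with weak = 1 the caller must supply the retry loop:

```c
#include <assert.h>

/* Strong CAS: with weak = 0, __atomic_compare_exchange_n() returns 0
 * only if *ptr genuinely differed from old; any spurious LL/SC
 * failures on PPC/ARM are retried inside the builtin itself. */
static int strong_cas(int *ptr, int old, int new)
{
	int expected = old;

	return __atomic_compare_exchange_n(ptr, &expected, new, 0,
					   __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
}

/* Weak CAS: with weak = 1, a return of 0 might be spurious, so the
 * caller retries as long as the value read back still equals old. */
static int weak_cas(int *ptr, int old, int new)
{
	int expected;

	do {
		expected = old;
		if (__atomic_compare_exchange_n(ptr, &expected, new, 1,
						__ATOMIC_SEQ_CST,
						__ATOMIC_SEQ_CST))
			return 1;	/* Store succeeded. */
	} while (expected == old);	/* Spurious failure: retry. */
	return 0;			/* Genuine mismatch. */
}
```

On failure the builtin writes the current *ptr back into expected, which is how the retry loop distinguishes a spurious failure (expected still equals old) from a genuine mismatch.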
>>>>>> So your solution sounds good if safety is one major concern, because
>>>>>> option "weak" seems risky to me :-)
>>>>>>
>>>>>> Another interesting observation is that the performance of the LL/SC-based
>>>>>> CAS instruction deteriorates dramatically when the number of working
>>>>>> threads exceeds the number of CPU cores.
>>>>>
>>>>> If weak is faster, would it make sense to return (~o), that is,
>>>>> the bitwise complement of the expected argument, when the weak
>>>>> __atomic_compare_exchange_n() fails? This would get the improved
>>>>> performance (if I understand your results above) while correctly handling
>>>>> the strange (but possible) case where o==n.
>>>>>
>>>>> Does that make sense, or am I missing something?
>>>>
>>>> Hi Paul and Akira,
>>>>
>>>> Yes, the weak version is faster. The solution looks good. But when I
>>>> tried to use the following patch
>>>>
>>>> #define cmpxchg(ptr, o, n) \
>>>> ({ \
>>>> 	typeof(*ptr) old = (o); \
>>>> 	(__atomic_compare_exchange_n(ptr, (void *)&old, (n), 1,
>>>
>>> You need a "\" at the end of the line above. (If it was not unintentionally
>>> wrapped.)
>>>
>>> If it was wrapped by your mailer, which is troublesome in sending patches,
>>> please refer to:
>>>
>>> https://www.kernel.org/doc/html/latest/process/email-clients.html
>>>
>>>> 		__ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST))? \
>>>> 		(o) : (~o); \
>>>> })
>>>>
>>>> gcc complains about my use of the complement symbol:
>>>>
>>>> ../api.h:769:12: error: wrong type argument to bit-complement
>>>>      (o) : (~o); \
>>>>             ^
>>>>
>>>> Any suggestions?
>>>
>>> I don't see such an error if I add the "\" mentioned above.
>>> Or do you use some strict error-checking option of GCC?
>>
>> Ah, I see that the error is in compiling CodeSamples/advsync/q.c.
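Paul's o==n corner case can be made concrete with a small sketch (hypothetical helper names, not code from the repository). Callers of cmpxchg() detect success by comparing the return value against o, so the failure return must be guaranteed to differ from o:

```c
#include <assert.h>

/* Returning (n) on failure is ambiguous: if the caller passes o == n,
 * a failed CAS returns a value equal to o and looks like a success. */
static int cas_ret_n(int *ptr, int o, int n)
{
	int expected = o;

	return __atomic_compare_exchange_n(ptr, &expected, n, 0,
					   __ATOMIC_SEQ_CST,
					   __ATOMIC_SEQ_CST) ? o : n;
}

/* Returning (~o) on failure always differs from o, so even the
 * o == n case is reported correctly. */
static int cas_ret_complement(int *ptr, int o, int n)
{
	int expected = o;

	return __atomic_compare_exchange_n(ptr, &expected, n, 0,
					   __ATOMIC_SEQ_CST,
					   __ATOMIC_SEQ_CST) ? o : ~o;
}
```

With *ptr == 5, cas_ret_n(ptr, 7, 7) fails yet returns 7 == o, a bogus "success"; cas_ret_complement(ptr, 7, 7) returns ~7, which is unambiguously a failure.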
>>
>> The call site is:
>>
>> struct el *q_pop(struct queue *q)
>> {
>> 	struct el *p;
>> 	struct el *pnext;
>>
>> 	for (;;) {
>> 		do {
>> 			p = q->head;
>> 			pnext = p->next;
>> 			if (pnext == NULL)
>> 				return NULL;
>> 		} while (cmpxchg(&q->head, p, pnext) != p);
>> 		if (p != &q->dummy)
>> 			return p;
>> 		q_push(&q->dummy, q);
>> 		if (q->head == &q->dummy)
>> 			return NULL;
>> 	}
>> }
>>
>> In this case, p and pnext are pointers, hence the error.
>> Returning (o)+1 instead should be OK in this case.
>>
>> But now, "count_lim_atomic 3 hog" says:
>>
>> FAIL: only reached -1829 rather than 0
>>
>> on x86_64. Hmm. No such error is observed on POWER8.
>> Hmm...
>
> I tried the same source with (o)+1 on another x86_64 host (Ubuntu 16.04).
>
> The count_lim_atomic hog test runs flawlessly.
>
> The host I tried earlier was a bit old SB laptop. I'll try there again later.

And I _did_ make a stupid typo there. I tried with "(o+1)", not "(o)+1".
Apologies...

So all clear for Junchang's patch with Paul's minor tweak.

FWIW, the diff of my version looks like the following:

diff --git a/CodeSamples/api-pthreads/api-gcc.h b/CodeSamples/api-pthreads/api-gcc.h
index 1dd26ca..2e65998 100644
--- a/CodeSamples/api-pthreads/api-gcc.h
+++ b/CodeSamples/api-pthreads/api-gcc.h
@@ -168,9 +168,8 @@ struct __xchg_dummy {
 ({ \
 	typeof(*ptr) _____actual = (o); \
 \
-	(void)__atomic_compare_exchange_n(ptr, (void *)&_____actual, (n), 1, \
-				__ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST); \
-	_____actual; \
+	__atomic_compare_exchange_n(ptr, (void *)&_____actual, (n), 1, \
+				__ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST) ? (o) : (o)+1; \
 })

 static __inline__ int atomic_cmpxchg(atomic_t *v, int old, int new)

---
Thanks, Akira

>
> Thanks, Akira
>
>>
>> The strong version works both on x86_64 and POWER8.
>>
>> Thanks, Akira
>>
>>>
>>> Thanks, Akira
>>>
>>>>
>>>> Thanks,
>>>> --Junchang
>>>>
>>>>
>>>>>
>>>>> 							Thanx, Paul
>>>>>
>>>>>> Thanks,
>>>>>> --Junchang
>>>>>>
>>>>>>
>>>>>>> On Thu, Oct 25, 2018 at 6:05 AM Akira Yokosawa <akiyks@xxxxxxxxx> wrote:
>>>>>>>
>>>>>>>> On 2018/10/24 23:53:29 +0800, Junchang Wang wrote:
>>>>>>>> Hi Akira and Paul,
>>>>>>>>
>>>>>>>> I checked the code today, and the implementation of cmpxchg() looks
>>>>>>>> very suspicious to me; the current cmpxchg() first executes function
>>>>>>>> __atomic_compare_exchange_n, and then checks whether the value stored
>>>>>>>> in field __actual (old) has been changed to decide if the CAS
>>>>>>>> instruction has been successfully performed. However, I saw the *weak*
>>>>>>>> field is set, which, as far as I know, means
>>>>>>>> __atomic_compare_exchange_n could fail even if the value of *ptr is
>>>>>>>> equal to __actual (old). Unfortunately, the current cmpxchg will treat
>>>>>>>> this case as a success because the value of __actual (old) does not
>>>>>>>> change.
>>>>>>>
>>>>>>> Thanks for looking into this!
>>>>>>>
>>>>>>> I also suspected the use of "weak" semantics of
>>>>>>> __atomic_compare_exchange_n(), but did not quite understand what
>>>>>>> "spurious fail" actually means. Your theory sounds plausible to me.
>>>>>>>
>>>>>>> I've suggested in a private email to Paul to modify the 4th argument
>>>>>>> to false(0) as a workaround, which would prevent such "spurious fails".
>>>>>>>
>>>>>>> Both approaches look good to me. I'd defer to Paul on the choice.
>>>>>>>
>>>>>>> Thanks, Akira
>>>>>>>
>>>>>>>>
>>>>>>>> This bug happens on both Power8 and ARMv8. It seems it affects
>>>>>>>> architectures that use LL/SC to emulate CAS. The following patch helps
>>>>>>>> solve this issue on my testbeds. Please take a look. Any thoughts?
>>>>>>>>
>>>>>>>> ---
>>>>>>>>  CodeSamples/api-pthreads/api-gcc.h | 8 +++-----
>>>>>>>>  1 file changed, 3 insertions(+), 5 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/CodeSamples/api-pthreads/api-gcc.h
>>>>>>>> b/CodeSamples/api-pthreads/api-gcc.h
>>>>>>>> index 1dd26ca..38a16c0 100644
>>>>>>>> --- a/CodeSamples/api-pthreads/api-gcc.h
>>>>>>>> +++ b/CodeSamples/api-pthreads/api-gcc.h
>>>>>>>> @@ -166,11 +166,9 @@ struct __xchg_dummy {
>>>>>>>>
>>>>>>>>  #define cmpxchg(ptr, o, n) \
>>>>>>>>  ({ \
>>>>>>>> -	typeof(*ptr) _____actual = (o); \
>>>>>>>> -	\
>>>>>>>> -	(void)__atomic_compare_exchange_n(ptr, (void *)&_____actual, (n), 1, \
>>>>>>>> -			__ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST); \
>>>>>>>> -	_____actual; \
>>>>>>>> +	typeof(*ptr) old = (o); \
>>>>>>>> +	(__atomic_compare_exchange_n(ptr, (void *)&old, (n), 1,
>>>>>>>> __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST))? \
>>>>>>>> +	(o) : (n); \
>>>>>>>>  })
>>>>>>>>
>>>>>>>>  static __inline__ int atomic_cmpxchg(atomic_t *v, int old, int new)
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>
>>