On Fri, Oct 26, 2018 at 6:04 AM Akira Yokosawa <akiyks@xxxxxxxxx> wrote:
>
> On 2018/10/26 0:17, Akira Yokosawa wrote:
> > On 2018/10/25 23:09, Junchang Wang wrote:
> >> On Thu, Oct 25, 2018 at 5:45 PM Paul E. McKenney <paulmck@xxxxxxxxxxxxx> wrote:
> >>>
> >>> On Thu, Oct 25, 2018 at 10:11:18AM +0800, Junchang Wang wrote:
> >>>> Hi Akira,
> >>>>
> >>>> Thanks for the mail. My understanding is that PPC uses LL/SC to
> >>>> emulate CAS by using a tiny loop. Unfortunately, the LL/SC loop itself
> >>>> could fail (due to, for example, context switches) even if *ptr equals
> >>>> old. In such a case, a CAS instruction actually should return
> >>>> success. I think this is what the term "spurious fail" describes. Here
> >>>> is a reference:
> >>>> http://liblfds.org/mediawiki/index.php?title=Article:CAS_and_LL/SC_Implementation_Details_by_Processor_family
> >>>
> >>> First, thank you both for your work on this! And yes, my cmpxchg() code
> >>> is clearly quite broken.
> >>>
> >>>> It seems that __atomic_compare_exchange_n() provides option "weak" for
> >>>> performance. I tested these two solutions and got the following
> >>>> results:
> >>>>
> >>>>                    1    4    8   16   32   64
> >>>> my patch (ns)     35   34   37   73  142  281
> >>>> strong (ns)       39   39   41   79  158  313
> >>>
> >>> So strong is a bit slower, correct?
> >>>
> >>>> I tested the performance of count_lim_atomic by varying the number of
> >>>> updaters (./count_lim_atomic N uperf) on an 8-core PPC server. The
> >>>> first row in the table is the result when my patch is used, and the
> >>>> second row is the result when the 4th argument of the function is set
> >>>> to false(0). It seems performance improves slightly if option "weak"
> >>>> is used. However, there is not as much of a performance boost as we
> >>>> expected. So your solution sounds good if safety is a major concern,
> >>>> because option "weak" seems risky to me :-)
> >>>>
> >>>> Another interesting observation is that the performance of the
> >>>> LL/SC-based CAS instruction deteriorates dramatically when the number
> >>>> of working threads exceeds the number of CPU cores.
> >>>
> >>> If weak is faster, would it make sense to return (~o), that is,
> >>> the bitwise complement of the expected argument, when the weak
> >>> __atomic_compare_exchange_n() fails? This would get the improved
> >>> performance (if I understand your results above) while correctly handling
> >>> the strange (but possible) case where o==n.
> >>>
> >>> Does that make sense, or am I missing something?
> >>
> >> Hi Paul and Akira,
> >>
> >> Yes, the weak version is faster. The solution looks good. But when I
> >> tried to use the following patch
> >>
> >> #define cmpxchg(ptr, o, n) \
> >> ({ \
> >>         typeof(*ptr) old = (o); \
> >>         (__atomic_compare_exchange_n(ptr, (void *)&old, (n), 1,
> >
> > You need a "\" at the end of the line above. (If it was not unintentionally
> > wrapped.)
> >
> > If it was wrapped by your mailer, which is troublesome in sending patches,
> > please refer to:
> >
> > https://www.kernel.org/doc/html/latest/process/email-clients.html.
> >
> >> __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST))? \
> >>         (o) : (~o); \
> >> })
> >>
> >> gcc complains about my use of the complement symbol:
> >>
> >> ../api.h:769:12: error: wrong type argument to bit-complement
> >>      (o) : (~o); \
> >>             ^
> >>
> >> Any suggestions?
> >
> > I don't see such an error if I add the "\" mentioned above.
> > Or do you use some strict error checking option of GCC?
>
> Ah, I see that the error is in compiling CodeSamples/advsync/q.c.
>
> The call site is:
>
> struct el *q_pop(struct queue *q)
> {
>         struct el *p;
>         struct el *pnext;
>
>         for (;;) {
>                 do {
>                         p = q->head;
>                         pnext = p->next;
>                         if (pnext == NULL)
>                                 return NULL;
>                 } while (cmpxchg(&q->head, p, pnext) != p);
>                 if (p != &q->dummy)
>                         return p;
>                 q_push(&q->dummy, q);
>                 if (q->head == &q->dummy)
>                         return NULL;
>         }
> }
>
> In this case, p and pnext are pointers, hence the error.
> Returning (o)+1 instead should be OK in this case.
>

Hi Akira and Paul,

Returning (o)+1 if the CAS primitive fails works fine. Thanks a lot!

> But now, "count_lim_atomic 3 hog" says:
>
> FAIL: only reached -1829 rather than 0
>
> on x86_64.

Hmm. No such error is observed on POWER8.

> Hmm...
>
> The strong version works both on x86_64 and POWER8.

Indeed (cry)! Thanks for checking this. I will take a look and work on
this. BTW, I found I'm not good at writing C macros (e.g., cmpxchg). Do
you know of a specification or document on writing C macro functions in
Linux?

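Just to make sure I understand the strong variant you tested: is it
simply the current macro with the weak argument of
__atomic_compare_exchange_n() changed from 1 to 0, so that returning
the actual old value is safe again? Something like this untested
sketch:

#define cmpxchg(ptr, o, n) \
({ \
        typeof(*ptr) _____actual = (o); \
        \
        /* weak == 0: the builtin cannot fail spuriously. */ \
        (void)__atomic_compare_exchange_n(ptr, (void *)&_____actual, (n), 0, \
                                          __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST); \
        _____actual; \
})

If so, _____actual differs from (o) exactly when the CAS really lost,
which is what callers such as the q_pop() loop above compare against.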

Thanks,
--Junchang

>
> Thanks, Akira
> >
> > Thanks, Akira
> >
> >>
> >> Thanks,
> >> --Junchang
> >>
> >>
> >>>
> >>> Thanx, Paul
> >>>
> >>>> Thanks,
> >>>> --Junchang
> >>>>
> >>>>
> >>>> On Thu, Oct 25, 2018 at 6:05 AM Akira Yokosawa <akiyks@xxxxxxxxx> wrote:
> >>>>>
> >>>>> On 2018/10/24 23:53:29 +0800, Junchang Wang wrote:
> >>>>>> Hi Akira and Paul,
> >>>>>>
> >>>>>> I checked the code today and the implementation of cmpxchg() looks
> >>>>>> very suspicious to me; the current cmpxchg() first executes
> >>>>>> __atomic_compare_exchange_n, and then checks whether the value stored
> >>>>>> in __actual (old) has been changed to decide if the CAS instruction
> >>>>>> has been successfully performed. However, I saw the *weak* field is
> >>>>>> set, which, as far as I know, means __atomic_compare_exchange_n could
> >>>>>> fail even if the value of *ptr is equal to __actual (old).
> >>>>>> Unfortunately, the current cmpxchg will treat this case as a success
> >>>>>> because the value of __actual (old) does not change.
> >>>>>
> >>>>> Thanks for looking into this!
> >>>>>
> >>>>> I also suspected the use of "weak" semantics of
> >>>>> __atomic_compare_exchange_n(), but did not quite understand what
> >>>>> "spurious fail" actually means. Your theory sounds plausible to me.
> >>>>>
> >>>>> I've suggested in a private email to Paul to modify the 4th argument
> >>>>> to false(0) as a workaround, which would prevent such "spurious fail".
> >>>>>
> >>>>> Both approaches look good to me. I'd defer to Paul on the choice.
> >>>>>
> >>>>> Thanks, Akira
> >>>>>
> >>>>>>
> >>>>>> This bug happens on both Power8 and ARMv8. It seems it affects
> >>>>>> architectures that use LL/SC to emulate CAS. The following patch helps
> >>>>>> solve this issue on my testbeds. Please take a look. Any thoughts?
> >>>>>>
> >>>>>> ---
> >>>>>>  CodeSamples/api-pthreads/api-gcc.h | 8 +++-----
> >>>>>>  1 file changed, 3 insertions(+), 5 deletions(-)
> >>>>>>
> >>>>>> diff --git a/CodeSamples/api-pthreads/api-gcc.h
> >>>>>> b/CodeSamples/api-pthreads/api-gcc.h
> >>>>>> index 1dd26ca..38a16c0 100644
> >>>>>> --- a/CodeSamples/api-pthreads/api-gcc.h
> >>>>>> +++ b/CodeSamples/api-pthreads/api-gcc.h
> >>>>>> @@ -166,11 +166,9 @@ struct __xchg_dummy {
> >>>>>>
> >>>>>>  #define cmpxchg(ptr, o, n) \
> >>>>>>  ({ \
> >>>>>> -        typeof(*ptr) _____actual = (o); \
> >>>>>> -        \
> >>>>>> -        (void)__atomic_compare_exchange_n(ptr, (void *)&_____actual, (n), 1, \
> >>>>>> -                        __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST); \
> >>>>>> -        _____actual; \
> >>>>>> +        typeof(*ptr) old = (o); \
> >>>>>> +        (__atomic_compare_exchange_n(ptr, (void *)&old, (n), 1,
> >>>>>> __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST))? \
> >>>>>> +        (o) : (n); \
> >>>>>>  })
> >>>>>>
> >>>>>>  static __inline__ int atomic_cmpxchg(atomic_t *v, int old, int new)
> >>>>>>
> >>>>>
> >>>>
> >>>
> >
>

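P.S. Paul's earlier point about the o==n corner case of my original
patch (quoted above) finally sank in for me. A made-up caller, not from
the tree, just to convince myself:

        unsigned long x = 1;    /* some other thread eventually stores 0 */

        /* Spin until x is observed to be 0, reinstalling 0. */
        while (cmpxchg(&x, 0, 0) != 0)
                continue;

With the "(o) : (n)" return above, a failed CAS returns (n), which here
equals (o), so the loop exits as if the CAS had succeeded even though x
was never 0. Returning the actual old value, or something guaranteed to
differ from (o) as discussed above, avoids that trap.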