Re: [PATCH] MIPS: Fix a longstanding error in div64.h

YunQiang Su <wzssyqa@xxxxxxxxx> · Fri, 9 Apr 2021 15:17:04 +0800

Huacai Chen <chenhuacai@xxxxxxxxxx> 于2021年4月8日周四 下午12:56写道：
>
> Hi, Maciej,
>
> On Wed, Apr 7, 2021 at 9:38 PM Maciej W. Rozycki <macro@xxxxxxxxxxx> wrote:
> >
> > On Wed, 7 Apr 2021, Huacai Chen wrote:
> >
> > > >  This code is rather broken in an obvious way, starting from:
> > > >
> > > >         unsigned long long __n;                                         \
> > > >                                                                         \
> > > >         __high = *__n >> 32;                                            \
> > > >         __low = __n;                                                    \
> > > >
> > > > where `__n' is used uninitialised.  Since this is my code originally I'll
> > > > look into it; we may want to reinstate `do_div' too, which didn't have to
> > > > be removed in the first place.
> > > I think we can reuse the generic do_div().
> >
> >  We can, but it's not clear to me if this is optimal.  We have a DIVMOD
> > instruction which original code took advantage of (although I can see
> > potential in reusing bits from include/asm-generic/div64.h).  The two
> > implementations would have to be benchmarked against each other across a
> > couple of different CPUs.
> The original MIPS do_div() has "h" constraint, and this is also the
> reason why Ralf rewrote this file. How can we reintroduce do_div()
> without "h" constraint?
>

I try to figure out a new version:

uint32_t __attribute__ ((noinline)) div64_32n(uint64_t *x, uint32_t b) {
        uint64_t a = *x;

        uint64_t t1 = ((a>>32)/b)<<32;
        uint32_t t2 = (a>>32)%b;

        uint32_t res = (uint32_t)a;
        uint32_t t1lo = 0;

        uint32_t t3 = 0xffffffffu/b;
        uint32_t t4 = t3*b;
        uint32_t hi, lo;

        while(t2>0) {
                __asm__ (
                        "multu %2, %3\n"
                        "mfhi %0\n"
                        "mflo %1\n"
                        : "=r" (hi), "=r"(lo)
                        : "r" (t4), "r"(t2)
                );

                // yes, we are sure that t2*t3 will not overflow
                t1lo += (t3*t2);
                t2 -= hi;
                if (lo > 0) {
                        t2 --; // we are sure that t2 > 0
                        lo = 0xffffffff - lo + 1;
                        unsigned tmp = lo + res;
                        // overflow
                        if (tmp < lo || tmp < res) {
                                t2 ++;
                        }
                        res = tmp;
                }
        }
        if (res >= b) {
                t1lo += (res/b);
                res = (res%b);
        }

        t1 += t1lo;
        *x = t1;
        return res;
}

With some test the performace: ((uint64_t)(-1))/3 with 0xfffff times

GCC: 5555555555555555, 0, seconds: 5
SYQ: 5555555555555555, 0, seconds: 4
KER: 5555555555555555, 0, seconds: 8
RAL: ffffffff, 2, seconds: 4

1. the MIPS current asm version cost 4s (and wrong result)
2. the simplest C code : a/b && a % b, cost 5s
3. the asm-generic version cost 8s.
4. my version cost 4s.

And the question is why asm-generic version exists
since it has bad performance than the code generated by GCC?

> Huacai
> >
> > > >  Huacai, thanks for your investigation!  Please be more careful in
> > > > verifying your future submissions however.
> > > Sorry, I thought there is only one bug in div64.h, but in fact there
> > > are three...
> >
> >  This just shows the verification you made was not good enough, hence my
> > observation.
> >
> >   Maciej

-- 
YunQiang Su