Huacai Chen <chenhuacai@xxxxxxxxxx> wrote on Thu, Apr 8, 2021 at 12:56 PM:
>
> Hi, Maciej,
>
> On Wed, Apr 7, 2021 at 9:38 PM Maciej W. Rozycki <macro@xxxxxxxxxxx> wrote:
> >
> > On Wed, 7 Apr 2021, Huacai Chen wrote:
> >
> > > > This code is rather broken in an obvious way, starting from:
> > > >
> > > >     unsigned long long __n;     \
> > > >                                 \
> > > >     __high = *__n >> 32;        \
> > > >     __low = __n;                \
> > > >
> > > > where `__n' is used uninitialised. Since this is my code originally I'll
> > > > look into it; we may want to reinstate `do_div' too, which didn't have to
> > > > be removed in the first place.
> > > I think we can reuse the generic do_div().
> >
> > We can, but it's not clear to me if this is optimal. We have a DIVMOD
> > instruction which original code took advantage of (although I can see
> > potential in reusing bits from include/asm-generic/div64.h). The two
> > implementations would have to be benchmarked against each other across a
> > couple of different CPUs.
> The original MIPS do_div() has "h" constraint, and this is also the
> reason why Ralf rewrote this file. How can we reintroduce do_div()
> without "h" constraint?
>

I tried to work out a new version:

    uint32_t __attribute__ ((noinline)) div64_32n(uint64_t *x, uint32_t b)
    {
        uint64_t a = *x;
        uint64_t t1 = ((a>>32)/b)<<32;
        uint32_t t2 = (a>>32)%b;
        uint32_t res = (uint32_t)a;
        uint32_t t1lo = 0;
        uint32_t t3 = 0xffffffffu/b;
        uint32_t t4 = t3*b;
        uint32_t hi, lo;

        while (t2 > 0) {
            __asm__ (
                "multu %2, %3\n"
                "mfhi %0\n"
                "mflo %1\n"
                : "=r" (hi), "=r" (lo)
                : "r" (t4), "r" (t2)
            );
            // yes, we are sure that t2*t3 will not overflow
            t1lo += (t3*t2);
            t2 -= hi;
            if (lo > 0) {
                t2--;   // we are sure that t2 > 0
                lo = 0xffffffff - lo + 1;
                unsigned tmp = lo + res;
                // overflow
                if (tmp < lo || tmp < res) {
                    t2++;
                }
                res = tmp;
            }
        }
        if (res >= b) {
            t1lo += (res/b);
            res = (res%b);
        }
        t1 += t1lo;
        *x = t1;
        return res;
    }

With some tests, the performance is as follows.

((uint64_t)(-1))/3, repeated 0xfffff times:

    GCC: 5555555555555555, 0, seconds: 5
    SYQ: 5555555555555555, 0, seconds: 4
    KER: 5555555555555555, 0, seconds: 8
    RAL: ffffffff, 2, seconds: 4

1. The current MIPS asm version (RAL) takes 4s (and gives a wrong result).
2. The simplest C code, a / b and a % b (GCC), takes 5s.
3. The asm-generic version (KER) takes 8s.
4. My version (SYQ) takes 4s.

So the question is: why does the asm-generic version exist at all, given
that it performs worse than the code generated by GCC?

> Huacai
> >
> > > > Huacai, thanks for your investigation! Please be more careful in
> > > > verifying your future submissions however.
> > > Sorry, I thought there is only one bug in div64.h, but in fact there
> > > are three...
> >
> > This just shows the verification you made was not good enough, hence my
> > observation.
> >
> > Maciej

--
YunQiang Su
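
A portable sketch of the loop in div64_32n() above may help when checking the algorithm on a non-MIPS host. It is only an illustration under stated assumptions: the multu/mfhi/mflo sequence is replaced by a plain 64-bit multiply, the names div64_32_portable and check_div64_32 are hypothetical (they are not kernel symbols), and b == 0 is rejected with an assert rather than handled.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Portable sketch of the loop in div64_32n(): a 64-bit multiply stands
 * in for multu/mfhi/mflo. Hypothetical name, not a kernel symbol.
 * Divides *x by b in place and returns the remainder.
 */
static uint32_t div64_32_portable(uint64_t *x, uint32_t b)
{
	assert(b != 0);                          /* assumption: caller never passes 0 */

	uint64_t a = *x;
	uint64_t q_hi = ((a >> 32) / b) << 32;   /* high word of the quotient */
	uint32_t t2 = (uint32_t)((a >> 32) % b); /* high word still to divide, < b */
	uint32_t res = (uint32_t)a;              /* low word still to divide */
	uint32_t q_lo = 0;                       /* low word of the quotient */
	uint32_t t3 = 0xffffffffu / b;           /* floor((2^32 - 1) / b) */
	uint32_t t4 = t3 * b;                    /* largest multiple of b below 2^32 */

	while (t2 > 0) {
		/* Subtract t4 * t2, a multiple of b, from the pair (t2:res). */
		uint64_t p = (uint64_t)t4 * t2;
		uint32_t hi = (uint32_t)(p >> 32);
		uint32_t lo = (uint32_t)p;

		q_lo += t3 * t2;                 /* fits: t2 < b, so t3 * t2 < 2^32 */
		t2 -= hi;
		if (lo > 0) {
			t2--;                        /* borrow from the high word */
			lo = 0xffffffffu - lo + 1;   /* 2^32 - lo */
			uint32_t tmp = lo + res;
			if (tmp < lo)                /* carry back into the high word */
				t2++;
			res = tmp;
		}
	}
	q_lo += res / b;                         /* finish with a 32-bit divide */
	res %= b;

	*x = q_hi + q_lo;
	return res;
}

/* Returns 0 when the result matches the compiler's own 64-bit / and %. */
static int check_div64_32(uint64_t a, uint32_t b)
{
	uint64_t q = a;
	uint32_t r = div64_32_portable(&q, b);

	return !(q == a / b && r == a % b);
}
```

Calling check_div64_32() over a range of dividends and divisors (including divisors close to 2^32, where the loop runs more than once) gives a quick sanity check before timing anything on real hardware. It says nothing about the relative speed of the MIPS asm version.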