From: Maciej W. Rozycki
> Sent: 17 March 2021 15:36
..
> > Not that I grok the mips opcodes.
> > But that code has horridness on its side.
>
> It's a 32-bit one's-complement addition.  The use of 64-bit operations
> reduces the number of calculations as any 32-bit carries accumulate in the
> high 32-bit word allowing one instruction to be saved total compared to
> the 32-bit variant.  Nothing particularly unusual for me here; I've seen
> worse stuff with x86.

The 'problem' is that mips doesn't have a carry flag.
So the 64-bit maths is 'tricky'.

It may well be that a loop based on:
	do {
		val = *ptr++;
		sum += val;
		carry += sum < val;
	} while (ptr != limit);
will generate much better code.
I think there is a 'setlt' instruction for the compare.
It certainly would on the nios (which is mips-like).

That is (probably) 6 instructions for 4 bytes.
I suspect there may be a data stall after the memory read.
So an interleaved unroll would remove that stall.
That would be 10 clocks for 8 bytes.

The x86-64 code is 'interesting'.
It has repeated 'add carry' instructions.
On Intel cpus prior to (at least) Haswell they take two clocks each.
So the code is no faster than adding 32-bit values to a 64-bit sum.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
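
For reference, the carry-counting loop suggested above can be fleshed out roughly as follows. This is only a sketch, not the kernel's code: the name csum_words(), the word-count parameter and the final folding steps are assumptions added for illustration. It uses a while loop rather than do/while so an empty buffer is tolerated, and it ignores alignment and trailing bytes. On mips the unsigned 'sum < val' compare would presumably map to sltu.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: one's-complement sum of 32-bit words, folded to
 * 16 bits.  The carry out of each 32-bit add is detected with an
 * unsigned compare instead of a carry flag.
 */
static uint16_t csum_words(const uint32_t *ptr, size_t nwords)
{
	const uint32_t *limit = ptr + nwords;
	uint32_t sum = 0;
	uint32_t carry = 0;

	while (ptr != limit) {
		uint32_t val = *ptr++;

		sum += val;
		carry += sum < val;	/* 1 if the add wrapped, else 0 */
	}

	/* Add the counted carries back in (end-around carry). */
	uint64_t total = (uint64_t)sum + carry;

	/* Fold 64 -> 32 bits; one fold is enough as total < 2^33. */
	total = (total & 0xffffffffu) + (total >> 32);

	/* Fold 32 -> 16 bits; the second fold absorbs the last carry. */
	total = (total & 0xffff) + (total >> 16);
	total = (total & 0xffff) + (total >> 16);

	return (uint16_t)total;
}
```

Called on the two words 0xffffffff and 0x00000001, the 32-bit add wraps once, so sum ends up 0 with carry 1 and the folded result is 1. An interleaved unroll would keep two such sum/carry pairs and add them together before the folds.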