...

> We can do better than this! By inspection this looks like a performance
> regression. The generic version of csum_fold in
> include/asm-generic/checksum.h is better than this so should be used
> instead.

Yes, that got changed for 6.8-rc1 (I pretty much suggested the patch)
but I hadn't noticed Linus had applied it.
That C version is (probably) not worse than any of the asm versions
except sparc32 - which has a carry flag but no rotate.
(It is better than the x86-64 asm one.)

...

> This doesn't leverage add with carry well. This causes the code size of this
> to be dramatically larger than the original assembly, which I assume
> nicely correlates to an increased execution time.

It is pretty much impossible to do add with carry from C.
So an asm adc block is pretty much always going to win.
For csum_partial and short to moderate length buffers on x86
it is hard to beat:
	10: adc, adc, dec, jnz 10b
which (on modern Intel cpus at least) does 8 bytes/clock.
You can get 12 bytes/clock but it only really wins for 256+ bytes.
(See the current x86-64 version.)

For cpus without a carry flag it is likely that a common C function
will be pretty much optimal on all architectures.
(Or maybe a couple of implementations based on the actual cpu
implementation - not the architecture.)
Mostly I don't think you can beat 4 instructions/word, but they
will pipeline, so with multi-issue you might get a read/clock.

Arm's barrel shifter might give 3:
	v = *p; x += v; y += v >> 32;

	David
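For anyone following along, here is a rough userspace sketch of the two C ideas mentioned above. It is not the kernel code. The names (ror16_u32, fold_rotate, fold_classic, sum_no_carry) are made up for illustration, the rotate-based fold is my understanding of what the generic csum_fold does, and the way x and y are recombined is an assumption about how the "x += v, y += v >> 32" loop is meant to be finished off with 64-bit loads.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Rotate right by 16; userspace stand-in for the kernel's ror32(x, 16). */
static inline uint32_t ror16_u32(uint32_t x)
{
	return (x >> 16) | (x << 16);
}

/*
 * Fold a 32-bit partial sum to the inverted 16-bit checksum using one
 * rotate and one subtract. The high half of (sum + ror(sum)) is the
 * end-around-carry sum of the two halves; the ~ and - do the inversion.
 */
static inline uint16_t fold_rotate(uint32_t sum)
{
	return (uint16_t)((~sum - ror16_u32(sum)) >> 16);
}

/* Textbook fold: add the halves twice, then invert. Reference only. */
static inline uint16_t fold_classic(uint32_t sum)
{
	sum = (sum & 0xffff) + (sum >> 16);
	sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)~sum;
}

/*
 * Sum n 64-bit words into a 32-bit ones-complement partial sum without
 * using a carry flag. x is allowed to wrap; y keeps the exact sum of the
 * high halves so the lost carries can be reconstructed afterwards.
 */
static uint32_t sum_no_carry(const uint64_t *p, size_t n)
{
	uint64_t x = 0, y = 0, s;
	size_t i;

	/* Keeps both half-sums below 2^63 so their total fits in 64 bits. */
	assert(n <= ((size_t)1 << 31));

	for (i = 0; i < n; i++) {
		uint64_t v = p[i];

		x += v;		/* wraps mod 2^64 */
		y += v >> 32;	/* exact sum of the high 32-bit halves */
	}

	/* x - (y << 32) is the exact sum of the low 32-bit halves. */
	s = (x - (y << 32)) + y;

	/* Reduce 64 -> 32 bits with end-around carry. */
	s = (s & 0xffffffff) + (s >> 32);
	s = (s & 0xffffffff) + (s >> 32);
	return (uint32_t)s;
}
```

The inner loop is a load, two adds and a shift, which is where the 4 instructions/word figure comes from; on a cpu that can fold the shift into the add it would be three. Something like fold_rotate(sum_no_carry(buf, n)) then gives the final 16-bit value for a buffer of n 64-bit words.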