On Thu, Jul 23, 2020 at 01:54:47PM +0000, David Laight wrote:
> From: Al Viro
> > Sent: 22 July 2020 18:39
> > I would love to see your patch, anyway, along with the testcases and performance
> > comparison.
>
> See attached program.
> Compile and run (as root): csum_iov 1
>
> Unpatched (as shipped) 16 vectors of 1 byte take ~430 clocks on my haswell cpu.
> With dsl_patch defined they take ~393.
>
> The maximum throughput is ~1.16 clocks/word for 16 vectors of 1k.
> For longer vectors the data gets lost from the cache between the iterations.
>
> On an older Ivy Bridge cpu it never goes faster than 2 clocks/word.
> (Due to the implementation of ADC.)
>
> The absolute limit is 1 clock/word - limited by the memory write.
> I suspect that is achievable on Haswell with much less loop unrolling.
>
> I had to replace the ror32() with __builtin_bswap32().
> The kernel object do contain the 'ror' instruction - even though I
> didn't find the asm for it.

First of all,
; git grep -n -w ror32|grep '\.h:'
include/linux/bitops.h:109: * ror32 - rotate a 32-bit value right
include/linux/bitops.h:113:static inline __u32 ror32(__u32 word, unsigned int shift)
include/net/checksum.h:81:	sum = ror32(sum, 8);
; grep -A3 ror32 include/linux/bitops.h
 * ror32 - rotate a 32-bit value right
 * @word: value to rotate
 * @shift: bits to roll
 */
static inline __u32 ror32(__u32 word, unsigned int shift)
{
	return (word >> (shift & 31)) | (word << ((-shift) & 31));
}
; cat >/tmp/a.c <<'EOF'
unsigned f(unsigned n)
{
	return (n >> 8) | (n << 24);
}
EOF
; gcc -c -O2 /tmp/a.c -o /tmp/a.o
; objdump -d /tmp/a.o

/tmp/a.o:     file format elf64-x86-64

Disassembly of section .text:

0000000000000000 <f>:
   0:	89 f8                	mov    %edi,%eax
   2:	c1 c8 08             	ror    $0x8,%eax
   5:	c3                   	retq
;

which ought to cover _that_ question.  Takes a couple of minutes, but that's
a trivial side issue.

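For anyone who wants to see the same thing from userland without touching
objdump, here is a minimal sketch.  It copies the ror32() expression quoted
above, with uint32_t standing in for __u32 (an assumption made so it builds
outside the kernel tree), and compares it with the open-coded rotate from
/tmp/a.c.  check_ror32() is a hypothetical helper name, not anything in the
kernel.

```c
#include <assert.h>
#include <stdint.h>

/* Same expression as ror32() in include/linux/bitops.h; uint32_t is
 * used here in place of the kernel's __u32 (userland assumption). */
static inline uint32_t ror32(uint32_t word, unsigned int shift)
{
	return (word >> (shift & 31)) | (word << ((-shift) & 31));
}

/* The open-coded rotate-right-by-8 from /tmp/a.c. */
static inline uint32_t f(uint32_t n)
{
	return (n >> 8) | (n << 24);
}

/* Hypothetical helper: asserts that both forms agree on a few samples. */
static void check_ror32(void)
{
	static const uint32_t samples[] = {
		0u, 1u, 0x12345678u, 0x80000000u, 0xffffffffu
	};
	unsigned int i;

	for (i = 0; i < sizeof(samples) / sizeof(samples[0]); i++)
		assert(ror32(samples[i], 8) == f(samples[i]));
}
```

Both forms are the shift-and-or idiom, which is what gcc -O2 turned into a
single ror in the disassembly above; no asm is needed in the source for that.
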
That said, what you've printed for 1-byte segments (and that's going to be
seriously affected by the setup costs in csum-copy.S, sensitive to calling
convention changes) is the time to run the 16-iteration loop divided by
1 * 16 / 8; IOW, your difference for 16 iterations here is 37*2 = 74 cycles,
with the per-iteration diff being a bit under 5 cycles.  Which is not
implausible, but
	1) extrapolating to other compiler versions, flags, etc. is not obvious
	2) the effects of calling convention changes need to be taken into
account
	3) for copying to/from userland the effects of calling convention
changes are even larger, and the kernel is certainly not going to issue kvec
iters of _that_ sort, TYVM.
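
The arithmetic behind the 74 and the "bit under 5", spelled out as a sketch.
It assumes the printed figures are clocks per 8-byte word (that's where the
/ 8 comes from); total_cycles() and per_iteration_diff() are hypothetical
helper names used only for this illustration.

```c
#include <assert.h>

/* The test prints clocks per 8-byte word; multiply back by the number
 * of words to get the total cycles for the whole loop. */
static double total_cycles(double clocks_per_word, unsigned int nvec,
			   unsigned int bytes_per_vec)
{
	double words = (double)nvec * bytes_per_vec / 8.0;

	return clocks_per_word * words;
}

/* Difference in total cycles, spread over the nvec loop iterations. */
static double per_iteration_diff(double unpatched, double patched,
				 unsigned int nvec,
				 unsigned int bytes_per_vec)
{
	double diff = total_cycles(unpatched, nvec, bytes_per_vec) -
		      total_cycles(patched, nvec, bytes_per_vec);

	return diff / nvec;
}
```

With the reported numbers, 16 vectors of 1 byte is 2 words, so
total_cycles(430, 16, 1) - total_cycles(393, 16, 1) is 37 * 2 = 74, and
per_iteration_diff(430, 393, 16, 1) is 74 / 16 = 4.625.
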