...
> When I was playing with this stuff about 5 years ago I found 32-byte
> loops to be optimal for uarchs of the period (Skylake, Broadwell,
> Haswell and so on), but only up to a point where rep wins.

Does the 'rep movsq' ever actually win?
(Unless you find one of the ERMS (or similar) versions.)
IIRC it only ever does one iteration per clock - and you
should be able to match that with a carefully constructed loop.

Many years ago I got my Athlon-700 to execute a copy loop
as fast as 'rep movs' - but the setup times were longer.

The killer for 'rep movs' setup was always P4-netburst - over 40 clocks.
But I think some of the more recent CPUs are still in double figures
(apart from some optimised copies).
So I'm not actually sure you should ever need to switch
to a 'rep movsq' loop - but I've not tried to write it.

I did have to unroll the ip-cksum loop 4 times (as):

+	asm(	"	bt    $4, %[len]\n"
+		"	jnc   10f\n"
+		"	add   (%[buff], %[len]), %[sum_0]\n"
+		"	adc   8(%[buff], %[len]), %[sum_1]\n"
+		"	lea   16(%[len]), %[len]\n"
+		"10:	jecxz 20f\n"	// %[len] is %rcx
+		"	adc   (%[buff], %[len]), %[sum_0]\n"
+		"	adc   8(%[buff], %[len]), %[sum_1]\n"
+		"	lea   32(%[len]), %[len_tmp]\n"
+		"	adc   16(%[buff], %[len]), %[sum_0]\n"
+		"	adc   24(%[buff], %[len]), %[sum_1]\n"
+		"	mov   %[len_tmp], %[len]\n"
+		"	jmp   10b\n"
+		"20:	adc   %[sum_0], %[sum]\n"
+		"	adc   %[sum_1], %[sum]\n"
+		"	adc   $0, %[sum]\n"

in order to get one adc every clock.
But that was only needed because of the strange loop required to
'loop carry' the carry flag (the 'loop' instruction is OK on AMD
CPUs, but not on Intel).
A similar loop using adox and adcx will beat one read/clock
provided it is unrolled again.
(IIRC I got to about 12 bytes/clock.)

	David
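
For reference, a minimal portable C sketch of what the adc chain above computes: a 64-bit ones-complement (end-around-carry) sum held in two accumulators and folded at the end. This is not the patch itself and makes no claim about speed. The names add_eac(), csum_words_sketch() and csum_fold_sketch() are hypothetical, the length is assumed to be a multiple of 8 bytes, and each accumulator folds its own carry immediately instead of passing it down one shared carry chain as the asm does (the result is the same modulo 2^64 - 1).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Add with end-around carry: the job done by 'adc' plus the final 'adc $0'. */
static uint64_t add_eac(uint64_t sum, uint64_t val)
{
	sum += val;
	return sum + (sum < val);	/* fold the carry-out back in */
}

/*
 * Hypothetical helper: ones-complement sum of 'len' bytes
 * (assumed to be a multiple of 8) using two accumulators.
 */
static uint64_t csum_words_sketch(const void *buff, size_t len, uint64_t sum)
{
	const unsigned char *p = buff;
	uint64_t sum_0 = 0, sum_1 = 0, w;
	size_t i;

	/* Main loop: one 8-byte word into each accumulator per pass. */
	for (i = 0; i + 16 <= len; i += 16) {
		memcpy(&w, p + i, 8);
		sum_0 = add_eac(sum_0, w);
		memcpy(&w, p + i + 8, 8);
		sum_1 = add_eac(sum_1, w);
	}
	/* Odd trailing 8-byte word, if any. */
	if (i + 8 <= len) {
		memcpy(&w, p + i, 8);
		sum_0 = add_eac(sum_0, w);
	}
	/* Merge the accumulators, as the code at label 20 does. */
	sum = add_eac(sum, sum_0);
	return add_eac(sum, sum_1);
}

/* Fold the 64-bit sum down to 16 bits (not inverted). */
static uint16_t csum_fold_sketch(uint64_t sum)
{
	sum = (sum & 0xffffffff) + (sum >> 32);
	sum = (sum & 0xffffffff) + (sum >> 32);
	sum = (sum & 0xffff) + (sum >> 16);
	sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}
```

The two accumulators are what allow independent additions to overlap. The asm needs the odd loop shape only because it must keep the hardware carry flag alive across iterations, which this C version avoids by paying for a compare on every add.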