On Thu, Jul 23, 2020 at 03:53:42PM +0100, Al Viro wrote:

> Said that, what you've printed for 1-byte segments (and that's going to be
> seriously affected by the setup costs in csum-copy.S, sensitive to calling
> convention changes) is time to run the 16-iteration loop divided by 1 * 16 / 8;
> IOW, your difference for 16 iterations here is 37*2 = 74 cycles.  With
> per-iteration diff being a bit under 5 cycles.  Which is not implausible,
> but
>	1) extrapolating to other compiler versions, flags, etc. is not obvious
>	2) the effects of calling convention changes need to be taken into account
>	3) for copying to/from userland the effects of calling convention changes
> are even larger, and kernel is certainly not going to issue kvec iters of _that_
> sort, TYVM.

To clarify it a bit: the effects of the calling convention change are
mostly due to not passing (and saving) those error pointers, and that
could be had with "pass the initial sum in" - just start these iov_iter.c
loops with sum = ~0U and we get the same guarantees re not getting 0 in
absence of faults.

The point is, your "~4.5 cycles per vector" is pretty much noise, and the
difference between the 3-argument and 4-argument variants could easily be
in the same range.  It might be a valid microoptimization, it might not be.
The 3-argument variant is simpler and IMO, in the absence of strong data,
we ought to go with that.
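
To illustrate the "sum = ~0U" point, here is a minimal userspace sketch; it
is not the kernel code.  The names csum_add32() and csum_words() are made up
for the example, and it assumes the accumulation is plain 32-bit ones'
complement addition (add, then fold the carry back in).  Under that
assumption a nonzero running sum can never become 0, so a loop seeded with
~0U cannot return 0 on its own:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not the kernel's csum_add(): 32-bit ones' complement
 * addition, i.e. add and fold the carry out of bit 31 back into bit 0.
 */
static uint32_t csum_add32(uint32_t sum, uint32_t addend)
{
	uint32_t res = sum + addend;

	/* res < addend exactly when the 32-bit addition wrapped around */
	return res + (res < addend);
}

/*
 * Accumulate words starting from a caller-supplied seed.  With a nonzero
 * seed such as ~0U the running sum stays nonzero: nonzero + anything, with
 * end-around carry, cannot come out as 0.
 */
static uint32_t csum_words(const uint32_t *words, size_t n, uint32_t seed)
{
	uint32_t sum = seed;
	size_t i;

	for (i = 0; i < n; i++) {
		sum = csum_add32(sum, words[i]);
		assert(seed == 0 || sum != 0);
	}
	return sum;
}
```

With seed 0, all-zero data sums to 0; with seed ~0U the same data sums to
0xffffffff, which is the other ones' complement representation of zero.
That is what would leave a return value of 0 free to mean "fault" without
an extra error pointer argument.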