On Mon, 11 Sept 2023 at 03:38, David Laight <David.Laight@xxxxxxxxxx> wrote:
>
> The overhead of 'rep movsb' is about 36 clocks, 'rep movsq' only 16.

Note that the hard case for 'rep movsq' is when the stores cross a
cacheline (or worse yet, a page) boundary. That is what makes 'rep
movsb' fundamentally simpler in theory.

The natural reaction is "but movsq does things 8 bytes at a time", but
once you start doing any kind of optimization that is actually based on
bigger areas, byte counts are actually simpler. You can always do them
as masked writes up to whatever boundary you like, and just restart.
There are never any "what about the straddling bytes" issues (see the
sketch at the end of this mail).

That's one of the dangers with benchmarking. Do you benchmark the
unaligned cases? How much do they matter in real life? Do they even
happen?

And that's entirely ignoring any "cold vs hot caches" etc issues, or
the "what is the cost of access _after_ the memcpy/memset".

Or, in the case of the kernel, our issues with "function calls can now
be surprisingly expensive, and if we can inline things it can win back
20 cycles from a forced mispredict".

(And yes, I mean _any_ function calls. The indirect function calls are
even worse and more widely horrific, but sadly, with the return
prediction issues, even a perfectly regular function call is no longer
"a cycle or two".)

So beware microbenchmarks. That's true in general, but it's
_particularly_ true of memset/memcpy.

              Linus
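
A minimal sketch of the "masked writes up to a boundary, then restart"
idea above - not the kernel's actual memset, just an illustration that
assumes AVX-512BW is available, and the function name is made up:

#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Byte-count based memset: every store is masked off at the next
 * 64-byte boundary, so no individual store ever straddles a cacheline,
 * and there is no separate head/tail special case.
 */
static void memset_masked(void *dst, int c, size_t n)
{
	unsigned char *p = dst;
	__m512i v = _mm512_set1_epi8((char)c);

	while (n) {
		/* Bytes until the next 64-byte boundary (or the end). */
		size_t chunk = 64 - ((uintptr_t)p & 63);

		if (chunk > n)
			chunk = n;

		/* Mask covering only the low 'chunk' bytes of the store. */
		__mmask64 k = (chunk == 64) ? ~0ULL : ((1ULL << chunk) - 1);

		_mm512_mask_storeu_epi8(p, k, v);
		p += chunk;
		n -= chunk;
	}
}

The same shape works for a copy: the byte count and the mask carry all
the alignment information, so there is nothing to patch up when the
region starts or ends mid-word.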