On 9/1/23, David Laight <David.Laight@xxxxxxxxxx> wrote:
> From: Mateusz Guzik
>> Sent: 30 August 2023 15:03
> ...
>> Hand-rolled mov loops executing in this case are quite pessimal compared
>> to rep movsq for bigger sizes. While the upper limit depends on uarch,
>> everyone is well south of 1KB AFAICS and sizes bigger than that are
>> common.
>
> That unrolled loop is pretty pessimal and very much 1980s.
>
> It should be pretty easy to write a copy loop that runs
> at one copy (8 bytes) per clock on modern desktop x86.
> I think that matches 'rep movsq'.
> (It will run slower on Atom based cpus.)
>
> A very simple copy loop needs (using negative offsets
> from the end of the buffer):
>   A memory read
>   A memory write
>   An increment
>   A jnz
> Doing all of those every clock is well within the cpu's capabilities.
> However I've never managed a 1 clock loop.
> So you need to unroll once (and only once) to copy 8 bytes/clock.

When I was playing with this stuff about 5 years ago I found 32-byte
loops to be optimal for uarchs of the period (Skylake, Broadwell,
Haswell and so on), but only up to the point where rep wins.

> So for copies that are multiples of 16 bytes something like:
>	# dst in %rdi, src in %rsi, len in %rdx (non-zero multiple of 16)
>	add	%rdx, %rdi
>	add	%rdx, %rsi
>	neg	%rdx
> 1:
>	mov	0(%rsi, %rdx), %rcx
>	mov	%rcx, 0(%rdi, %rdx)
>	mov	8(%rsi, %rdx), %rcx
>	mov	%rcx, 8(%rdi, %rdx)
>	add	$16, %rdx
>	jnz	1b
>
> is likely to execute an iteration every two clocks.
> The memory reads and writes all get queued up and will happen at
> some point - so memory latency doesn't matter at all.
>
> For copies (over 16 bytes) that aren't multiples of
> 16 it is probably just worth copying the first 16 bytes
> and then doing 16-byte copies that align with the end
> of the buffer - copying some bytes twice.
> (Or even copy the last 16 bytes first and copy aligned
> with the start.)

It would definitely be beneficial to align the target buffer in this
routine (as in, the non-FSRM variant), but it is unclear to me whether
you are suggesting that for the mov loops or for rep. I never tested
rep for really big sizes with misaligned targets; for the sizes where
hand-rolled movs used to win against rep, spending time aligning was
more expensive than suffering the misaligned (and possibly
overlapping) stores.

If anything, I find it rather surprising how inconsistent the string
ops are -- why is memcpy using overlapping stores while memset does
not? Someone(tm) should transplant it, along with slapping rep past
a certain size on both.

--
Mateusz Guzik <mjguzik gmail.com>
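
P.S. For anyone who wants to poke at this in userspace, here is a quick
harness to sanity-check the loop quoted above against memcpy. This is
just a sketch: the copy16/copy_any names and the inline-asm wrapper are
mine, and the non-zero-multiple-of-16 restriction is assumed in copy16
rather than handled.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* The quoted loop: 16 bytes per iteration, negative offsets from the
 * ends of the buffers. Assumes len is a non-zero multiple of 16 and
 * that the buffers do not overlap. x86-64 + gcc/clang only. */
static void copy16(void *dst, const void *src, size_t len)
{
	__asm__ volatile (
		"add	%%rdx, %%rdi\n\t"	/* dst += len */
		"add	%%rdx, %%rsi\n\t"	/* src += len */
		"neg	%%rdx\n\t"		/* rdx = -len */
		"1:\n\t"
		"mov	0(%%rsi, %%rdx), %%rcx\n\t"
		"mov	%%rcx, 0(%%rdi, %%rdx)\n\t"
		"mov	8(%%rsi, %%rdx), %%rcx\n\t"
		"mov	%%rcx, 8(%%rdi, %%rdx)\n\t"
		"add	$16, %%rdx\n\t"		/* ZF set on the last iteration */
		"jnz	1b\n\t"
		: "+D" (dst), "+S" (src), "+d" (len)
		:
		: "rcx", "memory", "cc");
}

/* Sizes over 16 that aren't multiples of 16, done as suggested above:
 * copy the last 16 bytes first, then copy from the start in 16-byte
 * chunks, rewriting some bytes in the middle. */
static void copy_any(void *dst, const void *src, size_t len)
{
	memcpy((char *)dst + len - 16, (const char *)src + len - 16, 16);
	copy16(dst, src, len & ~(size_t)15);
}

int main(void)
{
	static uint8_t src[4096], dst[4096], ref[4096];
	size_t i, len;

	for (i = 0; i < sizeof(src); i++)
		src[i] = (uint8_t)rand();

	/* verify every size from 16 up against memcpy */
	for (len = 16; len <= sizeof(src); len++) {
		memset(dst, 0, sizeof(dst));
		memcpy(ref, src, len);
		if (len % 16 == 0)
			copy16(dst, src, len);
		else
			copy_any(dst, src, len);
		if (memcmp(dst, ref, len) != 0) {
			printf("mismatch at len %zu\n", len);
			return 1;
		}
	}
	printf("all sizes ok\n");
	return 0;
}

copy_any() is only there to illustrate the "copy the last 16 bytes
first, then copy aligned with the start" trick; where the cutoff for
switching to rep should sit is the uarch-dependent part.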