On 9/1/23, David Laight <David.Laight@xxxxxxxxxx> wrote:
> From: Mateusz Guzik
>> Sent: 30 August 2023 15:03
> ...
>> Hand-rolled mov loops executing in this case are quite pessimal compared
>> to rep movsq for bigger sizes. While the upper limit depends on uarch,
>> everyone is well south of 1KB AFAICS and sizes bigger than that are
>> common.
>
> That unrolled loop is pretty pessimal and very much 1980s.
>
> It should be pretty easy to write a copy loop that runs
> at one copy (8 bytes) per clock on modern desktop x86.
> I think that matches 'rep movsq'.
> (It will run slower on Atom based cpus.)
>
> A very simple copy loop needs (using negative offsets
> from the end of the buffer):
>   A memory read
>   A memory write
>   An increment
>   A jnz
> Doing all of those every clock is well within the cpu's capabilities.
> However I've never managed a 1 clock loop.
> So you need to unroll once (and only once) to copy 8 bytes/clock.

When I was playing with this stuff about 5 years ago I found 32-byte
loops to be optimal for uarchs of the period (Skylake, Broadwell,
Haswell and so on), but only up to the point where rep wins.

> So for copies that are multiples of 16 bytes something like:
>	# dst in %rdi, src in %rsi, len in %rdx (non-zero multiple of 16)
>	add	%rdx, %rdi
>	add	%rdx, %rsi
>	neg	%rdx
> 1:
>	mov	0(%rsi, %rdx), %rcx
>	mov	%rcx, 0(%rdi, %rdx)
>	mov	8(%rsi, %rdx), %rcx
>	mov	%rcx, 8(%rdi, %rdx)
>	add	$16, %rdx
>	jnz	1b
>
> is likely to execute an iteration every two clocks.
> The memory reads and writes all get queued up and will happen at
> some point - so memory latency doesn't matter at all.
>
> For copies (over 16 bytes) that aren't multiples of
> 16 it is probably just worth copying the first 16 bytes
> and then doing 16-byte copies that align with the end
> of the buffer - copying some bytes twice.
> (Or even copy the last 16 bytes first and copy aligned
> with the start.)

It would definitely be beneficial to align the target buffer in this
routine (as in, the non-FSRM variant), but it is unclear to me whether
you are suggesting that for the mov loops or for rep. I never tested
rep for really big sizes with misaligned targets; for the sizes where
hand-rolled movs used to win against rep, spending time aligning was
more expensive than suffering the misaligned (and possibly
overlapping) stores.

If anything, I find it rather surprising how inconsistent the string
ops are -- why is memcpy using overlapping stores while memset does
not? Someone(tm) should transplant it, along with slapping rep past
a certain size on both.

--
Mateusz Guzik <mjguzik gmail.com>
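
P.S. For anyone who wants to poke at this in userspace, here is a quick
harness to sanity-check the loop quoted above against memcpy. This is
just a sketch: the copy16/copy_any names and the inline-asm wrapper are
mine, and the non-zero-multiple-of-16 restriction is assumed in copy16
rather than handled.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* The quoted loop: 16 bytes per iteration, negative offsets from the
 * ends of the buffers. Assumes len is a non-zero multiple of 16 and
 * that the buffers do not overlap. x86-64 + gcc/clang only. */
static void copy16(void *dst, const void *src, size_t len)
{
	__asm__ volatile (
		"add	%%rdx, %%rdi\n\t"	/* dst += len */
		"add	%%rdx, %%rsi\n\t"	/* src += len */
		"neg	%%rdx\n\t"		/* rdx = -len */
		"1:\n\t"
		"mov	0(%%rsi, %%rdx), %%rcx\n\t"
		"mov	%%rcx, 0(%%rdi, %%rdx)\n\t"
		"mov	8(%%rsi, %%rdx), %%rcx\n\t"
		"mov	%%rcx, 8(%%rdi, %%rdx)\n\t"
		"add	$16, %%rdx\n\t"		/* ZF set on the last iteration */
		"jnz	1b\n\t"
		: "+D" (dst), "+S" (src), "+d" (len)
		:
		: "rcx", "memory", "cc");
}

/* Sizes over 16 that aren't multiples of 16, done as suggested above:
 * copy the last 16 bytes first, then copy from the start in 16-byte
 * chunks, rewriting some bytes in the middle. */
static void copy_any(void *dst, const void *src, size_t len)
{
	memcpy((char *)dst + len - 16, (const char *)src + len - 16, 16);
	copy16(dst, src, len & ~(size_t)15);
}

int main(void)
{
	static uint8_t src[4096], dst[4096], ref[4096];
	size_t i, len;

	for (i = 0; i < sizeof(src); i++)
		src[i] = (uint8_t)rand();

	/* verify every size from 16 up against memcpy */
	for (len = 16; len <= sizeof(src); len++) {
		memset(dst, 0, sizeof(dst));
		memcpy(ref, src, len);
		if (len % 16 == 0)
			copy16(dst, src, len);
		else
			copy_any(dst, src, len);
		if (memcmp(dst, ref, len) != 0) {
			printf("mismatch at len %zu\n", len);
			return 1;
		}
	}
	printf("all sizes ok\n");
	return 0;
}

copy_any() is only there to illustrate the "copy the last 16 bytes
first, then copy aligned with the start" trick; where the cutoff for
switching to rep should sit is the uarch-dependent part.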