Re: [PATCH v2 04/17] x86/cpu/intel: Fix the movsl alignment preference for extended Families

On Tue, 11 Feb 2025 12:26:48 -0800
Dave Hansen <dave.hansen@xxxxxxxxx> wrote:

> We should really rename intel_workarounds() to make it more clear that
> it's 32-bit only. But I digress...
> 
> On 2/11/25 11:43, Sohil Mehta wrote:
> > The alignment preference for 32-bit movsl based bulk memory move has
> > been 8-byte for a long time. However this preference is only set for
> > Family 6 and 15 processors.
> > 
> > Extend the preference to upcoming Family numbers 18 and 19 to maintain
> > legacy behavior. Also, use a VFM based check instead of switching based
> > on Family numbers. Refresh the comment to reflect the new check.  
> "Legacy behavior" is not important here. If anyone is running 32-bit
> kernel binaries on their brand new CPUs they (as far as I know) have a
> few screws loose. They don't care about performance or security and we
> shouldn't care _for_ them.
> 
> If the code yielded the "wrong" movsl_mask.mask for 18/19, it wouldn't
> matter one bit.
> 
> The thing that _does_ matter is someone auditing to figure out whether
> the code comprehends families>15 or whether it would break in horrible
> ways. The new check is shorter and it's more obvious that it will work
> forever.

For any Intel non-Atom processor since Ivy Bridge the only alignment
that makes a real difference is aligning the destination to a 32-byte
boundary. That makes it twice as fast (32 bytes/clock rather than 16).
The source alignment never matters.
(I've got access to one of the later 64-bit 8-core Atoms - but can't
remember how it behaves.)

For short (IIRC 1..32) byte transfers the cost is constant.
The cost depends on the cpu; Ivy Bridge is something like 40 clocks,
lower for later cpus.
(Unlike the P4, where the overhead is some 163 clocks.)
It also makes no difference whether you do 'rep movsb' or 'rep movsq'.

For any of those cpus I'm not sure it is ever worth using anything
other than 'rep movsb' unless the length is known to be very short,
likely a multiple of 4/8, and preferably constant.
Doing a function call and one or two mispredictable branches will
soon eat into the overhead, not to mention displacing code from the
I-cache.
Unless you are micro-optimising a very hot path it really isn't worth
doing anything else.
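
To put that in perspective, the entire copy can be inlined as a couple
of instructions; a minimal sketch (mine, not the kernel's memcpy):

#include <stddef.h>

/*
 * Minimal 'rep movsb' copy.  On cpus with ERMS/FSRM the simplest code
 * is often the fastest, and inlining it avoids the call and the
 * mispredictable size/alignment branches.
 */
static inline void *rep_movsb_copy(void *dst, const void *src, size_t len)
{
	void *ret = dst;

	asm volatile("rep movsb"
		     : "+D" (dst), "+S" (src), "+c" (len)
		     :
		     : "memory");
	return ret;
}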

OTOH even some recent AMD cpus are reported not to have FSRM and will
execute 'rep movsb' slowly.
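
For reference: FSRM is CPUID leaf 7, sub-leaf 0, EDX bit 4; the kernel
calls it X86_FEATURE_FSRM.  A quick userspace check is something like:

#include <cpuid.h>
#include <stdio.h>

/* Report the FSRM (fast short 'rep movsb') bit: leaf 7.0, EDX bit 4. */
int main(void)
{
	unsigned int eax, ebx, ecx, edx;

	if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
		return 1;
	printf("fsrm: %s\n", (edx & (1u << 4)) ? "yes" : "no");
	return 0;
}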

I did 'discover' that code at the weekend; just the memory load to
get the mask is going to slow things down.
Running a benchmark test it'll be in cache and the branch predictor
will remember what you are doing.
Come in with a 'cold cache' and (IIRC) Intel cpus have a 50% chance of
predicting a branch taken (there is no static prediction, e.g.
backwards taken).
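
The kind of check I mean looks roughly like this - my paraphrase of
the 32-bit usercopy path from memory, not the actual kernel source:

/*
 * Paraphrased from memory, details will differ from the real code:
 * the alignment mask set up by intel_workarounds() has to be loaded
 * from memory and tested (a data load plus a conditional branch)
 * before each bulk copy.
 */
struct movsl_mask {
	int mask;	/* 7 when the 8-byte preference is set, 0 otherwise */
};
static struct movsl_mask movsl_mask;

static int movsl_is_ok(unsigned long src, unsigned long dst, unsigned long len)
{
	if (len >= 64 && ((src | dst) & movsl_mask.mask))
		return 0;	/* misaligned: take the slower path */
	return 1;
}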

Even for normal memory accesses I've not seen any significant
slowdown from misalignment.
Accesses that cross a cache line might end up being 2 uops, but the
cpu can do two reads/clock (with a following wind) and it is hard to
write a code loop that gets close to sustaining that.

I'll have tested the IP checksum (adc loop) code with misaligned buffers.
I don't even remember a significant slowdown for the version that does
three memory reads every two clocks (which seems to be the limit).

I actually suspect that any copies that matter are aligned, so the
cost of the check far outweighs the benefit across all the calls.

One optimisation that seems to be absent is that if you are doing a
register copy loop, any trailing bytes can be copied by doing a
misaligned copy of the last word (and I mean word, not 16 bits) of
the buffer - copying a few bytes twice.
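
Something along these lines - a sketch of mine, assuming
non-overlapping buffers and len >= 8:

#include <stddef.h>
#include <string.h>

/*
 * Copy 8 bytes at a time, then finish with one possibly misaligned,
 * possibly overlapping copy of the final 8 bytes instead of a
 * byte-by-byte tail loop.  Assumes len >= 8 and non-overlapping
 * buffers; the fixed-size memcpy()s typically compile to single
 * 64-bit loads/stores.
 */
static void copy_with_overlapping_tail(void *dst, const void *src, size_t len)
{
	unsigned char *d = dst;
	const unsigned char *s = src;
	size_t i;

	for (i = 0; i + 8 <= len; i += 8)
		memcpy(d + i, s + i, 8);

	/* Re-copy the last 8 bytes; a few bytes may be copied twice. */
	memcpy(d + len - 8, s + len - 8, 8);
}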

	David
