From: Linus Torvalds
> Sent: 12 September 2023 21:48
>
> On Tue, 12 Sept 2023 at 12:41, David Laight <David.Laight@xxxxxxxxxx> wrote:
> >
> > What I found seemed to imply that 'rep movsq' used the same internal
> > logic as 'rep movsb' (pretty easy to do in hardware)
>
> Christ.
>
> I told you. It's pretty easy in hardware AS LONG AS IT'S ALIGNED.
>
> And if it's unaligned, "rep movsq" is FUNDAMENTALLY HARDER.

For cached memory it only has to appear to have used 8 byte accesses.
So in the same way that 'rep movsb' could be optimised to do
cache line sized reads and writes even if the addresses are
completely misaligned, 'rep movsq' could use exactly the same
hardware logic with a byte count that is 8 times larger.

The only subtlety is that the read length would need masking
to a multiple of 8 if there is a page fault on a misaligned
read side (so that a multiple of 8 bytes would be written).
That wouldn't really be hard.

I definitely saw exactly the same number of bytes/clock
for 'rep movsb' and 'rep movsq' when the destination was
misaligned.
The alignment made no difference except that a multiple
of 32 ran (about) twice as fast.
I even double-checked the disassembly to make sure I was
running the right code.

So it looks like the Intel hardware engineers have solved
the 'FUNDAMENTALLY HARDER' problem.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
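
The "masking to a multiple of 8" point above is just arithmetic on the
byte count, and a rough sketch in C may make it concrete. This is only
an illustration of the idea, not a description of what any CPU actually
does internally; the helper names movsq_committable_bytes() and
movsq_remaining_count() are made up for this example.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (an assumption for illustration only): given how
 * many source bytes could be read before the faulting page, return how
 * many bytes a 'rep movsq'-style copy could commit so that only whole
 * 8-byte elements are ever written.
 */
static size_t movsq_committable_bytes(size_t readable_bytes)
{
    /* Clear the low three bits: round down to a multiple of 8. */
    return readable_bytes & ~(size_t)7;
}

/*
 * Hypothetical helper: the element count that would still be
 * outstanding when the fault is reported, given the original count in
 * quadwords and the number of bytes readable before the fault.
 */
static size_t movsq_remaining_count(size_t total_qwords, size_t readable_bytes)
{
    size_t done = movsq_committable_bytes(readable_bytes) / 8;

    assert(done <= total_qwords);
    return total_qwords - done;
}
```

So with 13 bytes readable before the page boundary, only 8 would be
written and the copy would restart from the second quadword once the
fault has been handled.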