On 9/3/23, David Laight <David.Laight@xxxxxxxxxx> wrote:
> ...
>> When I was playing with this stuff about 5 years ago I found 32-byte
>> loops to be optimal for uarchs of the period (Skylake, Broadwell,
>> Haswell and so on), but only up to a point where rep wins.
>
> Does the 'rep movsq' ever actually win?
> (Unless you find one of the ERMS (or similar) versions.)
> IIRC it only ever does one iteration per clock - and you
> should be able to match that with a carefully constructed loop.
>

Sorry for the late reply, I missed your e-mail due to all the unrelated
traffic in the thread and using the gmail client. ;)

I am somewhat confused by the question though. In this very patch I'm
showing numbers from an ERMS-less uarch getting a win from switching
from a hand-rolled mov loop to rep movsq, while doing 4KB copies.

Now, one can definitely try to make a case that the loop is implemented
in a suboptimal manner and a better one would outperform rep. I did
note myself that such loops *do* beat rep up to a point; last I played
with this it was south of 1KB. It may be higher than that today.

Normally I don't have access to this particular hw, but I can get it
again.

I can't stress enough that one *can* beat rep movsq as plopped in right
here, but only up to a certain size.

Since you are questioning whether movsq wins at /any/ size vs an
optimal loop, I suggest you hack it up, show the win on 4KB on your
uarch of choice and then I'll be happy to get access to the hw I tested
my patch on to bench your variant.

That said, as CPUs get better they execute these loops faster.
Interestingly, rep is not improving at an equivalent rate, which is
rather funny. Even then, there is a limit past which rep wins AFAICS.

-- 
Mateusz Guzik <mjguzik gmail.com>
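
For anyone who wants to try the comparison suggested above, here is a
minimal userspace sketch of the two copy variants. It is an illustration
only, not the code from the patch. The names copy_loop32, copy_rep_movsq
and ns_per_copy are made up for this example. The rep movsq path assumes
x86-64 with GCC/clang-style inline asm and falls back to memcpy
elsewhere. The compiler is free to rewrite the C loop (vectorize it, or
even turn it into a memcpy call), so a serious benchmark would write the
loop in assembly and use a better timer than clock().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

typedef void (*copy_fn)(void *dst, const void *src, size_t len);

/* Hand-rolled variant: moves 32 bytes per iteration.
 * len must be a multiple of 32. The compiler may transform this loop. */
static void copy_loop32(void *dst, const void *src, size_t len)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    assert(len % 32 == 0);
    for (; len != 0; len -= 32, d += 32, s += 32) {
        uint64_t q[4];

        memcpy(q, s, sizeof q);   /* load 32 bytes */
        memcpy(d, q, sizeof q);   /* store 32 bytes */
    }
}

/* rep movsq variant: copies len / 8 quadwords.
 * len must be a multiple of 8. Assumes the direction flag is clear,
 * as the x86-64 ABI requires at function entry. */
static void copy_rep_movsq(void *dst, const void *src, size_t len)
{
    assert(len % 8 == 0);
#if defined(__x86_64__) && defined(__GNUC__)
    size_t qwords = len / 8;

    __asm__ volatile("rep movsq"
                     : "+D"(dst), "+S"(src), "+c"(qwords)
                     :
                     : "memory");
#else
    memcpy(dst, src, len);        /* fallback on other targets */
#endif
}

/* Crude timing helper: average CPU time per call, in nanoseconds. */
static double ns_per_copy(copy_fn fn, void *dst, const void *src,
                          size_t len, long iters)
{
    clock_t start = clock();

    for (long i = 0; i < iters; i++)
        fn(dst, src, len);

    clock_t end = clock();

    return (double)(end - start) / CLOCKS_PER_SEC * 1e9 / (double)iters;
}
```

Calling ns_per_copy() with each variant over a range of sizes (say 256
bytes up to 4KB and beyond) on the uarch of interest is one way to look
for the crossover point discussed above. No numbers are given here
because they depend entirely on the hardware and the compiler.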